Understanding the Indexing Process
Understanding how the indexing process works helps when troubleshooting search and indexing issues. The sections that follow outline different aspects of this process.
Types of Files Indexed
IFilters, property handlers, and the Windows property system are used to extract text from documents so that they can be indexed. Microsoft provides IFilters and property handlers for many common document formats by default, while installing other Microsoft applications may also install additional IFilters and property handlers to allow indexing of additional properties and content for documents created by these applications. In addition, third-party vendors may provide their own IFilters and property handlers for indexing proprietary document formats.
IFilters and property handlers are selected on the basis of the file's extension. IFilters understand file formats, whereas property handlers typically just understand file properties. For example, files having the extension .txt are scanned using the Plain Text filter, while files having the .doc extension are scanned using the Office filter and files having the .mp3 extension are scanned using the Audio property handler. All of these extensions are additionally scanned with the Windows property system to extract basic properties, such as file name and size. The Plain Text filter emits full-text content only because text files do not have extended properties (metadata). The Office filter, however, emits both full-text content and metadata because .doc files and other Office files can have extended properties such as Title, Subject, Authors, Date Last Saved, and so on.
Table-1 below lists common document formats, their associated file extensions, and the IFilter dynamic-link library (DLL) included in Windows 7 that is used to scan each type of document. (Table-2 then provides similar information for property handlers.) Note that the indexer scans files based on their file extension, not the type of content within the file. For example, a text file named Test.txt will have its contents scanned and indexed by the Plain Text filter, but a text file named Test.doc will not. Instead, the Office filter will be used to scan the file, and it will expect the file to be a .doc file and not a text file.
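The extension-based selection described above can be illustrated with a minimal Python sketch. This is not how the Windows indexer is implemented; the dictionary is a small excerpt of Table-1, and the names IFILTER_BY_EXTENSION and select_ifilter are hypothetical, introduced only for illustration.

```python
import os

# Illustrative excerpt of Table-1 (extension -> IFilter DLL).
# Assumption: a plain dictionary stands in for the indexer's real registration data.
IFILTER_BY_EXTENSION = {
    ".txt": "Query.dll",
    ".rtf": "RTFfilt.dll",
    ".doc": "Offfilt.dll",
    ".xls": "Offfilt.dll",
    ".htm": "Nlhtml.dll",
    ".xml": "Xmlfilt.dll",
}


def select_ifilter(file_name):
    """Pick an IFilter DLL from the file extension alone.

    The file's contents are never inspected, which mirrors the behavior
    described in the text: a text file renamed to .doc is still handed
    to the Office filter.
    """
    _, ext = os.path.splitext(file_name)
    return IFILTER_BY_EXTENSION.get(ext.lower())


if __name__ == "__main__":
    for name in ("Test.txt", "Test.doc"):
        print(name, "->", select_ifilter(name))
```

Because the lookup key is the extension only, Test.txt and Test.doc resolve to different filters even if both files contain the same plain text.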
Note In Windows Vista, just over one hundred different file extensions were excluded by default from being indexed, including .bin, .chk, .log, .manifest, .tmp, and so on. Beginning with Windows 7, however, the indexer no longer excludes any file extensions by default. This change was made because many of these exclusions were no longer needed, while others had a good probability of reducing the relevance of search results. Some of these exclusions had also been in place to deal with performance issues that could arise if the files were indexed. For instance, .log files can be updated very frequently, which in Windows Vista would have caused the indexer to index them repeatedly. Support for smart retry indexing, which was added in Windows 7, mitigates the impact of this type of issue. For more information concerning smart retry indexing, see the sidebar titled "Direct from the Source: Indexing and Libraries-Hard Disk Drives vs. Removable Storage" later in this tutorial.
Table-1 IFilters Included in Windows 7 by Document Format and File Extension
Document Format | File Extension | IFilter DLL |
Plain Text | .a, .ans, .asc, .asm, .asx, .bas, .bat, .bcp, .c, .cc, .cls, .cmd, .cpp, .cs, .csa, .csv, .cxx, .dbs, .def, .dic, .dos, .dsp, .dsw, .ext, .faq, .fky, .h, .hpp, .hxx, .i, .ibq, .ics, .idl, .idq, .inc, .inf, .ini, .inl, .inx, .jav, .java, .js, .kci, .lgn, .lst, .m3u, .mak, .mk, .odh, .odl, .pl, .prc, .rc, .rc2, .rct, .reg, .rgs, .rul, .s, .scc, .sol, .sql, .tab, .tdl, .tlh, .tli, .trg, .txt, .udf, .udt, .usr, .vbs, .viw, .vspscc, .vsscc, .vssscc, .wri, .wtx | Query.dll |
Rich Text Format (RTF) | .rtf | RTFfilt.dll |
Microsoft Office Document | .doc, .dot, .pot, .pps, .ppt, .xlb, .xlc, .xls, .xlt | Offfilt.dll |
WordPad | .docx, .odt | WordpadFilter.dll |
Multipurpose Internet Mail Extensions (MIME) | .dll | Mimefilt.dll |
Hypertext Markup Language (HTML) | .ascx, .asp, .aspx, .css, .hhc, .hta, .htm, .html, .htt, .htw, .htx, .odc, .shtm, .shtml, .sor, .srf, .stm, .wdp, .vcproj | Nlhtml.dll |
MIME HTML | .mht, .mhtml | Mimefilt.dll |
Extensible Markup Language (XML) | .csproj, .user, .vbproj, .vcproj, .xml, .xsd, .xsl, .xslt | Xmlfilt.dll |
Favorites | .url | ieframe.dll |
Journal | .jnt | Jntfiltr.dll |
XML Paper Specification (XPS) | .dwfx, .easmx, .edrwx, .eprtx, .jtx, .xps | Mscoree.dll |
Table-2 Property Handlers Included in Windows 7 by Document Format and File Extensions
Document Format | File Extension | Property Handler DLL |
Contacts | .contact | Wab32.dll |
System | .cpl, .dll, .exe, .ocx, .rll, .sys | Shell32.dll |
Fonts | .fon, .otf, .ttc, .ttf | Shell32.dll |
.Group Shell Extension | .group | Wab32.dll |
Application Reference | .appref-ms | Dfshim.dll |
Audio/Video Media | .3gp, .3gp2, .3gpp, .aac, .adts, .asf, .avi, .dvr-ms, .m1v, .m2t, .m2ts, .m2v, .m4a, .m4b, .m4p, .m4v, .mod, .mov, .mp2, .mp2v, .mp3, .mp4, .mp4v, .mpe, .mpeg, .mpg, .mpv2, .mts, .ts, .tts, .vob, .wav, .wma, .wmv | Mf.dll |
Internet Shortcut | .url | Ieframe.dll |
Images | .bmp, .dib, .gif, .ico, .jfif, .jpe, .jpeg, .jpg, .png, .rle, .tif, .tiff, .wdp | PhotoMetadataHandler.dll |
Installer | .msi, .msm, .msp, .mst, .pcp | Propsys.dll |
Library Folder | .library-ms | Shell32.dll |
Microsoft XPS | .xps, .dwfx, .easmx, .edrwx, .eprtx, .jtx | Xpsshhdr.dll |
Microsoft Office Document | .doc, .dot, .pot, .ppt, .xls, .xlt, .msg | Propsys.dll |
Property Labels | .label | Shdocvw.dll |
Search Connector | .searchConnector-ms | Shell32.dll |
Search Folder | .search-ms | Shdocvw.dll |
Shell Messages | .eml, .nws | Inetcomm.dll |
Shortcut | .lnk | Shell32.dll |
Media Center Recorded TV | .wtv | Sbe.dll |
In Windows 7, all of the file types (extensions) listed in Table-2 are enabled for indexing by default. Note, however, that the Plain Text filter will scan files having the extension .txt but not files having the extension .log, even though the filter supports scanning of .log files. To configure the indexer to scan such files using the default filter, see the section "Modifying IFilter Behavior" later in this tutorial.
Two additional (implicit) IFilters and their extensions are not shown in Table-2:
- File Properties filter This filter is used to index the file system properties only of files for which there is no registered IFilter or for which there is a registered IFilter but the user has explicitly gone into Control Panel and selected the Index Properties Only option for the extension. File extensions that use this filter include .cat, .evt, .mig, .msi, .pif, and about 300 other types of files. Note that the File Properties filter isn't really a filter per se, but instead represents the absence of a registered filter for these extensions. In other words, it relies on the File System Protocol Handler to provide the file properties.
- Null filter This filter extracts the same properties as a File Properties filter and is used to deal with backward compatibility issues with older methods for registering IFilters. Again, this is not really a filter per se and relies upon the File System Protocol Handler to provide the file properties. The file extensions that use the Null filter are .386, .aif, .aifc, .aiff, .aps, .art, .asf, .au, .avi, .bin, .bkf, .bmp, .bsc, .cab, .cda, .cgm, .cod, .com, .cpl, .cur, .dbg, .dct, .desklink, .dib, .dl_, .dll, .drv, .emf, .eps, .etp, .ex_, .exe, .exp, .eyb, .fnd, .fnt, .fon, .ghi, .gif, .gz, .hqx, .icm, .ico, .ilk, .imc, .in_, .inv, .jbf, .jfif, .jpe, .jpeg, .jpg, .latex, .lib, .m14, .m1v, .mapimail, .mid, .midi, .mmf, .mov, .movie, .mp2, .mp2v, .mp3, .mpa, .mpe, .mpeg, .mpg, .mpv2, .mv, .mydocs, .ncb, .obj, .oc_, .ocx, .pch, .pdb, .pds, .pic, .pma, .pmc, .pml, .pmr, .png, .psd, .res, .rle, .rmi, .rpc, .rsp, .sbr, .sc2, .scd, .sch, .sit, .snd, .sr_, .sy_, .sym, .sys, .tar, .tgz, .tlb, .tsp, .ttc, .ttf, .url, .vbx, .vxd, .wav, .wax, .wll, .wlt, .wm, .wma, .wmf, .wmp, .wmv, .wmx, .wmz, .wsz, .wvx, .xix, .z, .z96, .zfsendtotarget, and .zip.
Note Beginning with Windows 7, the name Null Filter no longer appears in the Indexing Options Control Panel. Instead, extensions that use this IFilter are shown as associated with the File Properties Filter. You can tell that the Null IFilter is being used for a file extension only by looking up the appropriate entry in the registry. This change was made in Windows 7 because the name "Null Filter" was confusing to some users.
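The fallback behavior of the File Properties filter can be sketched as a simple decision rule. The Python below is a hedged illustration, not the indexer's actual logic: REGISTERED_IFILTERS and PROPERTIES_ONLY are hypothetical stand-ins for the registered filters and for the extensions where a user has selected Index Properties Only in Control Panel.

```python
import os

# Assumption: a tiny, made-up subset of registered IFilters (see Table-1).
REGISTERED_IFILTERS = {".txt": "Plain Text", ".doc": "Office", ".rtf": "RTF"}

# Assumption: extensions for which the user selected "Index Properties Only".
PROPERTIES_ONLY = {".rtf"}


def indexing_mode(file_name,
                  registered=REGISTERED_IFILTERS,
                  properties_only=PROPERTIES_ONLY):
    """Describe how a file would be indexed under the rule in the text.

    Properties only: no IFilter is registered for the extension, or one is
    registered but the user opted for properties-only indexing.
    Otherwise the registered IFilter scans the contents as well.
    """
    ext = os.path.splitext(file_name)[1].lower()
    if ext not in registered or ext in properties_only:
        return "File Properties filter (properties only)"
    return f"{registered[ext]} filter (properties and contents)"


if __name__ == "__main__":
    for name in ("setup.msi", "letter.rtf", "notes.txt"):
        print(name, "->", indexing_mode(name))
```

In this sketch the File Properties filter is not an entry in the table at all; it is simply what results when the lookup finds nothing usable, which matches the description of it as the absence of a registered filter.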
The Windows Search service can be enhanced by installing the Microsoft Filter Pack, which provides additional IFilters to support critical search scenarios across multiple Microsoft Search products. The Filter Pack includes the following IFilters:
- Metro (.docx, .docm, .pptx, .pptm, .xlsx, .xlsm, .xlsb)
- Visio (.vdx, .vsd, .vss, .vst, .vsx, .vtx)
- OneNote (.one)
- Zip (.zip)
These IFilters are designed to provide enhanced search functionality for the following products: SPS2003, MOSS2007, Search Server 2008, Search Server 2008 Express, WSSv3, Exchange Server 2007, SQL Server 2005, SQL Server 2008, and WDS 3.01.
When you install the Filter Pack, the IFilters in the preceding list are installed and registered with the Windows Search service. Note that the Filter Pack does not need to be installed if Office 2007 is installed. The Filter Pack is available from http://www.microsoft.com/downloads/details.aspx?FamilyId=60C92A37-719C-4077-B5C6-CAC34F4227CC&displaylang=en for both x86 and x64 versions of Windows 7, Windows Vista, Windows Server 2008 R2, Windows Server 2008, Windows XP, and Windows Server 2003.