How To: OCR TIFF Files During SharePoint 2010 Search Crawl

Did you know that SharePoint 2010 can perform optical character recognition (OCR) on TIFF images during a crawl so that their text shows up in search results? A few simple configuration changes make it easier to find scanned documents without relying solely on metadata or filenames.

Update 3/21/2014: If you are looking to do this with SharePoint 2013 check out the blog post here.

Update 6/2/2011:  If you are using FAST Search for SharePoint 2010 then you will want to follow the steps located here.

Windows 7 and Windows Server 2008 R2 provide a TIFF IFilter that can perform OCR on TIFF images. To use it with SharePoint 2010, you only need to enable the feature in Windows and then make a small group policy update on the server.

To install Windows TIFF IFilter with Windows Server 2008 R2:

1. Click Start, click All Programs, click Administrative Tools, and then click Server Manager.

2. In the console tree of Server Manager, click Features, and then in Feature Summary, click Add Features.

3. Select the Windows TIFF IFilter check box, and then click Next.

4. Click Install.
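
If you would rather script the install than click through Server Manager, something like the following sketch may work. It only wraps the ServerManager PowerShell cmdlet Add-WindowsFeature from Python. The feature name "TIFF-IFilter" is an assumption here, so confirm it with Get-WindowsFeature on your own server first. The function names are hypothetical and only for illustration.

```python
import subprocess

# Assumption: "TIFF-IFilter" is the Server Manager name of the
# Windows TIFF IFilter feature. Verify it with Get-WindowsFeature.
TIFF_IFILTER_FEATURE = "TIFF-IFilter"


def build_install_command(feature_name=TIFF_IFILTER_FEATURE):
    """Build the powershell.exe command line that adds a Windows feature."""
    script = "Import-Module ServerManager; Add-WindowsFeature " + feature_name
    return ["powershell.exe", "-NoProfile", "-Command", script]


def install_feature(feature_name=TIFF_IFILTER_FEATURE):
    """Run the install (needs an elevated prompt) and return the exit code."""
    return subprocess.call(build_install_command(feature_name))
```

Calling install_feature() from an elevated prompt on the server should be roughly equivalent to the Server Manager steps above. Nothing runs until you call it.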

After the IFilter has been installed, make the following group policy changes to set the OCR languages and behavior.

To set preferred OCR languages:

1. Open the Local Group Policy Editor as follows: Click Start, type gpedit.msc in the Start Search text box, and then press ENTER.

2. Under Computer Configuration, expand Administrative Templates.

3. Expand Windows Components, expand Search, and then click OCR.

4. Double-click Select OCR languages from a code page.

5. Click Enable, and then select one or more languages.

6. Click OK.

Forcing optical character recognition of every page of a TIFF image document
This setting bypasses Windows TIFF IFilter performance optimization mechanisms that are designed to skip the OCR processing for images that do not contain text.

To force OCR of every page of a TIFF image document:

1. Open the Local Group Policy Editor as follows: Click Start, type gpedit.msc in the Start Search text box, and then press ENTER.

2. Under Computer Configuration, expand Administrative Templates.

3. Expand Windows Components, expand Search, and then click OCR.

4. Double-click Force TIFF IFilter to OCR every page in a TIFF document.

5. Click Enable.

6. Click OK.

To make sure all of the settings are fully applied, reboot the server and then start a full content crawl in SharePoint. Once the crawl has completed, you should be able to search for text located within TIFF images stored in SharePoint.

If you upload additional TIFF images they will not become searchable until SharePoint completes the next crawl.

From my testing, it appears that the OCR works well with compressed TIFF files, but I have not had much luck with uncompressed TIFF files. I scanned a document four times:

  1. Uncompressed
  2. Compressed Black and White
  3. Compressed Gray-scale
  4. Compressed Full Color

After uploading those images and doing a full crawl, I was able to find text in the compressed files but not in the uncompressed file.
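
Since the compressed scans were the ones that worked, it can help to check how a TIFF was saved before uploading it. The sketch below reads the Compression tag (tag 259 in the TIFF 6.0 specification) from the first page of a file using only the Python standard library. The function name is hypothetical, it handles classic TIFF only (not BigTIFF), and it looks at the first page only, so treat it as a rough check rather than a full TIFF parser.

```python
import struct

COMPRESSION_TAG = 259  # "Compression" tag from the TIFF 6.0 specification

# A few common values of the Compression tag.
COMPRESSION_NAMES = {
    1: "uncompressed",
    4: "CCITT Group 4",
    5: "LZW",
    7: "JPEG",
    32773: "PackBits",
}


def tiff_compression(data):
    """Return the Compression tag value of the first page of a TIFF."""
    if data[:2] == b"II":
        endian = "<"  # little-endian ("Intel") byte order
    elif data[:2] == b"MM":
        endian = ">"  # big-endian ("Motorola") byte order
    else:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file")
    # The first image file directory (IFD) starts with a 2-byte entry
    # count, followed by 12-byte entries: tag, type, count, value.
    (entry_count,) = struct.unpack_from(endian + "H", data, ifd_offset)
    for i in range(entry_count):
        entry = ifd_offset + 2 + 12 * i
        (tag,) = struct.unpack_from(endian + "H", data, entry)
        if tag == COMPRESSION_TAG:
            # A single SHORT value sits in the first 2 bytes of the value field.
            (value,) = struct.unpack_from(endian + "H", data, entry + 8)
            return value
    return 1  # the specification's default when the tag is absent
```

To use it, read the file in binary mode and pass the bytes in, for example tiff_compression(open(path, "rb").read()), where path is your own file. A result of 1 means the first page is uncompressed, and COMPRESSION_NAMES gives a readable label for a few other common values.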

One thing to note: OCR is not 100% accurate and depends greatly on the quality of the TIFF image. Handwriting in a document will have a very low (or zero) recognition rate, as will documents that have text on some colored backgrounds. Even with these limitations, enabling OCR for TIFF images can greatly increase the usability of scanned images in SharePoint.

Leave a comment if you are using this method.  I would like to hear more about what types of TIFF files work best for you.

5 thoughts on “How To: OCR TIFF Files During SharePoint 2010 Search Crawl”

  1. Cool stuff! Thanks Mike. I noticed the TIFF option but never tested it out. Good to know it can work. PDF is the biggie but maybe people scan to TIFF.

  2. For an archive solution we have to deal with quite a lot of TIFF files that need to be searchable. Enabling SharePoint 2010 to search for text in these files reduces the time we need to migrate the archive solution to SharePoint 2010.

    Your steps for enabling this functionality work as expected.

    Thanks Mike!
