If you're using: Workshare Professional/Compare 8
When Workshare compares image-based PDFs that require OCR, the comparison may not be as accurate as when comparing documents that don't require OCR. It often depends on the quality of the image-based PDF that is being compared.
Workshare uses Solid Documents OCR technology. This technology was developed over the last 10 years so, unlike legacy OCR engines, its development did not focus on legacy types of documents such as output from facsimile machines, dot matrix printouts or manual typewriter documents.
Assumptions:
The assumptions behind Workshare OCR include:
- The scan is of a reasonably high resolution and quality - 300dpi is optimal but 200dpi is also reasonable. Anything less than 200dpi will lead to less reliable results. Recent scanning devices are a lot better than they were 10-15 years ago.
- The original document was emitted from either a laser printer or an inkjet printer (not a fax, or a dot matrix, or a typewriter).
- The page images were compressed sensibly, using lossless compression such as CCITT, JBIG2 or lossless JPEG/JPEG 2000, and not lossy ("fuzzy") JPEG, which is good for photographs but not much else.
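The resolution assumption above can be checked before a scan is submitted for comparison. The following is a minimal sketch, not part of Workshare or Solid Documents: the function names are hypothetical, and treating anything above 300dpi as "optimal" is an assumption, since the text only names 300dpi and 200dpi explicitly.

```python
def effective_dpi(pixels: int, inches: float) -> float:
    """Resolution along one axis: pixel count divided by physical size in inches."""
    if inches <= 0:
        raise ValueError("inches must be positive")
    return pixels / inches


def rate_resolution(dpi: float) -> str:
    """Map a resolution to the bands described in the assumptions above.

    Assumption: resolutions above 300dpi are grouped with 300dpi as "optimal".
    """
    if dpi >= 300:
        return "optimal"
    if dpi >= 200:
        return "reasonable"
    return "less reliable"


if __name__ == "__main__":
    # A US Letter page (8.5 inches wide) scanned at three different pixel widths.
    for width_px in (2550, 1700, 1275):
        dpi = effective_dpi(width_px, 8.5)
        print(f"{width_px}px wide -> {dpi:.0f}dpi: {rate_resolution(dpi)}")
```

For example, a page that is 8.5 inches wide and 1275 pixels across works out to 150dpi, which falls below the 200dpi threshold and is likely to give less reliable OCR results.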
The vast majority of professional document users are in the same environment - either Windows or Mac and probably Microsoft Office. With this in mind, the focus of Workshare OCR is:
- Round-trip to Word: An emphasis on business documents over, say, newspaper or magazine articles when it comes to initial image processing and zoning of text, images and graphics.
- Popular fonts: While the engine is not font-specific, the testing and accuracy refinement uses model documents based on a popular font set which includes Arial, Calibri, Times, Cambria, Courier, Consolas and OCR-A. In addition, testing also covers Arial Narrow, Arial Black, Trebuchet, Palatino Linotype, Bookman Old Style and Book Antiqua, based on the observed frequency of font use in a vast set of customer sample files.
- OCR languages: The progression of adding OCR languages has been led by regional sales of Solid Documents desktop products. The latest additions are four Scandinavian languages and Turkish.
For further information about Solid Documents OCR technology, refer to
http://www.soliddocuments.com/solid-ocr.htm
Legacy Documents:
Despite the initial focus on "modern" documents, Solid Documents has also been gradually passing more of the legacy "ISRI" test set (used for an OCR shoot-out between major players in the mid-1990s), and its regression testing gets thousands of the legacy "ISRI" images 100% correct. These test cases are not "modern" documents. They include typewriter output, magazine and newspaper articles, etc. They are also lower quality scans than one would get from a more recent scanning device. Read more about the ISRI legacy tests here:
http://www.expervision.com/testimonial-world-leading-and-champion-ocr/annual-test-of-ocr-accuracy-by-us-department-of-energy-doe-university-of-nevada-las-vegas-unlv
Future:
Workshare OCR capabilities will continue to improve as the Solid Documents OCR engine is developed further. Workshare strives for perfection in all areas of comparison, and improvements work best when there are specific real-world examples to work with. If you have scanned documents that you think would assist in the improvement process, then please share them with us.