Google has just announced work on OCRopus, an OCR system that will be available under the Apache 2.0 License. There may well be search and image search implications from OCRopus. According to the announcement: ‘The goal of the project is to advance the state of the art in optical character recognition and related technologies, and to deliver a high quality OCR system suitable for document conversions, electronic libraries, vision impaired users, historical document analysis, and general desktop use. In addition, we are structuring the system in such a way that it will be easy to reuse by other researchers in the field.’
The announcement continues: ‘The project is expected to run for three years and support three Ph.D. students or postdocs. We are announcing a technology preview release of the software under the Apache license (English-only, combining the Tesseract character recognizer with IUPR layout analysis and language modeling tools), with additional recognizers and functionality in future releases.’
It would be interesting to learn how this application compares in accuracy and power with commercial OCR systems (which have apparently gotten much better since the days when I used to get very frustrated with OmniPage and the like).