I need to check a tonne of pictures to see if they have a keyword on them. Can anyone recommend a good, reliable OCR library? I'll happily sacrifice speed for accuracy.
2 Answers
25
There are no pure Java OCR libraries that deliver real accuracy. Depending on your budget, you may choose something that is not purely Java but can be called from Java:
- If you have plenty of time but zero budget, your choice is Tesseract. It is definitely the best among the open-source engines.
- If you have a small budget and only need to run this recognition once, a Cloud OCR API service would be your best choice. It is based on a leading commercial-grade OCR engine and offers quite affordable per-project prices. Disclaimer: I work for ABBYY.
- If you need to run this recognition as an ongoing process, it may be more economical to purchase dedicated conversion software, for example this one; it has an API and can be called from Java too. There are actually a lot of alternatives if you are prepared to invest some budget in licensing.
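For the zero-budget route, here is a minimal sketch of calling Tesseract from Java without any binding library. It assumes the `tesseract` executable is installed and on the `PATH`, and it relies on the CLI convention that passing `stdout` as the output base prints the recognized text instead of writing a file. The class name `TesseractKeywordCheck` and its helper methods are made up for this illustration, and Java 11 or later is assumed.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of any library.
class TesseractKeywordCheck {

    // Command line for the Tesseract CLI; "stdout" as the output base
    // makes it print the recognized text instead of writing a file.
    static List<String> buildCommand(Path image) {
        return List.of("tesseract", image.toString(), "stdout");
    }

    // Case-insensitive search that collapses whitespace first, because OCR
    // output often contains stray line breaks and repeated spaces.
    static boolean containsKeyword(String ocrText, String keyword) {
        String haystack = ocrText.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
        String needle = keyword.toLowerCase(Locale.ROOT).trim().replaceAll("\\s+", " ");
        return !needle.isEmpty() && haystack.contains(needle);
    }

    // Runs Tesseract on one image and returns whatever it printed.
    static String recognize(Path image) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(image))
                .redirectError(ProcessBuilder.Redirect.DISCARD) // drop diagnostics
                .start();
        String text = new String(process.getInputStream().readAllBytes(),
                StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        return text;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: TesseractKeywordCheck <keyword> <image>...");
            return;
        }
        for (int i = 1; i < args.length; i++) {
            boolean hit = containsKeyword(recognize(Path.of(args[i])), args[0]);
            System.out.println(args[i] + ": " + (hit ? "match" : "no match"));
        }
    }
}
```

The keyword check is deliberately forgiving about case and whitespace only. If the images are noisy, preprocessing them before recognition is likely to matter more than anything done on the Java side.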
3 FYI: Tesseract is poor; way too much preprocessing is needed. Though it is open source, it's better to just spend the money you need for accurate processing. Accurate OCR is just one of those requirements that is "pay to play". – Jeryl Cook Aug 20 '16 at 08:19
2
If you plan to recognize symbols other than Latin letters or digits, it is better not to look for a Java library at all. Instead, pick one of the external tools and get the text from it some other way (1). On Linux I have used Cuneiform (2) through its command-line interface.
(1) A command-line interface and a pipe, for example.
(2) Cuneiform has been ported to Linux, but I don't know whether its command-line interface works on Windows.
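As a rough sketch of the command-line approach, the Java snippet below runs Cuneiform as an external process and reads the text back from a temporary file. The `-l` (language) and `-o` (output file) options are given as I remember them from the Linux port, so treat them as an assumption and check them against `cuneiform --help` on your system. The class name `CuneiformRunner`, the UTF-8 output encoding, and Java 11 or later are also assumptions.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper, not part of any library.
class CuneiformRunner {

    // Assumed CLI shape: cuneiform -l <language> -o <output file> <image>
    static List<String> buildCommand(String language, Path output, Path image) {
        return List.of("cuneiform", "-l", language,
                "-o", output.toString(), image.toString());
    }

    // Runs the tool, then reads the recognized text from the temporary file.
    static String recognize(String language, Path image)
            throws IOException, InterruptedException {
        Path output = Files.createTempFile("ocr-", ".txt");
        try {
            Process process = new ProcessBuilder(buildCommand(language, output, image))
                    .redirectErrorStream(true)
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD) // ignore console chatter
                    .start();
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException("cuneiform exited with code " + exitCode);
            }
            // Assumption: the tool writes its text output as UTF-8.
            return Files.readString(output, StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(output);
        }
    }
}
```

A call such as `CuneiformRunner.recognize("rus", Path.of("scan.png"))` would then return the recognized text, provided that language code is one your installation supports.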

Michael Kazarian