16

For the past 3 months I've been trying to train Tesseract
to recognize a collection of images I have. Due to a real lack
of proper documentation and a very high level of complexity, I'm starting to
give up on Tesseract as a solution.

I'm looking for an alternative that would be relatively pain-free
to train; I'm not looking to reinvent the wheel here.

If there isn't anything free, I guess a paid solution would
have to do (nothing above $200).

Asaf
  • Can you describe your task? Prices for commercial OCR may vary heavily depending on volume, functionality, etc. – Tomato Apr 01 '11 at 15:16
  • Scanning about 200-300 documents in a similar format, with a need to train the OCR engine manually so that recognition accuracy is as close to 100% as possible – Asaf Apr 01 '11 at 20:45

2 Answers

6

Based on your comment, all you need is to scan a relatively small number of documents with almost 100% accuracy, and your budget is about $200.

Well, the answer is simple then. You don't need a programming solution. Just buy a quality commercial OCR product, e.g. ABBYY FineReader (disclaimer: I work for ABBYY). Its price differs by region, but I guess it is somewhere around your budget.

A commercial desktop OCR product will give you almost 100% accuracy out of the box on typical languages. These products also have convenient manual verification tools for fixing the remaining errors. They typically support a whole variety of modern fonts, and if your font is unusual, they have a font training utility for that.

I do think that is the optimal solution for you.

UPDATE: Linux platform. Unfortunately, there is almost no choice of high-quality OCR products for Linux, sorry. The only one I know of is from ABBYY: http://ocr4linux.com/en:start but it has no UI, verification, or font training. Still, you can at least give it a try to see whether it gives you good enough accuracy as it is, which may well be the case.

Muaz Usmani
Tomato
  • My OS at home is Ubuntu, could that be a problem regarding FineReader? – Asaf Apr 05 '11 at 20:25
  • Tesseract works pretty well on "typical languages," the point of training is almost always to deal with non-typical languages... –  May 24 '12 at 18:24
2

You can use jTessBoxEditor to edit the box files you generate. Bundled with it is a PowerShell script to automate box file and final .traineddata file generation.
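
For anyone scripting the box file and .traineddata steps by hand, here is a rough Python sketch of the usual Tesseract 3.x command sequence. It is not the script bundled with jTessBoxEditor. The helper names `font_properties_line` and `training_commands` are made up for illustration, the `lang.font.expN` file naming is assumed, and the exact flags differ between Tesseract versions, so check them against the training docs for your release. Newer 3.x releases expect a font_properties file with one line per font: the font name followed by five 0/1 flags (italic, bold, fixed, serif, fraktur).

```python
def font_properties_line(font, italic=False, bold=False, fixed=False,
                         serif=False, fraktur=False):
    """Build one font_properties line: <font> <italic> <bold> <fixed> <serif> <fraktur>."""
    flags = (italic, bold, fixed, serif, fraktur)
    return font + " " + " ".join(str(int(flag)) for flag in flags)


def training_commands(lang, font, exp=0):
    """Return the training commands as argument lists (nothing is executed here).

    Assumes the image is named <lang>.<font>.exp<N>.tif and that a
    font_properties file sits in the working directory.
    """
    base = f"{lang}.{font}.exp{exp}"
    return [
        # 1. Generate an initial box file; correct it in a box editor afterwards.
        ["tesseract", f"{base}.tif", base, "batch.nochop", "makebox"],
        # 2. Produce the .tr feature file from the image and the corrected box file.
        ["tesseract", f"{base}.tif", base, "nobatch", "box.train"],
        # 3. Extract the character set from the box file.
        ["unicharset_extractor", f"{base}.box"],
        # 4. Shape and prototype training; -F points at font_properties.
        ["mftraining", "-F", "font_properties", "-U", "unicharset",
         "-O", f"{lang}.unicharset", f"{base}.tr"],
        ["cntraining", f"{base}.tr"],
        # 5. Before this step, rename the generated files (inttemp, normproto,
        #    pffmtable, shapetable) so they carry the "<lang>." prefix.
        ["combine_tessdata", f"{lang}."],
    ]


if __name__ == "__main__":
    print(font_properties_line("arial"))
    for cmd in training_commands("eng", "arial"):
        print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the box file has been corrected. The sketch only prints the commands, so you can review them first.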

nguyenq
  • 1
    I tried this and it didn't work. I believe it's because Tesseract now requires a font_properties file, which it didn't previously. – gsgx May 10 '12 at 18:53