Tesseract confuses two numbers

Question

I'm writing an application to scan numbers from an image.

The numbers are using the OCR-B font and may also contain + and > characters.

This is my source image:

source image

The scans using Tesseract weren't very good, even when limiting the character set to the characters mentioned above. Since I couldn't find any OCR-B training files for Tesseract, I decided to train it myself.

I created this training image and made a box file from it. The box file is correct: every character is matched correctly.

Then I did all steps described here to create the other necessary files.

Using this newly trained OCR-B tessdata set, I get pretty good results on the source image, with one small bug: all 1s are recognized as 8s and vice versa. The command used to process the image was

$ tesseract esr2c.tif ocrb-esr2c -l ocrb

and the output for the source image was

0800000001456>8 00000195731208 8 01050008 023+ 08 0301226>20

If you swap all 1s and 8s and compare the result to the source image, the output is correct (except for the last two characters, which I can ignore).
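
The swap is easy to check mechanically. A minimal Python sketch follows (the function name is made up for illustration; it only verifies the observation and is not a fix for the training data):

```python
# Translation table that maps "1" to "8" and "8" to "1" in a single pass.
SWAP_1_8 = str.maketrans("18", "81")


def swap_ones_and_eights(text: str) -> str:
    """Return text with every 1 replaced by 8 and every 8 replaced by 1."""
    return text.translate(SWAP_1_8)


# OCR output quoted above.
ocr_output = "0800000001456>8 00000195731208 8 01050008 023+ 08 0301226>20"
print(swap_ones_and_eights(ocr_output))
```

A translation table is used instead of two chained `replace` calls because the second `replace` would undo the first.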

How could this happen? Did I make a mistake in the training process? How can I fix it?

@andrew not really. just an old, invalid bill without any personal information in the reference id. — Danilo Bargen, Sep 03 '11 at 13:33
@DaniloBargen: If possible, can you share the training data for OCRB font? — Ravi Gupta, Jan 04 '16 at 08:42
@RaviGupta I don't have it anymore, and the results weren't good anyways. — Danilo Bargen, Jan 04 '16 at 18:10
Hi, sorry for reviving this, but fast forward 5 years and then another year: did that training help you get correct results? I mean, did you continue to use Tesseract? — Marko, Feb 15 '17 at 00:01

score 6 · Accepted Answer · answered Sep 03 '11 at 16:53

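It's likely that your box file contains incorrect values (characters) for 1 and 8 somewhere. You can verify this with the jTessBoxEditor program. If so, correct them, regenerate the language data file, and try again. (Per the comments on this answer, the cause in this case turned out to be that the first four entries in the box file were out of position and sequence.)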

nguyenq

I can't get jTessBoxEditor to work (some issues with the imageio library), but I checked the box file with [OwlBoxer](http://code.google.com/p/owlboxer/) (I actually processed the file using that tool) and everything looks correct. – Danilo Bargen Sep 03 '11 at 18:06
I also just double-checked the box file using tesseractTrainer.py, and still didn't find any errors. – Danilo Bargen Sep 03 '11 at 18:13
Can you post a link to your box file? Which Tesseract version are you training for? – nguyenq Sep 04 '11 at 06:27
I'm using Tesseract 2.04. Here's the box file and the corresponding image: http://w00t.ws/boxfile.tar.gz – Danilo Bargen Sep 04 '11 at 10:22
Right off the bat, I can see in jTessBoxEditor that the first four characters (lines) are out of position and sequence in the box file. Please try moving them to the correct position and sequence, then continue with the remaining training steps. (Sorry, I haven't set up a place where I can upload files.) – nguyenq Sep 04 '11 at 13:02
Wow, that did it! I didn't think the order would make any difference :) Thanks for your help! You might want to mention the solution in your answer. – Danilo Bargen Sep 04 '11 at 13:44
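
For anyone who wants to check the box order without a GUI editor, here is a rough Python sketch. It assumes the Tesseract 2.04 box format of one `char left bottom right top` entry per line, and it flags entries that start to the left of the previous entry on the same text line. The function names and the sample data are hypothetical, and this heuristic would miss other kinds of misordering.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int


def parse_box_file(text: str) -> List[Box]:
    """Parse box entries, assuming 'char left bottom right top' per line."""
    boxes = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        left, bottom, right, top = (int(p) for p in parts[1:5])
        boxes.append(Box(parts[0], left, bottom, right, top))
    return boxes


def out_of_order(boxes: List[Box]) -> List[int]:
    """Return indices of boxes that start left of their predecessor
    while vertically overlapping it (i.e. on the same text line)."""
    suspects = []
    for i in range(1, len(boxes)):
        prev, cur = boxes[i - 1], boxes[i]
        same_line = cur.bottom < prev.top and prev.bottom < cur.top
        if same_line and cur.left < prev.left:
            suspects.append(i)
    return suspects


# Hypothetical sample: the "1" entry is listed after the "8" entry
# although it sits further left on the same line.
sample = "0 10 10 20 30\n8 40 10 50 30\n1 25 10 35 30\n"
print(out_of_order(parse_box_file(sample)))
```

Any index it reports is only a candidate for manual inspection in a box editor.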

score 2 · Answer 2 · edited Jan 23 '12 at 07:35

I trained Tesseract 2.04 for the OCR-A Extended font after a month of effort. It works very well, with accuracy above 90% at font size 14.

The training image should be a high-contrast image. In the GIMP image editor, do the following:

1. Open Colors -> Info -> Histogram and read the Std Dev value.
2. Open Colors -> Threshold and enter that Std Dev value as the threshold value.
3. Save the image and use it for training.
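
The same thresholding step could be sketched in Python on raw grayscale values. This is only an illustration under assumptions: it uses the population standard deviation, which may not match exactly what GIMP reports, it treats values at or above the cutoff as white, and the function name is hypothetical.

```python
from statistics import pstdev
from typing import List, Tuple


def threshold_at_std_dev(pixels: List[int]) -> Tuple[List[int], float]:
    """Binarize 8-bit grayscale values, using their standard deviation
    as the cutoff: values at or above it become 255, the rest become 0."""
    cutoff = pstdev(pixels)
    binary = [255 if p >= cutoff else 0 for p in pixels]
    return binary, cutoff


# Hypothetical pixel values, flattened from a small grayscale image.
pixels = [0, 0, 255, 255]
binary, cutoff = threshold_at_std_dev(pixels)
print(cutoff, binary)
```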

Check and edit your box file using "qt-box-editor-1.06.exe". It is very easy to use. Check all the boxes and the characters assigned to them; this is very important. Somewhere in your box file there are incorrect characters for 1 and 8.

Then run the remaining training commands.

mlissner

answered Dec 20 '11 at 13:47

yogeshjoshicolor

I already solved the problem (see comment on other answer). The order of the boxes was wrong in jTessBoxEditor. Thanks anyways. – Danilo Bargen Dec 20 '11 at 16:50
I trained Tesseract 3.02 for OCR-B. It gives close to 100% accuracy on the training set itself, but on real-life pictures the accuracy is almost zero. Did it work for you guys? – Masri Mar 04 '13 at 01:25
