Can't get tesseract settings right

Asked Jun 22 '16 at 22:18

I am trying to use tesseract on this image:

When I use the default configuration:

tesseract image.jpg stdout

It returns \KD FWOW.
As you can see, the only mistake is that the first letter, L, is recognized as a backslash.

So I created a config file named letters in /usr/share/tesseract-ocr/tessdata/configs with this setting:

tessedit_char_whitelist ABCDEFGHIJKLMNOPQRSTUWXYZ

The goal is to recognize just letters, not special characters. However, when I run tesseract with this config:

tesseract image.jpg stdout letters

The result is XKD FVOIV. Now more than one character is wrong, most notably the 'W'.

This makes no sense to me. I can't figure out why it stopped recognizing the W when it is on the whitelist. I must be missing something in the config.

How can I fix it?
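
For reference, the setup described above can be reproduced roughly as follows. This is only a sketch: the scratch directory created with mktemp is a hypothetical stand-in for /usr/share/tesseract-ocr/tessdata/configs, and the -c option is assumed to be available in the installed tesseract version as a way to set the same variable without a config file. The OCR step is skipped when tesseract or image.jpg is not present.

```shell
#!/bin/sh
# Hypothetical scratch directory; the question itself uses
# /usr/share/tesseract-ocr/tessdata/configs instead.
config_dir="$(mktemp -d)"

# A config file holds one "variable value" pair per line.
# The whitelist string is copied verbatim from the question.
printf 'tessedit_char_whitelist %s\n' 'ABCDEFGHIJKLMNOPQRSTUWXYZ' > "$config_dir/letters"
cat "$config_dir/letters"

# Only run OCR if tesseract and the image are actually available.
if command -v tesseract >/dev/null 2>&1 && [ -f image.jpg ]; then
    # Assumption: -c sets the same variable directly on the command line.
    tesseract image.jpg stdout -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUWXYZ
fi
```

Setting the variable with -c avoids writing into the system tessdata directory, which makes it easier to compare the whitelisted and default runs side by side.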

edited Jun 20 '20 at 09:12

Community

asked Jun 22 '16 at 22:18

Tales Pádua

Why not rectangularize the image first ... that is called preprocessing ... without proper preparation of the data, any CV operation is useless ... – Spektre Jun 23 '16 at 06:13
The image was prepared to this point, but I am not using OpenCV; I am using ImageMagick. – Tales Pádua Jun 23 '16 at 14:08
That does not matter, I do not use OpenCV either ... find the skew from the left and right ... and transform back to a rectangular bounding box, similar to this: http://stackoverflow.com/a/30273878/2521214 – Spektre Jun 23 '16 at 14:10

0 Answers