Low success rate with pytesser? Is this an issue of noise, or is there something else that needs to be done?

Question

I'm trying to detect a few uppercase characters from a screenshot. I convert it to black and white with PIL and then, using the code example from the PyTesser page, run tesser.exe on the image:

from pytesser import *
image = Image.open('fnord.tif') 
print image_to_string(image)

I'm using this image:

But it doesn't recognize it as an E, or really anything for that matter. I think that it's a clean enough capture? The noise at the top isn't throwing it off, right?

Is there something I'm missing?

I've run the command line util which shows `Tesseract Open Source OCR Engine v3.02 with Leptonica` - without a `psm` option - I get an empty file. Using `-psm 10 ` which is supposedly "treat the image as a single character" - I get `%` followed by two newlines... — Jon Clements, Aug 12 '12 at 18:00
[Limiting the characters tesseract looks for](http://stackoverflow.com/questions/2363490/limit-characters-tesseract-is-looking-for) helped me in the past. — user500198, Sep 01 '12 at 18:15
If the uppercase characters you are trying to recognize are in a unique font that is as clear as the one shown in the question, then there isn't much reason to rely on tesseract for that. Some simple topological features together with skeleton information can solve that directly. — mmgp, Feb 10 '13 at 05:36

score 1 · Answer 1 · answered Sep 19 '12 at 08:27

If you are concerned about whether the noise is an issue, open the image in MS Paint or something similar, remove the noise by hand, and then run the new image through the OCR. This is the best way to learn how the OCR engine works, what confuses it and what doesn't. Every OCR engine works differently.

In this case it could be the small bits of noise are confusing the character zoning process as well. You should check the bounding box values returned from the OCR engine to see if the OCR engine is even looking in the correct location for your word or character.
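One way to check the bounding boxes is to read them from a Tesseract box file. The sketch below assumes each line has the form `char left bottom right top page`, with coordinates measured from the bottom-left corner of the image, and that you produced the file with something like `tesseract fnord.tif out -psm 10 makebox` (treat that invocation as an assumption and check it against your Tesseract version). The helper names are made up for illustration; they are not part of pytesser.

```python
def parse_box_line(line):
    """Parse one box-file line such as 'E 12 5 40 50 0'.

    Returns the character and its box as (left, bottom, right, top),
    still in Tesseract's bottom-left-origin coordinates.
    """
    char, left, bottom, right, top, _page = line.strip().rsplit(None, 5)
    return char, (int(left), int(bottom), int(right), int(top))


def to_top_left(box, image_height):
    """Convert a bottom-left-origin box to PIL-style (left, upper, right, lower)."""
    left, bottom, right, top = box
    return (left, image_height - top, right, image_height - bottom)


def contains(outer, inner):
    """True if the top-left-origin box `outer` fully encloses `inner`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


# Hypothetical values: a 60-pixel-high image whose 'E' you know
# occupies roughly (15, 12) to (38, 50) in top-left coordinates.
char, raw_box = parse_box_line("E 12 5 40 50 0")
ocr_box = to_top_left(raw_box, image_height=60)
looking_in_right_place = contains(ocr_box, (15, 12, 38, 50))
```

If the reported box sits over the specks at the top of the image instead of over the letter, the noise is the more likely culprit.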

Some OCR engines have options to remove noise from an image during the OCR process. This is often called despeckle or noise removal. It would be possible to remove noise using Leptonica ( http://www.leptonica.org ), which is now part of the latest Tesseract releases.
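To give an idea of what despeckling does, here is a minimal pure-Python sketch. It is not Leptonica's implementation or any engine's built-in filter; it simply erases every connected blob of dark pixels smaller than a threshold. The image is assumed to be already binarised into a list of rows, with 1 for ink and 0 for background, and `min_size` is a value you would tune by hand.

```python
from collections import deque


def despeckle(grid, min_size):
    """Return a copy of a binary grid with small ink blobs erased.

    grid is a list of rows; 1 marks an ink pixel, 0 marks background.
    Any 8-connected blob with fewer than min_size pixels is removed.
    """
    height = len(grid)
    width = len(grid[0]) if height else 0
    result = [list(row) for row in grid]
    seen = [[False] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            if grid[y][x] != 1 or seen[y][x]:
                continue
            # Flood-fill one blob and collect its pixels.
            blob = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                blob.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and grid[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Blobs below the threshold are treated as noise.
            if len(blob) < min_size:
                for by, bx in blob:
                    result[by][bx] = 0
    return result


# A tiny 'E' with two stray specks in the top row.
noisy = [
    [1, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
]
clean = despeckle(noisy, min_size=3)
```

Here the two isolated pixels in the top row are dropped and the letter itself is left untouched. Running the OCR on both versions tells you whether the specks were the problem.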

Screen fonts present a big challenge to OCR engines because the DPI is often very low. In the case of your 'E' there should be more than enough pixels to be recognised. The heavy stroke weight could be confusing the engine.

Also, commercial engines are usually more accurate than Tesseract, but they come with expensive licence fees.
