Increase Accuracy of text recognition through pytesseract & PIL

Question

I am trying to extract text from an image. Because the quality and size of the image are poor, the results are inaccurate. I tried a few enhancements and other things with PIL, but that only made the image quality worse.

Can someone suggest image enhancements that give better results? A few example images:

As far as I understand it, there are quite rigid rules here on stackoverflow and one of this rules is to focus on answering the question and not on showing what else can be done to achieve the goal, if it was not asked for. So if you want also answers which can give you better results but are not based on enhancing the image feel free to ask for it in your question. — Claudio, Apr 14 '17 at 09:00
What about marking my answer as accepted? Have I missed to explain something? — Claudio, Apr 15 '17 at 14:54
P.S. check out my answer again - I have added some explanations to it. — Claudio, Apr 15 '17 at 15:03
Please be patient. I applied the concept and enlarged the image with PIL which gave better but not accurate result. I have not accepted the answer just to get some more answers. — sprksh, Apr 15 '17 at 15:42
I think it's time to get this question/answer cycle to an end, so that everyone can see that the question was answered ... — Claudio, Apr 27 '17 at 18:08

Claudio · Accepted Answer · 2017-04-16T16:42:09.453

In the provided example image the text is visually of quite good quality, so the question is why OCR gives inaccurate results.

To illustrate the conclusions in the rest of this answer, let's run the given image through Tesseract. Below is the result of Tesseract OCR:

"fhpgearedmomrs©gmachom"

Now let's enlarge the image fourfold and apply thresholding to it. I did the resizing and thresholding manually in GIMP, but with an appropriate resampling method and threshold value this can surely be automated with PIL, so that after the enhancement you get an image similar to the enhanced image I got:

The improved image run through Tesseract OCR gives the following text:

"fhpgearedmotors©gmail.com"

This demonstrates that enlarging an image can help to achieve 100% accuracy on the provided text-image example.
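The enlarge-and-threshold step described above was done by hand in GIMP, but it could be sketched in PIL roughly as follows. This is only a minimal sketch: the function name `enlarge_and_threshold`, the Lanczos resampling filter, and the threshold value of 128 are my assumptions, not the exact settings used for the image above, and you will probably need to tune them for your own images.

```python
from PIL import Image


def enlarge_and_threshold(img, scale=4, threshold=128):
    """Return a grayscale copy of img, enlarged by `scale` and binarized.

    `scale` and `threshold` are illustrative defaults (assumptions),
    not values taken from the original answer.
    """
    # Work on a single 8-bit grayscale channel.
    gray = img.convert("L")
    # Enlarge with a smooth resampling filter so glyph edges stay clean.
    new_size = (gray.width * scale, gray.height * scale)
    enlarged = gray.resize(new_size, Image.LANCZOS)
    # Map every pixel to pure white or pure black.
    return enlarged.point(lambda p: 255 if p >= threshold else 0)
```

The result can then be passed to OCR, for example with `pytesseract.image_to_string(enlarge_and_threshold(Image.open(path)))`, where `path` stands for your own image file.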

It may seem odd that enlarging an image improves OCR accuracy, but OCR was developed to convert scans of printed media to text and by design expects 300 dpi images. This explains why some OCR programs do not resize the text themselves to improve their results, and why they do badly on small fonts: they expect a higher resolution, which can be achieved by enlarging the image.

Here is an excerpt from the Tesseract FAQ on github.com that supports the statement above:

[There is a minimum text size for reasonable accuracy. You have to consider resolution as well as point size. Accuracy drops off below 10pt x 300dpi, rapidly below 8pt x 300dpi. A quick check is to count the pixels of the x-height of your characters. (X-height is the height of the lower case x.) At 10pt x 300dpi x-heights are typically about 20 pixels, although this can vary dramatically from font to font. Below an x-height of 10 pixels, you have very little chance of accurate results, and below about 8 pixels, most of the text will be "noise removed".]
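The x-height rule of thumb in the excerpt suggests one way to pick an enlargement factor: measure the x-height of the text in pixels and scale until it reaches roughly 20 pixels. Here is a small sketch of that arithmetic. The helper name `scale_for_x_height` is hypothetical, the 20-pixel target is taken from the FAQ figure for 10pt at 300dpi, and measuring the x-height itself is left to you.

```python
import math


def scale_for_x_height(measured_x_height_px, target_x_height_px=20):
    """Smallest integer factor that brings the x-height up to the target.

    The default target of 20 px follows the FAQ's "10pt x 300dpi" figure;
    treat it as a starting point, since x-height varies between fonts.
    """
    if measured_x_height_px <= 0:
        raise ValueError("x-height must be a positive number of pixels")
    # Never shrink: text that is already large enough keeps a factor of 1.
    return max(1, math.ceil(target_x_height_px / measured_x_height_px))
```

For example, text with an x-height of 5 pixels would call for a factor of 4, while text already at 20 pixels or more would be left at its original size.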

Very good explanation and good answer. I have been working on reading text from image(software for recognizing document sections) and I wanted to know if you have managed to get some kind of a dynamic variable of how many times you must enlarge image so it can recognize the text? For image that is 800x800 it is recognizing everything if enlarged to 1600x1600, but, image that is 30x800 needs to be enlarged to 120x3200 in order to recognize everything(commas, dots, slashes, etc...). Also, do you know why word "File" isn't recognized well? Char 'i' is not from english alphabet — Vulovic Vukasin, May 25 '17 at 14:17
