I am trying to use the Tesseract API. I am new to image processing and have been struggling with this for the last few days. With some simple algorithms I have reached about 70% accuracy, but I want to get to 90%+.
The problem is that the images are only 72 dpi. I also tried increasing the resolution, but that did not give good results. The images I am trying to recognize are attached.
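For reference, this is roughly how I upscaled the images before OCR (a minimal sketch; the scale factor and interpolation are just values I experimented with, not a known-good setting):

```cpp
#include <opencv2/opencv.hpp>

// Upscaling step I tried before OCR (sketch): roughly 72 dpi -> ~300 dpi
// by scaling 4x with cubic interpolation. Factor and interpolation are
// guesses I experimented with.
cv::Mat upscaleForOcr(const cv::Mat& src)
{
    cv::Mat dst;
    cv::resize(src, dst, cv::Size(), 4.0, 4.0, cv::INTER_CUBIC);
    return dst;
}
```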
Any help would be appreciated, and I am sorry if I am asking something very basic.
EDIT
I forgot to mention that I am trying to do all the processing and recognition within 2-2.5 seconds on Linux, and the text-detection method mentioned in this answer takes too long. Also, I would prefer not to use a command-line solution; I would rather work with Leptonica or OpenCV.
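To be clear about what I mean by "not command line", this is the kind of direct API call I am making (a sketch, assuming a grayscale OpenCV Mat as input; paths and settings are just my local setup):

```cpp
#include <tesseract/baseapi.h>
#include <opencv2/opencv.hpp>
#include <string>

// Sketch of how I call the Tesseract API directly from C++ on a grayscale
// OpenCV Mat instead of shelling out to the command line.
std::string recognize(const cv::Mat& gray)
{
    tesseract::TessBaseAPI api;
    if (api.Init(nullptr, "eng") != 0)          // default tessdata path, English
        return "";
    api.SetPageSegMode(tesseract::PSM_AUTO);
    api.SetImage(gray.data, gray.cols, gray.rows,
                 1 /* bytes per pixel */, static_cast<int>(gray.step));
    api.SetSourceResolution(300);               // hint, since the originals are 72 dpi
    char* out = api.GetUTF8Text();
    std::string text = out ? out : "";
    delete [] out;
    api.End();
    return text;
}
```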
Most of the images are uploaded here.
I have tried the following approaches to binarize the tickets, but with no luck:
- http://www.vincent-net.com/luc/papers/10wiley_morpho_DIAapps.pdf
- http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.193.6347&rep=rep1&type=pdf
- http://iit.demokritos.gr/~bgat/PatRec2006.pdf
- http://psych.stanford.edu/~jlm/pdfs/Sternberg67.pdf
The tickets have:
- somewhat poor lighting
- non-text areas
- low resolution
Feeding the image directly to the Tesseract API gives me about 70% accuracy in 1 second on average, but I want to increase the accuracy while keeping the time budget in mind. So far I have tried:
- detecting the edges of the image
- blob analysis
- binarizing the ticket using adaptive thresholding
When I then fed those binarized images to Tesseract, the accuracy dropped to 50-60%, even though the binarized images look perfect.
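For completeness, the binarization step looked roughly like this (a sketch; the block size and offset are the parameters I was tuning, not definitive values):

```cpp
#include <opencv2/opencv.hpp>

// Sketch of the adaptive-threshold binarization I fed to Tesseract.
// blockSize / offset are the knobs I kept tuning; values here are examples.
cv::Mat binarizeTicket(const cv::Mat& bgr)
{
    cv::Mat gray, bin;
    cv::cvtColor(bgr, gray, cv::COLOR_BGR2GRAY);
    cv::GaussianBlur(gray, gray, cv::Size(3, 3), 0);       // mild denoise
    cv::adaptiveThreshold(gray, bin, 255,
                          cv::ADAPTIVE_THRESH_GAUSSIAN_C,
                          cv::THRESH_BINARY,
                          31,   // neighbourhood (block) size, must be odd
                          15);  // constant subtracted from the local mean
    return bin;
}
```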