How to improve Tesseract OCR accuracy?

Question

I have a PDF containing a scanned document from which I need to read certain parts. I already did this with Google Cloud OCR, but I just noticed it may not be adequate because I will exceed the monthly quota (1k requests/month), so I'm switching to Tesseract instead.

The project runs on Windows and Java, but I'm currently doing some tests on Linux.

I am not uploading any of my original images because I'm not sure whether they contain sensitive information; instead I'm using some images from the internet that are VERY similar.

I have read that I can improve Tesseract's results by preprocessing the original image (using TextCleaner?). I would like to know how to do that kind of thing in a Windows/Java environment and, most importantly, how to successfully eliminate the dark background of the table and, if possible, the table's horizontal and vertical lines, as they don't help at all during OCR.

I wasn't. I tried training Tesseract and also tried a library called ocropy, with no success. I got the best results with Google OCR, but they still weren't as good as I expected. – Manzha, Aug 01 '19 at 22:11

score 0 · Answer 1 · answered Jan 24 '18 at 19:05

Yes, you are right: cleaning the image gives better recognition. See https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality.

BluEOS

I have read it before and tried ImageMagick textcleaner to improve my image quality (deskewing and removing the background), but I'm not having any success. Also, those tools are for Linux or Python, and I'm looking for something that works in Java in a Windows environment. – Manzha Jan 24 '18 at 19:10

score 0 · Answer 2 · answered Mar 28 '18 at 10:28

You can use ImageMagick to sharpen the image and bring it to a high resolution. Tesseract works better on high-resolution images. If you are using Python (I think you aren't), Pillow (PIL, the Python Imaging Library) works great for enhancing image quality.
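
Since the question asks for Java on Windows, the same idea can be sketched with the standard `java.awt` and `javax.imageio` classes instead of ImageMagick. This is only a minimal sketch: the class name `OcrUpscale`, the method `upscaleToGray`, and the scale factor of 2.0 are assumptions made for illustration, and the best factor depends on the resolution of the scan.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical helper: enlarges a scan and converts it to grayscale before OCR.
class OcrUpscale {

    /** Returns a grayscale copy of src, enlarged by factor with bicubic interpolation. */
    static BufferedImage upscaleToGray(BufferedImage src, double factor) {
        int w = (int) Math.round(src.getWidth() * factor);
        int h = (int) Math.round(src.getHeight() * factor);
        BufferedImage dst = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = dst.createGraphics();
        // Bicubic interpolation keeps character edges smoother than the default.
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                RenderingHints.VALUE_INTERPOLATION_BICUBIC);
        g.drawImage(src, 0, 0, w, h, null);
        g.dispose();
        return dst;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: java OcrUpscale <input image> <output.png>");
            return;
        }
        BufferedImage in = ImageIO.read(new File(args[0]));
        if (in == null) {
            throw new IOException("Unsupported image format: " + args[0]);
        }
        // The factor 2.0 is an assumed starting point, not a recommended value.
        ImageIO.write(upscaleToGray(in, 2.0), "png", new File(args[1]));
    }
}
```

The resulting PNG can then be passed to Tesseract in place of the original scan.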

Sarvas

score 0 · Answer 3 · answered Mar 28 '18 at 19:51

My textcleaner script will not help much with this image. It won't remove the dark background, especially since it is textured. For other images with large regions of nearly constant color, it can make that background white. But it runs only on Unix-like systems, not from Java. So on Windows you would need to use the Windows 10 built-in Unix (Windows Subsystem for Linux) or install Cygwin.

Here is one example command from http://www.fmwconcepts.com/imagemagick/textcleaner/index.php:

textcleaner -g -e stretch -f 25 -o 10 -s 1 twinkle.jpg twinkle_g_stretch_f25_o10_s1.jpg
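
For readers who need to stay in Java, one related idea is local mean thresholding: a pixel counts as ink only if it is clearly darker than the average of its neighborhood, so a background of nearly constant color turns white even when it is dark. The sketch below is not a port of the textcleaner script; the class `LocalMeanThreshold` and its `radius` and `offset` parameters are hypothetical names chosen for illustration, and a strongly textured background may still leave speckles.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper: whitens the background with a local mean threshold.
class LocalMeanThreshold {

    /**
     * Marks a pixel as ink (true) when it is darker than the mean of the
     * surrounding (2*radius+1)^2 window by more than offset gray levels.
     * gray[y][x] holds values in 0..255.
     */
    static boolean[][] inkMask(int[][] gray, int radius, int offset) {
        int h = gray.length, w = gray[0].length;
        // Integral image: integral[y][x] is the sum of all pixels above and left.
        long[][] integral = new long[h + 1][w + 1];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                integral[y + 1][x + 1] = gray[y][x]
                        + integral[y][x + 1] + integral[y + 1][x] - integral[y][x];
            }
        }
        boolean[][] ink = new boolean[h][w];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                // Clamp the window at the image borders.
                int y0 = Math.max(0, y - radius), y1 = Math.min(h - 1, y + radius);
                int x0 = Math.max(0, x - radius), x1 = Math.min(w - 1, x + radius);
                long sum = integral[y1 + 1][x1 + 1] - integral[y0][x1 + 1]
                        - integral[y1 + 1][x0] + integral[y0][x0];
                long count = (long) (y1 - y0 + 1) * (x1 - x0 + 1);
                // Same test as gray < mean - offset, kept in integer arithmetic.
                ink[y][x] = gray[y][x] * count < sum - offset * count;
            }
        }
        return ink;
    }

    /** Returns a black-on-white copy of src using the mask above. */
    static BufferedImage clean(BufferedImage src, int radius, int offset) {
        int w = src.getWidth(), h = src.getHeight();
        int[][] gray = new int[h][w];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int rgb = src.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF, g = (rgb >> 8) & 0xFF, b = rgb & 0xFF;
                gray[y][x] = (299 * r + 587 * g + 114 * b) / 1000;
            }
        }
        boolean[][] ink = inkMask(gray, radius, offset);
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                out.setRGB(x, y, ink[y][x] ? 0xFF000000 : 0xFFFFFFFF);
            }
        }
        return out;
    }
}
```

A call such as `LocalMeanThreshold.clean(image, 12, 10)` would be a starting point; both numbers are assumptions to tune per document, with the window somewhat larger than the stroke width of the text.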

score 0 · Answer 4 · answered Sep 18 '18 at 04:40

The quality of text recognition depends on several factors, above all the quality of the input image. This is why OCR engines typically publish guidelines on the quality and size of the input image; following them helps the engine produce accurate results.

This is where image preprocessing comes into play: it improves the quality of the input image so that the OCR engine can give you more accurate output.
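
One common preprocessing step is binarization, which turns the scan into pure black and white before OCR. As a minimal sketch in Java (the asker's platform), the following uses Otsu's method to pick a single global threshold. It is not taken from the linked article, and the class and method names are hypothetical. A global threshold works best when the background is fairly uniform.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper: global binarization with Otsu's threshold.
class OtsuBinarizer {

    /** Luminance (0..255) of a packed RGB pixel. */
    static int gray(int rgb) {
        int r = (rgb >> 16) & 0xFF, g = (rgb >> 8) & 0xFF, b = rgb & 0xFF;
        return (299 * r + 587 * g + 114 * b) / 1000;
    }

    /** Returns the threshold that maximizes the between-class variance. */
    static int otsuThreshold(int[] hist, int total) {
        long sumAll = 0;
        for (int i = 0; i < 256; i++) {
            sumAll += (long) i * hist[i];
        }
        long sumDark = 0;
        int countDark = 0;
        double best = -1;
        int threshold = 0;
        for (int t = 0; t < 256; t++) {
            countDark += hist[t];
            if (countDark == 0) continue;      // no pixels in the dark class yet
            int countLight = total - countDark;
            if (countLight == 0) break;        // every pixel is already in the dark class
            sumDark += (long) t * hist[t];
            double meanDark = (double) sumDark / countDark;
            double meanLight = (double) (sumAll - sumDark) / countLight;
            double diff = meanDark - meanLight;
            double between = (double) countDark * countLight * diff * diff;
            if (between > best) {
                best = between;
                threshold = t;
            }
        }
        return threshold;
    }

    /** Pixels at or below the threshold become black, all others white. */
    static BufferedImage binarize(BufferedImage src) {
        int w = src.getWidth(), h = src.getHeight();
        int[] hist = new int[256];
        int[][] g = new int[h][w];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                g[y][x] = gray(src.getRGB(x, y));
                hist[g[y][x]]++;
            }
        }
        int t = otsuThreshold(hist, w * h);
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                out.setRGB(x, y, g[y][x] <= t ? 0xFF000000 : 0xFFFFFFFF);
            }
        }
        return out;
    }
}
```

An image loaded with `ImageIO.read` can be passed to `OtsuBinarizer.binarize` and the result written out as a PNG for the OCR engine.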

I have written a detailed article on image preprocessing in Python. Follow the link below for a fuller explanation.

https://medium.com/cashify-engineering/improve-accuracy-of-ocr-using-image-preprocessing-8df29ec3a033

While this may answer the question, [it would be preferable](//meta.stackoverflow.com/q/8259) to include the essential parts of the answer here, and provide the link for reference. — jhpratt, Sep 18 '18 at 05:00
