Improving results by Tesseract

Question

I have a PDF document that contains some text in Hindi. I converted it into a .tiff image using ImageMagick, with the command:

magick convert -density 300 filename.pdf -depth 8 test.tiff

Then I used Tesseract to perform OCR on the .tiff image:

C:\Users\H.P\Downloads>tesseract test.tiff test1.txt -l hin
Tesseract Open Source OCR Engine v3.05.01 with Leptonica
Page 1
Page 2
Page 3

But the result is not usable at all. The options I see for improving it are:

Preprocessing the image.
Training Tesseract for the particular font.

Given how clean the text in the .pdf file is, I am leaning towards the assumption that it does not need any preprocessing. However, since the text is laid out in columns, it might need some segmentation. As I am not sure which steps to take, I thought I would ask before doing anything.

So, what should be done to the given image, in order for Tesseract to perform better?

The document looks something like:

Remove the lines/tables; this answer https://stackoverflow.com/questions/33452222/detect-table-with-opencv/46806306#46806306 might be helpful. — flamelite, May 28 '18 at 09:55
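To illustrate the idea of removing ruling lines before OCR, here is a minimal pure-Python sketch. It is not the method from the linked answer; it is a simplified run-length filter that assumes the page has already been binarized into a list of rows with 1 for ink and 0 for background. The function name `remove_long_runs` and the `min_len` threshold are made up for this example.

```python
def remove_long_runs(img, min_len):
    """Return a copy of a binary image (1 = ink, 0 = background) in which
    every horizontal or vertical run of ink at least min_len long is erased."""
    h = len(img)
    w = len(img[0]) if h else 0
    erase = [[False] * w for _ in range(h)]

    # Mark long horizontal runs of ink.
    for y in range(h):
        x = 0
        while x < w:
            if img[y][x]:
                start = x
                while x < w and img[y][x]:
                    x += 1
                if x - start >= min_len:
                    for i in range(start, x):
                        erase[y][i] = True
            else:
                x += 1

    # Mark long vertical runs of ink.
    for x in range(w):
        y = 0
        while y < h:
            if img[y][x]:
                start = y
                while y < h and img[y][x]:
                    y += 1
                if y - start >= min_len:
                    for j in range(start, y):
                        erase[j][x] = True
            else:
                y += 1

    return [[0 if erase[y][x] else img[y][x] for x in range(w)]
            for y in range(h)]


# Toy page: a 6x5 box outline with a two-pixel "word" inside it.
page = [
    [1, 1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0, 1],
    [1, 0, 1, 1, 0, 1],
    [1, 0, 0, 0, 0, 1],
    [1, 1, 1, 1, 1, 1],
]
cleaned = remove_long_runs(page, min_len=5)
```

One caveat for Hindi: Devanagari words carry a horizontal headline stroke, so `min_len` would need to be clearly longer than the widest word, otherwise the filter would erase part of the text as well.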
It would be helpful if you let others know whether you found a solution or you changed the tool or did some manipulations to the image in order to arrive at a satisfactory solution. — SKR, Sep 24 '18 at 05:11
@SKR Eventually all I had to do was upgrade to Tesseract 4, to improve the results. — Mooncrater, Sep 24 '18 at 05:13
@Mooncrater Was the upgrade alone enough? a) Did you do any additional image processing apart from converting to .tiff as you said? b) In this voter list, were you able to recognize the numbers in the boxes, such as 819, 824, etc., and the text in between the lines? c) Can you tell me the tesseract command options you used for config? Thanks — SKR, Sep 24 '18 at 05:43
@SKR a. No. Since the PDF already was of high quality for text extraction. b. Yes. I had to crop those particular parts out, and use Tesseract on them. c. Yes. Look at [this](https://stackoverflow.com/a/44085281/5345646). For the box number, we might think of using `--psm 8` but `--psm 7` worked better for me. — Mooncrater, Sep 24 '18 at 06:57
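For readers who want to try the page segmentation modes mentioned above, the following is a rough sketch of how the command line could be assembled from Python. It assumes Tesseract 4, where the option is spelled `--psm`, and an installed `hin` language pack. The helper names `tesseract_cmd` and `run_ocr` and the file name `box_819.tiff` are hypothetical.

```python
import subprocess


def tesseract_cmd(image, out_base, lang="hin", psm=None):
    """Build the argument list for a tesseract call.

    psm=7 treats the image as a single text line,
    psm=8 treats it as a single word.
    """
    cmd = ["tesseract", image, out_base, "-l", lang]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    return cmd


def run_ocr(image, out_base, lang="hin", psm=None):
    """Run tesseract; requires the tesseract binary on PATH."""
    subprocess.run(tesseract_cmd(image, out_base, lang, psm), check=True)


# Hypothetical cropped box image, recognized as a single text line.
example = tesseract_cmd("box_819.tiff", "box_819", psm=7)
```

Note that Tesseract appends `.txt` to the output base name itself, so passing `test1.txt` as in the question produces a file called `test1.txt.txt`.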
@Mooncrater a) What is your standard for a high-quality PDF? b) Converting to .tiff goes from vector to raster, so how do you preserve the same quality? c) Manual cropping for such small boxes and so many docs can be really cumbersome. I have some text bounded inside a circle and a square. What do you advise for detecting such text? d) The link you provided is the same as man tesseract. Do you know of any source that clearly explains each config option? — SKR, Sep 24 '18 at 08:34
@SKR a. [This](https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality) suggests image preprocessing steps: binarizing, removing noise, deskewing, and removing borders. My PDF already satisfied all of those, and the output was near perfect for all the images. b. Why would I convert to raster? .tiff works fine with `pytesseract`. (I don't really know about raster formats.) c. The format was predictable, so cropping was automated. Are the shapes coloured? Are they removable? d. I don't think I have anything better. I would recommend trial and error. — Mooncrater, Sep 24 '18 at 19:45
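The comment above does not say how the cropping was automated. One possible approach for a predictable layout, shown here purely as an assumption, is to compute the crop rectangles from a regular grid. The function `grid_boxes` and the page dimensions are invented for this sketch; the tuples are in (left, upper, right, lower) order, which is the order Pillow's `Image.crop` expects.

```python
def grid_boxes(page_w, page_h, rows, cols, margin=0):
    """Return crop rectangles (left, upper, right, lower) for a regular
    rows x cols grid inside a page with a uniform margin, row by row."""
    cell_w = (page_w - 2 * margin) // cols
    cell_h = (page_h - 2 * margin) // rows
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left = margin + c * cell_w
            top = margin + r * cell_h
            boxes.append((left, top, left + cell_w, top + cell_h))
    return boxes


# Hypothetical 300x200 pixel page split into 2 rows and 3 columns.
boxes = grid_boxes(300, 200, rows=2, cols=3)
```

Each rectangle could then be cropped out and passed to Tesseract on its own, which matches the crop-then-recognize workflow described earlier in the thread.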
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/180689/discussion-between-mooncrater-and-skr). — Mooncrater, Sep 24 '18 at 20:03
@Mooncrater Sorry, I couldn't come to chat, and now the chat room doesn't exist anymore. Continuing our discussion: c) How was your cropping automated? Look [at this](https://imgur.com/21zWk4A) for example, and kindly let me know how I can read the text inside circles, boxes, boxes with diamonds, rectangles, or cylindrical shapes. They are not colored, but there are so many of them in the main image that I can't possibly handle them manually. I tried detecting circles with Hough transforms, but now I want to know how to remove the shapes and leave the text intact. — SKR, Oct 16 '18 at 03:14
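The thread does not contain an answer to this last question. One common idea, sketched below under several assumptions, is connected-component filtering: outlines such as circles and boxes form ink components much larger than a single character, so components whose bounding box exceeds a size limit can be erased. The sketch assumes a binarized image given as rows of 0/1, shapes that do not touch the text, and made-up names (`drop_large_components`, `max_w`, `max_h`).

```python
from collections import deque


def drop_large_components(img, max_w, max_h):
    """Erase every 8-connected ink component whose bounding box is wider
    than max_w or taller than max_h; smaller components are kept."""
    h = len(img)
    w = len(img[0]) if h else 0
    seen = [[False] * w for _ in range(h)]
    out = [row[:] for row in img]

    for y0 in range(h):
        for x0 in range(w):
            if not img[y0][x0] or seen[y0][x0]:
                continue
            # Breadth-first search collects one connected component.
            queue = deque([(y0, x0)])
            seen[y0][x0] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            ys = [p[0] for p in pixels]
            xs = [p[1] for p in pixels]
            too_tall = max(ys) - min(ys) + 1 > max_h
            too_wide = max(xs) - min(xs) + 1 > max_w
            if too_tall or too_wide:
                for y, x in pixels:
                    out[y][x] = 0
    return out


# Toy image: a 7x7 box outline with a two-pixel "word" in the middle.
shape = [
    [1, 1, 1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0, 0, 1],
    [1, 0, 1, 1, 0, 0, 1],
    [1, 0, 0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1],
]
text_only = drop_large_components(shape, max_w=3, max_h=3)
```

On real scans the limits would have to be tuned to the character size at the chosen resolution, and text that touches a shape would be removed along with it.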
