I have a PDF document that contains some text in Hindi. I converted it into a .tiff image using ImageMagick, with the command:
magick convert -density 300 filename.pdf -depth 8 test.tiff
Then, I used tesseract to perform OCR on the .tiff picture:
C:\Users\H.P\Downloads>tesseract test.tiff test1.txt -l hin
Tesseract Open Source OCR Engine v3.05.01 with Leptonica
Page 1
Page 2
Page 3
But the recognized text is far from accurate. The options I see for improving the result are:
- Preprocessing the image.
- Training Tesseract for the particular font.
Given how clean the text in the .pdf file is, I'm inclined to assume that it doesn't need any preprocessing. However, since the text is laid out in columns, it might need some segmentation. As I'm not sure which steps to take, I thought it better to ask before doing anything.
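To make clear what I mean by segmentation, here is a minimal sketch in plain Python of splitting a page into columns with a vertical projection profile. It assumes the page has already been binarized into rows of 0/1 values (1 = dark pixel). The function name find_text_columns and the min_gap threshold are my own placeholders, not part of Tesseract or ImageMagick, and I don't know whether this is the right approach for Tesseract.

```python
def find_text_columns(ink, min_gap=3):
    """Return (start, end) x-ranges of text columns, end exclusive.

    ink: list of rows, each a list of 0/1 values (1 = dark pixel).
    min_gap: minimum width, in pixels, of an empty vertical strip that
             is treated as a column separator (an assumed threshold).
    """
    if not ink:
        return []
    width = len(ink[0])
    # Vertical projection profile: dark-pixel count at each x position.
    profile = [sum(row[x] for row in ink) for x in range(width)]

    columns = []
    start = None  # x where the current column began, if any
    gap = 0       # length of the current run of empty x positions
    for x, count in enumerate(profile):
        if count > 0:
            if start is None:
                start = x
            gap = 0  # narrow gaps (e.g. between letters) are absorbed
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # The gap is wide enough: close the column at its last inked x.
                columns.append((start, x - gap + 1))
                start = None
                gap = 0
    if start is not None:
        # Close a column that runs to the right edge of the page.
        columns.append((start, width - gap))
    return columns


# Tiny made-up page: two 4-pixel-wide columns separated by a 4-pixel gap.
page = [
    [1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1],
    [1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1],
]
print(find_text_columns(page))  # [(0, 4), (8, 12)]
```

Each returned range could then be cropped out and passed to Tesseract as a separate image, but that is only my guess at how segmentation would help.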
So, what should be done to the image for Tesseract to perform better?