
My project involves transcribing text from JPG images into text files, and we are currently using Tesseract. However, Tesseract is not transcribing these images well, so I would like to preprocess the images to make them better suited as input to Tesseract.

Here is an example image: http://i46.tinypic.com/opramo.jpg

Basically, these are old news articles in image form.

Any suggestions on which image processing tool to use? Thank you!

Sardonic
  • The image quality is so bad that I don't think you can improve it automatically. However, you can try ImageMagick. – Eddy_Em Mar 21 '13 at 08:01
  • The noise seems to have been removed quite well, but the text suffers from disjointed and joined characters. Is this the original image you are given? Or do you have control over the steps from the original scanned document to get to this binarized version? Perhaps you can improve that process (with the focus on preparing the document for OCR). – Noremac Mar 22 '13 at 14:43
  • I only have a PDF or JPG file of the image. What do you mean by focusing on preparing the document for OCR? – Sardonic Mar 23 '13 at 20:22
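
To illustrate the "disjointed characters" problem raised in the comments, here is a minimal sketch of one common preprocessing idea, morphological closing (dilate, then erode), applied to an already binarized image. It is not something proposed in the thread and not a full OCR pipeline: the function names (`dilate`, `erode`, `close_gaps`), the 3x3 structuring element, and the nested-list image representation (1 = ink, 0 = background) are all assumptions made for illustration. In practice an image library's built-in morphology operations would be used instead.

```python
# Morphological closing on a binary image stored as a list of rows
# (1 = ink, 0 = background). Pure-Python sketch for illustration only.

def _neighborhood(img, y, x):
    """Yield the 3x3 neighborhood of (y, x); out-of-bounds pixels count as 0."""
    h, w = len(img), len(img[0])
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            ny, nx = y + dy, x + dx
            yield img[ny][nx] if 0 <= ny < h and 0 <= nx < w else 0

def dilate(img):
    """A pixel becomes ink if any pixel in its 3x3 neighborhood is ink."""
    return [[int(any(_neighborhood(img, y, x))) for x in range(len(img[0]))]
            for y in range(len(img))]

def erode(img):
    """A pixel stays ink only if its whole 3x3 neighborhood is ink."""
    return [[int(all(_neighborhood(img, y, x))) for x in range(len(img[0]))]
            for y in range(len(img))]

def close_gaps(img):
    """Closing = dilation followed by erosion; fills gaps narrower than the element."""
    w = len(img[0])
    # Pad with one background pixel on every side so ink at the border survives.
    padded = [[0] * (w + 2)] + [[0] + list(row) + [0] for row in img] + [[0] * (w + 2)]
    closed = erode(dilate(padded))
    return [row[1:-1] for row in closed[1:-1]]

# A horizontal stroke with a one-pixel break in the middle.
broken = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
]
repaired = close_gaps(broken)
```

The trade-off is the one already noted in the comments: the same operation that reconnects a broken stroke can also merge neighboring characters that sit too close together, so the size of the structuring element would need tuning against real scans.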
