9

I use Tesseract OCR to extract text from scanned images. For a few images the text is not recognized properly because of low resolution, and the output is a string of irrelevant characters.

Techniques applied:

  1. Increase the DPI to 300.

  2. Image pre-processing techniques in OpenCV.

  3. Upscaling of images using dnn_superres in OpenCV.

  4. Noise removal techniques.

  5. Referred to Git repos where a super-resolution model is developed using deep learning.

  6. Improve Tesseract OCR quality by training tessdata.
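
To make items 1 and 2 concrete, here is a minimal, dependency-free sketch of two classical steps: computing the resize factor needed to reach 300 DPI, and binarizing with Otsu's threshold. The function names `scale_for_dpi`, `otsu_threshold` and `binarize` are hypothetical and chosen only for illustration, and Otsu binarization is an assumption about what "pre-processing" covers here, not necessarily what was run. In practice `cv2.resize` and `cv2.threshold` with `cv2.THRESH_OTSU` do the same work much faster.

```python
def scale_for_dpi(current_dpi, target_dpi=300):
    """Resize factor that brings a scan from current_dpi to target_dpi."""
    if current_dpi <= 0:
        raise ValueError("current_dpi must be positive")
    return target_dpi / current_dpi


def otsu_threshold(pixels):
    """Return the Otsu threshold for an iterable of 8-bit grey values (0-255)."""
    # Histogram of grey levels.
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0      # number of pixels at or below the candidate threshold
    sum_bg = 0    # sum of their grey values
    for t in range(256):
        w_bg += hist[t]
        if w_bg == 0:
            continue          # no background pixels yet
        w_fg = total - w_bg
        if w_fg == 0:
            break             # every pixel is already in the background class
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance (up to a constant factor); Otsu maximizes it.
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image, threshold):
    """Map a 2-D list of grey values to 0 (dark) or 255 (light)."""
    return [[255 if p > threshold else 0 for p in row] for row in image]
```

For a 150 DPI scan, `scale_for_dpi(150)` gives a factor of 2, which would be passed as the `fx` and `fy` arguments of `cv2.resize`. Note that resizing only changes the pixel density. It cannot restore detail that the scan never captured.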

Reference Links:

  1. Improve OCR accuracy from scanned documents
  2. image processing to improve tesseract OCR accuracy

Sample Image:

[image]

Is there any simple way in Python to improve the recognized text without using a deep learning model?

Jennifer
  • 119
  • 1
  • 8
  • 6

  • Sadly, there is often no substitute for starting with an image of at least minimally acceptable quality. I couldn't do anything for this image using scaling and morphology tricks. I would be impressed if deep learning worked on an image like this. I suppose if you had *many* training documents that looked like this in the exact same font, you might have a chance. – bfris May 09 '20 at 04:52
  • You might get some results from a maximum likelihood network based on the same font characters. It will be slow going and you'll still get false matches; at that point you will be able to use a spelling checker. Even so, when information isn't there, you can't fake it. Some of those characters might make even a *human* unsure (e.g. "bear" vs "hear"). – LSerni May 13 '20 at 22:48
  • Have you tried the filters from https://towardsdatascience.com/ocr-with-akka-tesseract-and-javacv-part-1-702781fc73ca? It's Scala, but that should not be an issue as long as it calls cv2. – marek.kapowicki Feb 08 '21 at 21:26

1 Answer

7

I am aware you would prefer to upscale these input images without using deep learning, but I would highly recommend experimenting with https://github.com/alexjc/neural-enhance, assuming you have the appropriate hardware to run the neural networks.

The results for your OCR input images could be promising. The documentation for the code is quite substantial.

Hope this helps you!

Matthew Smith
  • 508
  • 7
  • 22