0

I have to recognise text in a picture like this:

Image to recognise

I have tried Tesseract, but I am not very happy with the results.

Could you recommend any software that is more accurate at "text recognition on image" rather than "text recognition on document"?

Thanks in advance

jnovacho
froggy_
  • Your question is likely to be off-topic for StackOverflow. Possibly StackExchange SuperUser might be a better group. On the other hand, why not edit your question to give a more detailed explanation of your problem rather than just "I am not very happy with the results". – Tedinoz Jun 28 '19 at 10:52

2 Answers

1

Don't expect Tesseract to work out of the box. This image needs some work before it is passed to Tesseract.

I would do the following preprocessing:

  1. blur the image slightly to remove some of the digital noise
  2. apply adaptive thresholding with suitable parameters
  3. correct the image colors to get a white background and black text
    • this should be an easy operation: just invert the colors if necessary
  4. run Tesseract with the correct language files (Italian, I guess?)

These preprocessing steps are really easy to program by hand, but of course there are plenty of libraries with these capabilities.

As a starting point see this: Preprocessing image for Tesseract OCR with OpenCV

jnovacho
  • Thanks so much!! I will try this. But what do you mean when you say "blur"?? P.s: the language is Spanish :) – froggy_ Jun 28 '19 at 10:59
  • On the wiki there is a nice example showing what "blur" does to an image: https://en.wikipedia.org/wiki/Gaussian_blur#/media/File:Halftone,_Gaussian_Blur.jpg You have to be careful not to overdo it, but a small amount of blurring can usually improve the results of thresholding, because it removes noise. – jnovacho Jun 28 '19 at 11:02
0

I don't know of any ready-made software that would do text extraction on your specific image without a lot of additional configuration, but you can probably improve your Tesseract results.

You can try to treat the image so that it's easier for Tesseract to recognize. Set tessedit_write_images to true to see your image after Tesseract applies its automatic adjustments.
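One way to set that variable is the `-c` option of the Tesseract command line. Here is a small sketch that builds and runs such a call from Python. The helper names `build_debug_command` and `run_debug_ocr` and the file names are hypothetical, and `spa` is assumed because the text is Spanish.

```python
import subprocess


def build_debug_command(image_path, output_base, lang="spa"):
    """Build a tesseract CLI call that also dumps the internally processed image."""
    return [
        "tesseract", image_path, output_base,
        "-l", lang,
        # Ask Tesseract to write out the image it actually recognises from.
        "-c", "tessedit_write_images=true",
    ]


def run_debug_ocr(image_path, output_base="out", lang="spa"):
    # check=True raises CalledProcessError if tesseract exits with an error.
    subprocess.run(build_debug_command(image_path, output_base, lang), check=True)
```

As far as I know, the processed image is written to the working directory as `tessinput.tif`, but check this against your Tesseract version.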

The result probably isn't the best, so you can do the adjustments yourself with one of the many libraries or programs available. Your goal should be to transform the picture into a black-on-white text image with as little noise as possible.

For this read: ImproveQuality

You can also try to train Tesseract on your specific data, but this requires a lot more work and large amounts of training data. Read: TrainingTesseract 4.0

victormeriqui