
I have this Python code which I use to convert the text in a picture to a string. It works for certain images with large characters, but not for the one I'm trying now, which contains only digits.

This is the picture:

[image: digits]

This is my code:

import pytesseract
from PIL import Image

img = Image.open('img.png')
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
result = pytesseract.image_to_string(img)
print(result)

Why is it failing to recognise this specific image, and how can I solve this problem?

Davide Fiocco
alioua walid
  • You can try limiting the sample space of characters by only allowing digits in the output. More on this: [whitelisting characters in pytesseract](https://stackoverflow.com/questions/43705481/pytesser-set-character-whitelist) – Vasu Deo.S May 13 '19 at 20:24
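As a rough illustration of that suggestion (a sketch only: `--psm 7` assumes the picture contains a single line of text, and `tessedit_char_whitelist` is ignored by the LSTM engine in Tesseract 4.0, so behaviour can vary with your version):

import pytesseract
from PIL import Image

img = Image.open('img.png')
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
# Treat the image as a single text line and restrict the output to digits
config = '--psm 7 -c tessedit_char_whitelist=0123456789'
result = pytesseract.image_to_string(img, config=config)
print(result)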

1 Answer


I have two suggestions.

First, and this is by far the most important: in OCR, preprocessing the images is key to obtaining good results. In your case I suggest binarization. Your images look very clean, so you shouldn't have any problem, but if you do, try binarizing them:

import cv2
from PIL import Image

# cv2.imread loads the image as a BGR array
img = cv2.imread('gradient.png')
# If your image is not already grayscale:
# img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
threshold = 180  # to be determined
_, img_binarized = cv2.threshold(img, threshold, 255, cv2.THRESH_BINARY)
pil_img = Image.fromarray(img_binarized)

Then run the OCR again on the binarized image.

Check whether your image is already in grayscale and uncomment the conversion line if needed.
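For example, a quick check on the OpenCV array from the snippet above (a grayscale image has 2 dimensions, a colour image has 3):

# Convert only if the image still has colour channels
if img.ndim == 3:
    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)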

This is simple global thresholding. Adaptive thresholding also exists, but it is noisier and does not add anything in your case.
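For completeness, a rough sketch of what adaptive thresholding would look like (it expects a single-channel image, and the block size and constant below are placeholder values, not tuned for your picture):

# Compute a local threshold per neighbourhood instead of one global value
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) if img.ndim == 3 else img
img_adaptive = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                     cv2.THRESH_BINARY, 11, 2)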

Binarized images are much easier for Tesseract to handle. Tesseract already does this internally (https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality), but that step can go wrong, and it is very often useful to do your own preprocessing.

You can check whether the threshold value is right by looking at the images:

import matplotlib.pyplot as plt

plt.imshow(img, cmap='gray')
plt.show()
plt.imshow(img_binarized, cmap='gray')
plt.show()

Second, if what I said above still doesn't work (and I know this doesn't answer "why doesn't pytesseract work here"), I suggest you try out tesserocr. It is a maintained Python wrapper for Tesseract.

You could try:

import tesserocr

# pil_img is the binarized PIL image from the snippet above
text_from_ocr = tesserocr.image_to_text(pil_img)
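If you would rather skip PIL entirely, the tesserocr PyPI page also shows a direct-from-file helper (shown here with the question's filename as an assumption):

import tesserocr

text_from_ocr = tesserocr.file_to_text('img.png')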

Here is the doc for tesserocr on PyPI: https://pypi.org/project/tesserocr/

And for OpenCV: https://pypi.org/project/opencv-python/

As a side note, black and white are treated symmetrically in Tesseract, so having white digits on a black background is not a problem.

Ashargin
  • Thanks for the precious information, I'll check this out. I know that there is a specific configuration that can get me the result I want; this website can do the job for me: https://smallseotools.com/image-to-text-converter/ but I'm trying to do it using Python only, for a project. – alioua walid May 12 '19 at 14:50
  • The vast majority of online OCR tools (some are good, but not all of them) use Tesseract under the hood. This means you can theoretically do it yourself, but it can be very hard to find the right Tesseract configuration and actually get very good results. You can still get decent results, though, even with noisy documents, and since your images are that good, you should be able to get clear results. – Ashargin May 13 '19 at 08:59