I am trying to classify images based on their content. For example, I have many images like the ones below, each containing some content, in this case numeric values. I tried the OpenCV and pytesseract OCR solution proposed here: https://stackoverflow.com/a/60161328/7250310

However, this solution doesn't work on my images, and the content isn't detected. Below are my sample images:

Image 1: [image]

Image 2: [image]

Image 3: [image]

Image 4: [image]

Do you have any other ideas for achieving this? Basically, Image 1 should give the output 1, and so on.

HansHirse
Fazal

1 Answer

This simple approach works at least for the four presented images:

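import cv2
import pytesseract

images = ['4sXGS.jpg', 'Nizki.jpg', 'T0EM8.jpg', 'g2fY7.jpg']

for path in images:
    # Read the image as grayscale
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    # Inverse binary threshold using Otsu's method
    img = cv2.threshold(img, 0, 255, cv2.THRESH_OTSU + cv2.THRESH_BINARY_INV)[1]

    # --psm 10: treat the image as a single character
    text = pytesseract.image_to_string(img, config='--psm 10')
    # Strip the trailing newline and form feed that Tesseract appends
    text = text.replace('\n', '').replace('\f', '')
    print(text)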

Output:

1
2
3
4

The individual steps are:

  1. Read the image as grayscale.
  2. Inverse binary threshold the image using Otsu's method.
  3. Run pytesseract using the --psm 10 option (single character). Optionally, also add the described whitelisting to restrict recognition to digits.
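
The whitelisting mentioned in step 3 could be sketched roughly as follows. This is only an illustration, not the exact configuration from the linked answer: the helper names `build_config`, `clean`, and `read_digit` are hypothetical, and it assumes the whitelist is passed through Tesseract's `tessedit_char_whitelist` config variable. Whether the whitelist is honored may depend on the installed Tesseract version and engine.

```python
def build_config(psm=10, whitelist="0123456789"):
    """Build a Tesseract config string: page segmentation mode plus an optional whitelist."""
    config = f"--psm {psm}"
    if whitelist:
        # Assumption: the whitelist is set via the tessedit_char_whitelist variable.
        config += f" -c tessedit_char_whitelist={whitelist}"
    return config


def clean(text):
    """Drop the newline and form-feed characters Tesseract appends to its output."""
    return text.replace("\n", "").replace("\f", "").strip()


def read_digit(path):
    """OCR one image that is expected to contain a single digit (hypothetical helper)."""
    # Imported here so the two helpers above work without OpenCV/Tesseract installed.
    import cv2
    import pytesseract

    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        raise FileNotFoundError(path)
    # Same preprocessing as above: inverse binary threshold with Otsu's method.
    img = cv2.threshold(img, 0, 255, cv2.THRESH_OTSU + cv2.THRESH_BINARY_INV)[1]
    return clean(pytesseract.image_to_string(img, config=build_config()))
```

Calling `read_digit('4sXGS.jpg')` would then run the same three steps with the digit whitelist applied.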

Caveat: I use a special version of Tesseract from the Mannheim University Library.

----------------------------------------
System information
----------------------------------------
Platform:      Windows-10-10.0.19041-SP0
Python:        3.9.1
PyCharm:       2021.1.1
OpenCV:        4.5.2
pytesseract:   5.0.0-alpha.20201127
----------------------------------------
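
Since results may differ between Tesseract versions, it can help to check which binary is actually installed before comparing outputs. The following is a minimal sketch using only the standard library; the function names are hypothetical, and it assumes the first line of `tesseract --version` looks like `tesseract 4.1.1` or `tesseract v5.0.0-alpha.20201127`.

```python
import shutil
import subprocess


def parse_version_line(line):
    """Extract the version from a banner line such as 'tesseract 4.1.1'."""
    parts = line.split()
    if len(parts) < 2:
        return None
    # Some builds prefix the version with 'v'; strip it if present.
    return parts[1].lstrip("v")


def tesseract_version():
    """Return the version of the tesseract binary on PATH, or None if it is not found."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    result = subprocess.run([exe, "--version"], capture_output=True, text=True)
    # Some releases print the banner to stderr rather than stdout.
    output = (result.stdout or result.stderr).strip()
    return parse_version_line(output.splitlines()[0]) if output else None


print(tesseract_version())
```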
HansHirse
  • Thank you for sharing. Is there a Mac version of the special build that I can install? I ran the same code with regular Tesseract and it doesn't work for the digit 1 image. – Fazal Jun 01 '21 at 04:57
  • @Fazal Unfortunately, I can't give any advice on that. The "special" mostly refers to the fact that they built their own Windows installer. The underlying source code should be the common (or current) Tesseract 5.0.0.0-alpha. Maybe search for macOS build instructions for that version? What's your version of Tesseract? – HansHirse Jun 01 '21 at 06:47
  • 4.1.1 is the version I have installed. I tried to find a macOS build but couldn't find one. Maybe it's too complicated for me. – Fazal Jun 01 '21 at 16:36