
Thanks in advance to everyone that will answer.

I am new to OpenCV and Pytesseract, and overall very inexperienced with image processing and recognition.

I am trying to detect a digit from a PDF. For the sake of this question I will provide the image directly: Initial image

My objective is to detect the number in the colored box, which in this case is number 6. My code for preprocessing is the following:

import cv2
import numpy as np
import pytesseract
from PIL import Image
from PIL import ImageFilter, ImageEnhance

# Raw string so that "\t" in the path is not interpreted as a tab character
pytesseract.pytesseract.tesseract_cmd = r'Tesseract-OCR\tesseract.exe'


# -----Reading the image-----------------------------------------------------
img = cv2.imread('page_image.jpg')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
gray = cv2.resize(gray, (1028, 720))

thres_gray = cv2.threshold(gray, 0, 255, cv2.THRESH_OTSU)[1]
gray_inv = cv2.bitwise_not(thres_gray)
gray_test = cv2.bitwise_not(gray_inv)

out2 = cv2.bitwise_or(gray, gray, mask=gray_inv)

thresh_end = cv2.threshold(out2, 254, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

imageObject = Image.fromarray(thresh_end)
enhancer = ImageEnhance.Sharpness(imageObject)


sharpened1 = imageObject.filter(ImageFilter.SHARPEN)
sharpened2 = sharpened1.filter(ImageFilter.SHARPEN)
# sharpened2.show()

From this I obtain the following picture: Preprocessed image

At this point, since I am still learning how to detect the region of interest and crop it with OpenCV, I decided to crop the image manually to check whether the rest of my script works.

Therefore the image I pass to pytesseract is the following: Final image to read with pytesseract. I am not really sure if the image is good enough to be read, but this is the best I could get. From this I try image_to_string:

trial = pytesseract.image_to_string(sharpened2, config='--psm 13 --oem 3 -c tessedit_char_whitelist=0123456789')

I have tried many different Tesseract configurations, but none of them worked and the final output is always an empty string.

I would be really grateful if you could help me understand whether the image is not good enough or whether I am doing something wrong with the Tesseract configuration. If you could also help me crop the image correctly that would be awesome, but even detecting the number is enough for me.

Sorry for the long post and thanks again.

mbronzo
  • You could try inverting the final image so that the text is black on white. Also, the final processed image is much smaller than the original; why? – lucians Sep 22 '21 at 15:30
  • As you suggested, I tried `cv2.bitwise_not(sharpened2)`, but the Tesseract output is still empty. Regarding the size of the image, I think this is the result of the manual cropping, but I can still resize it even if that produces a blurred image – mbronzo Sep 22 '21 at 15:44
  • Take a look here if you haven't already: https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html I'll try to make a script in the meantime. Is the box around the number always orange? – lucians Sep 22 '21 at 15:48
  • Thank you a lot for the effort, I will double-check the documentation to see if I can improve. Unfortunately, different companies use different colors for the boxes, which is why I used `cv2.COLOR_BGR2GRAY`. The initial idea was to make the whole image blank, leaving just the grey box with the number, so that I didn't need to crop it. Unfortunately, that was the most I could obtain. – mbronzo Sep 22 '21 at 15:56
  • The "Preprocessed image" looks defective, please check. The pixel (sample) aspect ratio looks to be affected. – Christoph Rackwitz Sep 22 '21 at 17:54

1 Answer


Try this:

import cv2
import pytesseract
import numpy as np

pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files\\Tesseract-OCR\\tesseract.exe'

img = cv2.imread("form.jpg")

# https://stackoverflow.com/questions/10948589/choosing-the-correct-upper-and-lower-hsv-boundaries-for-color-detection-withcv
ORANGE_MIN = np.array([5, 50, 50], np.uint8)
ORANGE_MAX = np.array([15, 255, 255], np.uint8)

hsv_img = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
frame_threshed = cv2.inRange(hsv_img, ORANGE_MIN, ORANGE_MAX)
# cv2.imshow("frame_threshed", frame_threshed)

thresh = cv2.threshold(frame_threshed, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
# cv2.imshow("thresh", thresh)

cnts = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
# cv2.imshow("dilate", thresh)

for c in cnts:
    x, y, w, h = cv2.boundingRect(c)
    ROI = thresh[y:y + h, x:x + w]

    ratio = 100.0 / ROI.shape[1]
    dim = (100, int(ROI.shape[0] * ratio))

    resizedCubic = cv2.resize(ROI, dim, interpolation=cv2.INTER_CUBIC)
    threshGauss = cv2.adaptiveThreshold(resizedCubic, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 255, 17)

    cv2.imshow("ROI", threshGauss)

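    raw = pytesseract.image_to_string(threshGauss, lang='eng', config="--oem 3 --psm 13").strip()
    # Guard the conversion: int() raises ValueError on empty or non-numeric OCR output
    text = int(raw) if raw.isdecimal() else None
    print(f"Detected text: {text}")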


cv2.waitKey(0)

I used the HSV color space to detect the orange color first. Then, once the ROI was clearly visible, I applied "classic" image pre-processing steps. Take a look at the link in the code comment above to understand how to select colors other than orange.
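For boxes in other colors, one possible starting point is to derive the bounds from a sample color. The sketch below is an illustration, not part of the original script: `hsv_bounds_for_bgr` and its tolerance parameters are hypothetical names, and it relies on OpenCV storing 8-bit hue as 0-179 and saturation/value as 0-255. It does not handle the hue wrap-around that reds need.

```python
import colorsys


def hsv_bounds_for_bgr(b, g, r, hue_tol=10, min_sat=50, min_val=50):
    """Return (lower, upper) HSV triples in OpenCV's 8-bit scale for a sample BGR color.

    Hypothetical helper; hue_tol, min_sat and min_val are illustrative defaults.
    """
    # colorsys works with floats in [0, 1] and RGB channel order
    h, _s, _v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    hue = int(round(h * 180)) % 180  # OpenCV 8-bit hue runs from 0 to 179
    # Clamp instead of wrapping, so colors near red (hue ~0 or ~179) lose part of the range
    lower = (max(hue - hue_tol, 0), min_sat, min_val)
    upper = (min(hue + hue_tol, 179), 255, 255)
    return lower, upper


# A saturated orange sample, given in BGR order as OpenCV loads it
lower, upper = hsv_bounds_for_bgr(0, 128, 255)
print(lower, upper)
```

The tuples can be wrapped with `np.array(lower, np.uint8)` and `np.array(upper, np.uint8)` before being passed to `cv2.inRange`, as in the script above.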

I also resized the ROI a bit.
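Because the loop resizes every contour to a width of 100 pixels, small specks in the mask get enlarged and sent to OCR too. Here is a minimal sketch of one way to skip them, assuming bounding boxes in the `(x, y, w, h)` form returned by `cv2.boundingRect`. `keep_plausible_boxes` and its size thresholds are hypothetical and would need tuning per document.

```python
def keep_plausible_boxes(boxes, min_w=20, min_h=20):
    """Drop boxes smaller than min_w x min_h and sort the rest by area, largest first."""
    kept = [box for box in boxes if box[2] >= min_w and box[3] >= min_h]
    return sorted(kept, key=lambda box: box[2] * box[3], reverse=True)


# Example boxes in (x, y, w, h) form; the first one stands in for a noise speck
boxes = [(3, 4, 2, 2), (300, 10, 25, 30), (120, 40, 60, 45)]
candidates = keep_plausible_boxes(boxes)
print(candidates)
```

In the loop, `boxes` could be built with `[cv2.boundingRect(c) for c in cnts]`, and only the entries in `candidates` would then be cropped and resized.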

POC

lucians
  • I would really like to thank you. Sorry for the late reply, but it took some time to understand what you did and how. I don't really know why, but initially Tesseract was detecting "a" instead of the number "6"; when I changed the language from English to French, it yielded the correct result. Your help was really valuable and it made me understand a little bit more about image processing! – mbronzo Sep 23 '21 at 07:44
  • I use eng as the language in my code. Try downloading the "best" version of the tessdata. Also, take a look at my profile to see other answers/questions about this topic. Also search for user "nathancy" (who answered some of my questions), who is a monster at OpenCV. – lucians Sep 23 '21 at 07:59
  • Thank you a lot for all the help. I am testing the code with other PDFs of the same format (still with orange), but unfortunately it is not working very well, as Tesseract often detects letters instead of digits. Other times the region of interest is empty. Do you think this is more an issue with the preprocessing or with Tesseract itself? Most of the time the digits look easily recognizable, but Tesseract fails – mbronzo Sep 23 '21 at 10:31
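One thing that may help with letters being returned instead of digits is to combine the digit whitelist from the question with a guarded conversion. This is a minimal sketch, assuming only the Tesseract options already used in this thread (`--oem`, `--psm`, `tessedit_char_whitelist`). `digit_config` and `parse_digits` are hypothetical helper names, and whether the whitelist is honored can depend on the Tesseract version and engine mode.

```python
def digit_config(psm=13, oem=3):
    """Build a Tesseract config string that restricts output to the digits 0-9."""
    return f"--oem {oem} --psm {psm} -c tessedit_char_whitelist=0123456789"


def parse_digits(raw):
    """Return the OCR output as an int, or None when it is empty or not purely decimal."""
    cleaned = raw.strip()
    return int(cleaned) if cleaned.isdecimal() else None


print(digit_config())
print(parse_digits("6\n"), parse_digits("a\n"), parse_digits(""))
```

The config string would be passed as `config=digit_config()` to `pytesseract.image_to_string`, and the returned string fed to `parse_digits`, so that a misread ROI yields `None` instead of raising an exception.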