
Hi, I am looking to improve the performance of pytesseract at digit recognition.

I take my raw image and split it into parts that look like this:

Image1

The size can vary.

To this I apply some pre-processing steps, like so:

import cv2
import numpy as np

# 'im' is the path to one of the cropped parts
image = cv2.imread(im, cv2.IMREAD_GRAYSCALE)
image = cv2.GaussianBlur(image, (1, 1), 0)
kernel = np.ones((5, 5), np.uint8)
result_img = cv2.blur(image, (2, 2))
result_img = cv2.dilate(result_img, kernel, iterations=1)
result_img = cv2.erode(result_img, kernel, iterations=1)

and I get this

Image2

I then pass this to pytesseract:

num = pytesseract.image_to_string(result_img, lang='eng',
                                     config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')

However, this is not good enough for me and it often gets numbers wrong.

I am looking for ways to improve. I have tried to keep this minimal and self-contained, but let me know if I have not been clear and I will elaborate.

Thank you.

  • What does the erode step give you? I would do a blur and then a compression of the dynamic range, that is, most light colors go to white and most darks go to black, leaving a rather narrow band for grays, just to make the borders less jagged (a rough sketch of this idea follows these comments). – 9000 Mar 10 '20 at 19:07
  • Can you please tell me how to do the compression of dynamic range? In fact, would you be able to take the top image and show me how you would process it so that it is recognized by tesseract? If you do, please make it an answer so I can accept! – tepsupek Mar 10 '20 at 20:27
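
A rough sketch of the blur-plus-dynamic-range-compression idea from the comment above might look like the following (the 100/160 cut-offs and the filename '1.png' are placeholders, not values from the question, and would need tuning for the actual crops):

import cv2
import numpy as np

# Sketch only: blur, then compress the dynamic range so that values below
# `low` go to black, values above `high` go to white, and the narrow band
# in between is stretched linearly. The 100/160 cut-offs are guesses.
gray = cv2.imread('1.png', cv2.IMREAD_GRAYSCALE)
blurred = cv2.GaussianBlur(gray, (3, 3), 0)

low, high = 100, 160
scaled = (blurred.astype(np.float32) - low) * (255.0 / (high - low))
compressed = np.clip(scaled, 0, 255).astype(np.uint8)

cv2.imwrite('compressed.png', compressed)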

1 Answer


You're on the right track by preprocessing the image before performing OCR, but you're using an incorrect approach. There is no reason to dilate or erode the image, since these operations are mainly used for removing small noise particles. In addition, your current output is not a binary image. It may look like it only contains black and white pixels, but it is actually a 3-channel BGR image, which is probably why you're getting incorrect OCR results.

If you look at Tesseract improve quality, you will notice that for Pytesseract to perform optimal OCR, the image needs to be preprocessed so that the text to detect is black and the background is white. To do this, we can apply Otsu's threshold to obtain a binary image and then invert it so the text is in the foreground. This gives us our preprocessed image, which we can throw into image_to_string. We use the --psm 6 configuration option to assume a single uniform block of text. Take a look at configuration options for more settings. Here are the results:

Input image -> Binary -> Invert


Result from Pytesseract OCR

8

Code

import cv2
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Load image, grayscale, Otsu's threshold, invert
image = cv2.imread('1.png')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
invert = 255 - thresh

# OCR
data = pytesseract.image_to_string(invert, lang='eng', config='--psm 6')
print(data)

cv2.imshow('thresh', thresh)
cv2.imshow('invert', invert)
cv2.waitKey()
nathancy
  • Another thing to keep in mind is the psm mode: for a single character you can use the config='--psm 10' option to improve your detection, because this mode is meant for a single character (a short sketch follows). You can find out more at https://pyimagesearch.com/2021/11/15/tesseract-page-segmentation-modes-psms-explained-how-to-improve-your-ocr-accuracy/ – Williams Bobadilla May 13 '22 at 17:34
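
For reference, a minimal sketch that combines the answer's Otsu-threshold-and-invert preprocessing with the single-character --psm 10 mode from the comment above and the digit whitelist from the question ('1.png' stands in for one of the cropped digit images):

import cv2
import pytesseract

# Sketch only: Otsu's threshold + inversion as in the answer above, then OCR
# restricted to a single character and to digits. '1.png' is a placeholder
# for one of the cropped digit images.
image = cv2.imread('1.png')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
invert = 255 - thresh

digit = pytesseract.image_to_string(
    invert,
    config='--psm 10 -c tessedit_char_whitelist=0123456789'
)
print(digit)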