What could I do to improve my OCR result using pytesseract?

Question

I am trying to apply OCR using OpenCV and Python-tesseract to convert the following image to text: Original image.

But tesseract has not yet managed to read the image correctly. It reads "uleswylly Bie7 Srp a7" instead.

I have taken the following steps to pre-process the image before I feed it to tesseract:

First I upscale the image:

# Image scaling
def set_image_dpi(img):
    # Get current dimensions of the image
    height, width = img.shape[:2]

    # Define scale factor
    scale_factor = 6

    # Calculate new dimensions
    new_height = int(height * scale_factor)
    new_width = int(width * scale_factor)

    # Resize image
    return cv2.resize(img, (new_width, new_height))

Image result: result1.png

Normalize the image:

# Normalization
norm_img = np.zeros((img.shape[0], img.shape[1]))
img = cv2.normalize(img, norm_img, 0, 255, cv2.NORM_MINMAX)

Image result: result2.png

Then I remove some noise:

# Remove noise
def remove_noise(img):
    return cv2.fastNlMeansDenoisingColored(img, None, 10, 10, 7, 15)

Image result: result3.png

Get the grayscale image:

# Get grayscale
def get_grayscale(img):
    return cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

Image result: result4.png

Apply thresholding:

# Thresholding
def thresholding(img):
    # With THRESH_OTSU the threshold is computed automatically, so the 150 passed here is ignored
    return cv2.threshold(img, 150, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

Image result: result5.png

Invert the image color:

# Invert the image
def invert(img):
    return cv2.bitwise_not(img)

Image result: result6.png

Finally I pass the image to pytesseract:

# Pass preprocessed image to pytesseract
text = pytesseract.image_to_string(img)
print("Text found: " + text)

pytesseract output: "uleswylly Bie7 Srp a7"

I would like to improve my pre-processing so that pytesseract can actually read the image. Any help would be greatly appreciated!

Thanks in advance,

Steenert

Resize as you did, then threshold on the white letters using cv2.inRange(). That should give just the letters on a black background. Then invert that and do pytesseract on that. — fmw42, Apr 17 '23 at 19:29

score 3 · Accepted Answer · answered Apr 17 '23 at 20:38

The problem is a bit challenging to solve without overfitting the solution to this particular image...

Let us assume that the text is bright, colorless and surrounded by colored pixels. We may also assume that the background is relatively homogeneous.

We may start with result3.png and use the following stages:

1. Add padding with the color of the top-left pixel.
   The padding prepares the image for floodFill (it is required because some colored pixels touch the image margins).
2. Fill the background with a light blue color.
   Note that the selected color is a bit of overfitting, because its saturation level needs to be close to that of the red pixels.
3. Convert from BGR to HSV color space and extract the saturation channel.
4. Apply thresholding (use cv2.THRESH_OTSU for automatic thresholding).
5. Apply pytesseract.image_to_string to the thresholded image.

Code sample:

import cv2
import numpy as np
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'  # May be required when using Windows

img = cv2.imread('result3.png')  # Read result3.png

# Add padding with the color of the top left pixel
pad_color = img[0, 0, :]
padded_img = np.full((img.shape[0]+10, img.shape[1]+10, 3), pad_color, np.uint8)
padded_img[5:-5, 5:-5, :] = img

# Fill the background with a light blue color (BGR order); pixels within 10 levels per channel of their neighbors are filled
cv2.floodFill(padded_img, None, (0, 0), (255, 100, 100), loDiff=(10, 10, 10), upDiff=(10, 10, 10))
cv2.imwrite('result7.png', padded_img)

# Convert from BGR to HSV color space, and extract the saturation channel.
hsv = cv2.cvtColor(padded_img, cv2.COLOR_BGR2HSV)
s = hsv[:, :, 1]
cv2.imwrite('result8.png', s)

# Apply thresholding (use `cv2.THRESH_OTSU` for automatic thresholding)
thresh = cv2.threshold(s, 0, 255, cv2.THRESH_OTSU)[1]
cv2.imwrite('result9.png', thresh)

# Pass preprocessed image to PyTesseract
text = pytesseract.image_to_string(thresh, config="--psm 6")
print("Text found: " + text)

Output:
Text found: Jules -Lv: 175 -P.17

result7.png (after floodFill):

result8.png (after extracting the saturation channel):

result9.png (after thresholding):

This worked perfectly! I did not think about adding padding and adjusting the image saturation before thresholding. Thank you — Steenert, Apr 17 '23 at 21:29
