Pytesseract OCR is not recognizing small words because of horizontal line

Question

I've just started exploring pytesseract. Here's the issue I'm facing:

I have the following input image.

Now, when I try running OCR on this, I get the following output:

Thanks for signing up. Now you too can pick your favorite pillows

Option AB

After trying it out on multiple image samples, I can safely conclude the following:

It is not a non-dictionary-word penalty.
It omits short words, almost as though a minimum width is in effect and any line narrower than that is vetoed.
It only happens if the input image has the bounding rectangle around it. If I remove the rectangle from the input image, I get the correct output.

i.e. on the following image:

I get the following output:

Thanks for signing up. Now you too can pick your favorite pillows

Option

Opon

Option AB

I can't figure out where I'm going wrong. Here's the code I'm using:

from PIL import Image
import pytesseract
import cv2

# Load the image and convert it to grayscale.
image = cv2.imread('testImage.png')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
# Binarize: pixels brighter than 240 become white, everything else black.
gray = cv2.threshold(gray, 240, 255, cv2.THRESH_BINARY)[1]

# Save the preprocessed image, then reload it with PIL for OCR.
filename = "intermediate.png"
cv2.imwrite(filename, gray)

im = Image.open(filename)
text = pytesseract.image_to_string(im)
print(text)

I've tried tinkering around with some of the config parameters (crunch_del_min_width, language_model_min_compound_length, and a few others) too, but nothing helped.

score 2 · Answer 1 · answered Jun 11 '20 at 00:27

Solution found here: Pytesseract OCR multiple config options

My code originally looked like yours (line 1 below) and I ran into the same problem. What worked for me was passing --psm 10 in the config parameter, which tells Tesseract to treat the image as a single character.

Code that sometimes returns no text:

line 1 : text = pytesseract.image_to_string(cropped)

Added code on the next line:

line 2 : text = text if text else pytesseract.image_to_string(cropped, config='--psm 10')

The first line attempts to extract sentences. If it succeeds, the second line leaves the value unchanged. If the first line comes back empty, the second line retries in single-character mode, which lets shorter words through.
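The two lines above can be wrapped in a small helper. This is only a sketch: the name ocr_with_fallback and its ocr and fallback_config parameters are made up for illustration, and the strip() check is an addition to the original `if text` test so that output consisting only of whitespace or a trailing form feed also triggers the retry.

```python
def ocr_with_fallback(image, ocr=None, fallback_config="--psm 10"):
    """Run OCR with default settings, then retry with fallback_config
    if the first pass produced no visible text.

    `ocr` is a hypothetical hook: any callable with the same shape as
    pytesseract.image_to_string(image, config=...). It defaults to the
    real pytesseract function.
    """
    if ocr is None:
        # Imported lazily so the helper can also be driven by a stub.
        import pytesseract
        ocr = pytesseract.image_to_string

    text = ocr(image)
    # Treat whitespace-only output as "nothing recognized".
    if text and text.strip():
        return text
    # Second attempt: by default, single-character mode as in the answer.
    return ocr(image, config=fallback_config)
```

With the variable from the answer, the call would be text = ocr_with_fallback(cropped). Note that the retry replaces the result only when the first pass finds nothing at all; it does not recover short words that were dropped from an otherwise successful pass.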

Alternatively, if you only want to catch small words:

line 1 : text = pytesseract.image_to_string(cropped, config='--psm 10')
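Since the linked solution is about passing multiple config options, here is a minimal sketch of building one config string that combines a page segmentation mode with -c variable settings such as the crunch_del_min_width mentioned in the question. The helper name build_config is hypothetical, and the value used in the comment is purely illustrative; nothing in this thread shows that any particular variable fixes the problem.

```python
def build_config(psm=None, variables=None):
    """Assemble a Tesseract config string for pytesseract's `config` argument.

    psm       -- page segmentation mode number, or None to leave the default
    variables -- dict of Tesseract variable names to values, each emitted
                 as "-c name=value" (values are assumed to contain no spaces)
    """
    parts = []
    if psm is not None:
        parts.append(f"--psm {psm}")
    for name, value in (variables or {}).items():
        parts.append(f"-c {name}={value}")
    return " ".join(parts)


# Illustrative value only, not a recommended setting:
# build_config(psm=10, variables={"crunch_del_min_width": 1})
```

The resulting string can be passed as config=build_config(...) to pytesseract.image_to_string in place of the literal '--psm 10'.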
