Can tesseract correctly recognise underscores in images?

Question

I have pictures that look like this:

And I am trying to get the output: "_ _ _ _ _ _ _ _ _ _ c _."

I am working in Python 3.6 and tried to use Tesseract for this. What I have so far is the following code:

import pytesseract
from PIL import Image

# set tesseract file path
pytesseract.pytesseract.tesseract_cmd = "C:/Program Files/Tesseract-OCR/tesseract.exe"
# configurations
config = "--psm 10 --oem 3 -c tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyzßäöü0123456789_-"

image = Image.open("test2.png")

text = pytesseract.image_to_string(image, config=config)

However, this doesn't work. It just produces "ee" as output. With other pictures, it sometimes recognizes the correct letters, but never the underscores. I tried to whitelist them, but that didn't work either. How can this be done better? I would be grateful for any suggestions.

shaman · Answer 1 · 2021-10-31T09:54:37.777

I am currently having a similar problem.

One possible solution that I think may work (though I suppose it is heavy on performance) is to use the cv2 module to detect horizontal lines and use the detected pixel positions to fill the space in between with underscores.

You also have to find the words adjacent to the leftmost and rightmost pixels of each line, then locate those words in the string returned by pytesseract so that the underscores can be put in the right place.

Here's a nice thread about finding lines in a picture, which may be helpful: Horizontal Line detection with OpenCV

Edit: What I do now may be a bit dirty, but I use the horizontal line detection from the link above and then use cv2.putText to write a string like "QQQQQQQ" at the start position of each line. Then I search the OCR result for the recognized Qs and replace them with underscores again.
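The last step of that workaround, turning the placeholder runs back into underscores, could look roughly like this. It assumes the placeholder strings have already been drawn onto the image with cv2.putText and that OCR returned them as runs of "Q". The helper name `restore_underscores` and the `min_run` threshold are hypothetical.

```python
import re


def restore_underscores(ocr_text, placeholder="Q", min_run=3):
    """Replace each run of placeholder characters with a single underscore."""
    # Requiring at least min_run repeats avoids touching a single "Q"
    # that belongs to a real word.
    pattern = re.escape(placeholder) + "{%d,}" % min_run
    return re.sub(pattern, "_", ocr_text)


print(restore_underscores("QQQQQQQ QQQQQQQ c QQQQQQQ"))  # _ _ c _
```

If OCR misreads some of the Qs (for example as "O" or "0"), the pattern would have to be loosened accordingly.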

Armaan Priyadarshan · Answer 2 · 2022-07-13T17:54:06.787

I had a similar problem, and I looked into solving it with OpenCV rather than an OCR library, as shaman suggested. I tried horizontal line detection, but it didn't accurately count the number of underscores. It turned out that OpenCV has a LineSegmentDetector (it is available in version 4.6), which worked really well for me.

LineSegmentDetector in Opencv 3 with Python

The number of detected line segments (the length of the returned list of lines) divided by 2 gave me the number of underscores in the image, presumably because each underscore is reported as two segments, one per edge. Additionally, it took a bit of image preprocessing for it to work properly. This included thresholding, upscaling, and dilation, but those parts shouldn't be hard to figure out.
