Tesseract sometimes returns an empty string if an unknown character occurs

Question

I'm using Tesseract to detect a word written in black on a white background. In some images, an info symbol appears after the black word. I'm not interested in detecting this symbol, only the word. Sometimes the info symbol (with a circle around it) is detected as 0 or O, which is fine. But in other cases (probably when Tesseract doesn't know how to handle this sign) it just returns an empty string, so the word is not returned either. I used the code given here and also tried the configuration suggested here.

from PIL import Image
import pytesseract
import argparse
import cv2
import os
import numpy as np

# construct the argument parse and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--image", required=True,
    help="path to input image to be OCR'd")
ap.add_argument("-p", "--preprocess", type=str, default="thresh",
    help="type of preprocessing to be done")
args = vars(ap.parse_args())

# load the example image and convert it to grayscale
image = cv2.imread(args["image"])
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# check to see if we should apply thresholding to preprocess the
# image
if args["preprocess"] == "thresh":
    gray = cv2.threshold(gray, 0, 255,
        cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]

# make a check to see if median blurring should be done to remove
# noise
elif args["preprocess"] == "blur":
    gray = cv2.medianBlur(gray, 3)

# write the grayscale image to disk as a temporary file so we can
# apply OCR to it
filename = "{}.png".format(os.getpid())
cv2.imwrite(filename, gray)

# apply OCR directly to the in-memory grayscale image (the temporary
# file written above is not read back), then delete the temporary file
text = pytesseract.image_to_string(gray, config='--psm 7')
os.remove(filename)
print("Output: " + text)

If anyone has an idea what else I could do, I'd be very grateful!

Relevant [can-tesseract-be-trained-for-non-font-symbols](https://stackoverflow.com/questions/43450237/can-tesseract-be-trained-for-non-font-symbols) — stovfl, Aug 30 '19 at 10:03
Thanks for your proposal. Unfortunately I don't have a dataset to train a network. I also don't need to detect the info symbol, I just don't want the word to get lost :-/ — csi, Sep 02 '19 at 07:44
My thought was that `tesseract` sees the symbol as an **unknown** character, discards the result, and returns an ***"empty string"***. — stovfl, Sep 02 '19 at 07:54
Yes, I think so as well, but is there any way to get the string without discarding the result, even if there is an unknown character? Or to get back a question mark or something similar for a symbol it doesn't know how to handle? — csi, Sep 04 '19 at 07:04
Have you tried other `--psm X` options like `11 = Sparse text.` [tesseract-ocr](https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc) — stovfl, Sep 04 '19 at 08:02
Relevant [opencv - template_matching](https://docs.opencv.org/3.3.1/d4/dc6/tutorial_py_template_matching.html), remove the `sign` before doing `OCR`. — stovfl, Sep 04 '19 at 13:23
I got it! I have to use config='-psm 7' with only one "-" and not two; I think this was a small error in the other post. Thanks for the hint to try the other psm options, that is why I started to google this again :-D It works best with the values 6 and 7. 1 to 5 don't work, and the others lose my blank character. Thanks a lot, I'm very happy! — csi, Sep 04 '19 at 16:24

score 0 · Accepted Answer · answered Sep 16 '19 at 14:00

Solved: the config has to be config='-psm 7', with only one "-".
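
Which spelling of the flag works appears to depend on the installed Tesseract version (that is my assumption; the thread does not state the version). A minimal sketch that avoids guessing is to try both spellings and keep the first non-empty result. The helper name `first_nonempty_ocr` and its parameters are hypothetical; by default it calls `pytesseract.image_to_string`, as in the question.

```python
def first_nonempty_ocr(image, configs=("--psm 7", "-psm 7"), ocr=None):
    """Try each config string in order and return (text, config) for the
    first one that yields non-empty text, or ("", None) if none does."""
    if ocr is None:
        # Imported lazily so the helper can also be used with a stand-in OCR function.
        import pytesseract
        ocr = pytesseract.image_to_string

    for config in configs:
        text = ocr(image, config=config).strip()
        if text:
            return text, config
    return "", None
```

In the script from the question this would be called as `text, used = first_nonempty_ocr(gray)`. If an installation rejects one of the spellings with an error instead of returning an empty string, the call inside the loop would need a try/except around it.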

csi
