Tesseract OCR not recognizing perfectly cut out characters

Question

I am working on a project that needs to read characters off license plates. Given an image of just the license plate, I use OpenCV to segment the characters and get their bounding boxes. The individual characters are then cut out, and I'd like to use Tesseract to recognize them.

The problem: I'm getting really bad results, even though the characters seem perfectly cut out by OpenCV. I've included some example images below. Tesseract either fails to detect any character at all or detects entirely wrong characters. I don't mean it confuses a 0 with an O, or a 1 with an l; it detects, for example, a 7 when a 4 is clearly visible.

Is there anything I am doing wrong, or have I misunderstood the options I am setting? Help would be greatly appreciated, as I'm not seeing why Tesseract shouldn't recognize these characters.

(I'm using Tesseract OCR v4, in the LSTM mode)

score 0 · Answer 1 · answered Feb 04 '21 at 13:25

You can recognize the characters with pytesseract in two steps:

1. Apply an adaptive threshold
2. Set the page segmentation mode to 6

1. Adaptive-threshold

Here, the algorithm determines the threshold for a pixel based on a small region around it. So we get different thresholds for different regions of the same image which gives better results for images with varying illumination. source
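To make the quoted idea concrete, here is a minimal pure-Python sketch of inverse mean adaptive thresholding. It is an illustration only, not OpenCV's implementation: the function names, the tiny synthetic image, and the choice of replicated borders are assumptions made for this example. Each pixel is compared against the mean of its own neighbourhood minus a constant, so dark strokes are found under both bright and dim lighting.

```python
def adaptive_mean_threshold_inv(gray, block_size, c, max_value=255):
    """Inverse mean adaptive threshold on a list-of-rows grayscale image.

    A pixel becomes max_value when it is at or below
    (mean of its block_size x block_size neighbourhood) - c, else 0.
    Border pixels reuse the nearest edge value (an assumption of this sketch).
    """
    if block_size < 3 or block_size % 2 == 0:
        raise ValueError("block_size must be an odd number >= 3")
    height, width = len(gray), len(gray[0])
    radius = block_size // 2
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = 0
            for dy in range(-radius, radius + 1):
                yy = min(max(y + dy, 0), height - 1)  # clamp to the image
                for dx in range(-radius, radius + 1):
                    xx = min(max(x + dx, 0), width - 1)
                    total += gray[yy][xx]
            local_threshold = total / (block_size * block_size) - c
            out[y][x] = max_value if gray[y][x] <= local_threshold else 0
    return out


def global_threshold_inv(gray, t, max_value=255):
    """Inverse global threshold: one cut-off value for the whole image."""
    return [[max_value if px <= t else 0 for px in row] for row in gray]


# Synthetic example: the background fades from bright (left) to dark (right),
# with two darker "strokes" at columns 2 and 7.
row = [240, 220, 150, 180, 160, 140, 120, 50, 80, 60]
image = [row[:] for _ in range(3)]

adaptive = adaptive_mean_threshold_inv(image, block_size=3, c=10)
print(adaptive[0])
```

On this image the left stroke (150) is brighter than the right-hand background (140 down to 60), so no single global cut-off can select both strokes without also selecting background, while the local comparison does.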



Adaptive-threshold results (images not preserved) and the `pytesseract` output for each input:

Input image	`pytesseract` result
four.png	4
nine.png	9

Code:

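import cv2
import pytesseract

img_lst = ["four.png", "nine.png"]

for pth in img_lst:
    img = cv2.imread(pth)
    # Normalize every character crop to the same size
    img = cv2.resize(img, (28, 28))
    gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Mean adaptive threshold with a 47x47 neighbourhood and C=2;
    # THRESH_BINARY_INV inverts the result, so dark pixels come out white
    thr = cv2.adaptiveThreshold(gry, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                cv2.THRESH_BINARY_INV, 47, 2)
    # --psm 6: assume a single uniform block of text; "digits" limits output to digits
    txt = pytesseract.image_to_string(thr, config="--psm 6 digits")
    print(txt)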
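
The code above passes `--psm 6`, which tells Tesseract to assume a single uniform block of text. Tesseract also has modes for a single line, a single word, and a single character, and it may be worth comparing them on your own crops. The sketch below is one way to do that. The helper names `build_config` and `compare_modes` are hypothetical, and which mode works best for a given plate is an assumption to verify, not something the answer above claims.

```python
# Page segmentation modes as described by `tesseract --help-psm`.
PSM_MODES = {
    6: "Assume a single uniform block of text",
    7: "Treat the image as a single text line",
    8: "Treat the image as a single word",
    10: "Treat the image as a single character",
}


def build_config(psm, digits_only=True):
    """Build the config string passed to pytesseract (hypothetical helper).

    digits_only=True appends the "digits" config used in the answer;
    set it to False for plates that also contain letters.
    """
    if psm not in PSM_MODES:
        raise ValueError(f"unsupported psm for this sketch: {psm}")
    config = f"--psm {psm}"
    if digits_only:
        config += " digits"
    return config


def compare_modes(image, modes=(6, 7, 8, 10), digits_only=True):
    """Run Tesseract once per mode and return {psm: recognized text}."""
    import pytesseract  # imported here so build_config works without it

    return {
        psm: pytesseract.image_to_string(
            image, config=build_config(psm, digits_only)
        ).strip()
        for psm in modes
    }


print(build_config(6))
```

Calling `compare_modes(thr)` on a thresholded crop from the loop above would return one result per mode, which makes it easy to see whether mode 6 or the single-character mode 10 is more reliable for your images.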
