Tesseract doesn't recognize individual text segments after whitelisting

Question

I have an image that I want to extract text from using Tesseract and Python. I only want to recognize a certain set of characters, so I pass tessedit_char_whitelist=1234567890CBDE as a config option. However, Tesseract no longer seems to recognize the gaps between the lines once the whitelist is set. Is there a character I can add to the whitelist so that it recognizes the text as individual segments again?

Here is the image after the whitelist:

Here is the image before the whitelist:

Here is the code responsible for drawing the boxes and recognizing the characters, in case you're wondering:


import cv2
import pytesseract
from pytesseract import Output

# Configure parameters for Tesseract
# whitelist = "-c tessedit_char_whitelist=1234567890CBDE"
custom_config = r'--oem 3 --psm 6'
# Feed the image to Tesseract (threshold_img is the thresholded input image, prepared earlier)
details = pytesseract.image_to_data(threshold_img, output_type=Output.DICT, config=custom_config, lang='eng')
print(details.keys())

total_boxes = len(details['text'])
for sequence_number in range(total_boxes):
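    # Keep boxes with confidence >= CONFIDENCE (layout-only entries report -1)
    CONFIDENCE = 0
    # conf may be an int, a float, or a string depending on the pytesseract version
    if int(float(details['conf'][sequence_number])) >= CONFIDENCE: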
        (x, y, w, h) = (details['left'][sequence_number], details['top'][sequence_number], details['width'][sequence_number],  details['height'][sequence_number])
        threshold_img = cv2.rectangle(threshold_img, (x, y), (x + w, y + h), (0, 255, 0), 2)
# display image
cv2.imshow('captured text', threshold_img)
cv2.imwrite("before.png", threshold_img)
# Maintain output window until user presses a key
cv2.waitKey(0)
# Destroying present windows on screen
cv2.destroyAllWindows()

EDIT:

Here is the original image I want to extract the text from, with the goal of writing it to a matrix:

The desired matrix would take the following form:


content = [
    ["1C", "55", "55", "E9", "BD"],
    # ...
    ["1C", "1C", "55", "BD", "BD"]
]
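
For reference, here is a minimal sketch of how the `details` dictionary could be turned into such a matrix once Tesseract reports each tuple as its own word. It assumes the standard `image_to_data` keys (`text`, `conf`, `block_num`, `par_num`, `line_num`, `left`); the function name `details_to_matrix` and the `min_conf` parameter are made up for illustration.

```python
from collections import defaultdict


def details_to_matrix(details, min_conf=0):
    """Group word boxes from an image_to_data-style dict into rows of text."""
    lines = defaultdict(list)
    for i, text in enumerate(details["text"]):
        text = text.strip()
        # conf may arrive as int, float, or str depending on the pytesseract version
        conf = float(details["conf"][i])
        if not text or conf < min_conf:
            continue  # skip layout-only entries and low-confidence words
        # Words that share block, paragraph, and line numbers belong to one row
        key = (details["block_num"][i], details["par_num"][i], details["line_num"][i])
        lines[key].append((details["left"][i], text))
    # Rows in Tesseract's reading order, words left to right by x position
    return [[t for _, t in sorted(words)] for _, words in sorted(lines.items())]
```

This only helps when the boxes are already separated; if the whitelist causes several tuples to be merged into one box, the merged text ends up in a single cell.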

Please provide original image (without boxes) and desired output — user898678, Feb 06 '21 at 15:51

score 1 · Accepted Answer · answered Feb 07 '21 at 15:07

One solution is:

1. Take each tuple individually and upsample it by a factor of 2
2. Apply a threshold
3. Recognize the text with the page segmentation mode set to 6


Recognized text for each row of tuples (the cropped tuple images and their thresholded versions are not reproduced here):

Row	Col 1	Col 2	Col 3	Col 4	Col 5
1	1C	55	55	E9	BO
2	1C	1C	55	BO	1C
3	1C	55	BO	55	IC
4	1C	BD	50	1C	1C
5	1C	1C	55	BD	BD

The idea is to take each tuple separately, upsample it, and then apply an inverse binary threshold. Tesseract misread a few tuples because of the font: for instance, the character D looks like an O, so BD was read as BO. If you need 100% accuracy, I suggest training Tesseract on this font. Also, make sure you try other page segmentation modes.
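
The cropping step can be sketched without any image library. Assuming the tuples sit on an evenly spaced 5x5 grid that fills the image, a hypothetical helper (`grid_cells` is not part of OpenCV or pytesseract) can compute the crop bounds for every cell. Each edge is derived from the full image size, so nothing is lost to rounding when the size is not divisible by 5.

```python
def grid_cells(height, width, rows=5, cols=5):
    """Return a rows x cols nested list of (y0, y1, x0, x1) crop bounds.

    Assumes the tuples sit on an evenly spaced grid that fills the image.
    """
    # Derive every edge from the full size so rounding error never accumulates
    y_edges = [r * height // rows for r in range(rows + 1)]
    x_edges = [c * width // cols for c in range(cols + 1)]
    return [
        [(y_edges[r], y_edges[r + 1], x_edges[c], x_edges[c + 1])
         for c in range(cols)]
        for r in range(rows)
    ]
```

With a grayscale NumPy image `gry`, each cell could then be cropped as `gry[y0:y1, x0:x1]` before upsampling and thresholding.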

Here is the array output:

[['1C', '55', '55', 'E9', 'BO'], ['1C', '1C', '55', 'BO', '1C'], ['1C', '55', 'BO', '55', 'IC'], ['1C', 'BD', '50', '1C', '1C'], ['1C', '1C', '55', 'BD', 'BD']]

Code:

import cv2
import pytesseract

img = cv2.imread("IVemF.png")
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
(h, w) = gry.shape[:2]
s_idx1 = 0  # start index1
e_idx1 = int(h/5)  # end index1
cfg = "--psm 6"
res = []

for _ in range(0, 5):
    s_idx2 = 0  # start index2
    e_idx2 = int(w / 5)  # end index2
    row = []
    for _ in range(0, 5):
        crp = gry[s_idx1:e_idx1, s_idx2:e_idx2]
        (h_crp, w_crp) = crp.shape[:2]
        crp = cv2.resize(crp, (w_crp*2, h_crp*2))
        thr = cv2.threshold(crp, 0, 255,
                            cv2.THRESH_BINARY_INV |
                            cv2.THRESH_OTSU)[1]
        txt = pytesseract.image_to_string(thr,
                                          config=cfg)
        # Drop the trailing newline and form feed that Tesseract appends
        txt = txt.strip().upper()
        row.append(txt)
        print(txt)
        s_idx2 = e_idx2
        e_idx2 = s_idx2 + int(w/5)
        # Show each thresholded tuple; press any key to move on to the next one
        cv2.imshow("thr", thr)
        cv2.waitKey(0)
    res.append(row)
    s_idx1 = e_idx1
    e_idx1 = s_idx1 + int(h/5)

print(res)
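
The remaining misreads can be patched after recognition. The sketch below is one possible approach, not part of the code above: it assumes the only valid characters are those in the question's whitelist, takes the O-to-D substitution from the explanation above, and adds I-to-1 as a further guess. The names `LOOKALIKES`, `fix_matrix`, and `invalid_cells` are made up for illustration.

```python
# Assumed look-alike substitutions; adjust them for your font
LOOKALIKES = {"O": "D", "I": "1"}
# Characters from the whitelist in the question
ALLOWED = set("1234567890CBDE")


def fix_matrix(matrix):
    """Replace look-alike characters in every cell of the matrix."""
    return [
        ["".join(LOOKALIKES.get(ch, ch) for ch in cell.upper()) for cell in row]
        for row in matrix
    ]


def invalid_cells(matrix):
    """Return (row, col) positions whose text still has non-whitelisted characters."""
    return [
        (r, c)
        for r, row in enumerate(matrix)
        for c, cell in enumerate(row)
        if not set(cell) <= ALLOWED
    ]
```

A misread that produces another whitelisted character cannot be detected this way, so the result is still worth checking by eye.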

Thank you very much, works very well with a few obvious replacements — TheFibonacciEffect, Feb 07 '21 at 16:31
You're welcome. You are right, maybe you would like to train tesseract for the particular tuples. — Ahmet, Feb 07 '21 at 16:32
