I have an image I want to extract text from using Tesseract and Python. I only want to recognize a certain set of characters, so I use tessedit_char_whitelist=1234567890CBDE as a config. However, with the whitelist in place, Tesseract no longer seems to recognize the gaps between the lines. Is there some character I can add to the whitelist so that it recognizes the lines as individual pieces of text again?
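For reference, here is a minimal sketch of how the whitelist could be combined with the other flags into a single config string. The helper name `build_config` is hypothetical and only for illustration; the `--oem 3 --psm 6` defaults are the values used in my code.

```python
def build_config(whitelist=None, oem=3, psm=6):
    """Assemble a Tesseract config string (hypothetical helper, for illustration)."""
    parts = [f"--oem {oem}", f"--psm {psm}"]
    if whitelist:
        # -c sets a Tesseract config variable; here, the character whitelist
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


# Config with the whitelist enabled
custom_config = build_config("1234567890CBDE")
print(custom_config)
```

The resulting string would then be passed as the `config` argument of `pytesseract.image_to_data`.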
Here is the image after the whitelist:
Here is the image before the whitelist:
Here is the code responsible for drawing the boxes and recognizing the characters, in case you're wondering:
# Imports needed by this snippet
import cv2
import pytesseract
from pytesseract import Output

# Configure parameters for Tesseract
# whitelist = "-c tessedit_char_whitelist=1234567890CBDE"
custom_config = r'--oem 3 --psm 6 '
# Feed the thresholded image (produced earlier) to Tesseract
details = pytesseract.image_to_data(threshold_img, output_type=Output.DICT, config=custom_config, lang='eng')
print(details.keys())
total_boxes = len(details['text'])
# Minimum confidence; 0 keeps every recognized box and skips entries reported as -1
CONFIDENCE = 0
for sequence_number in range(total_boxes):
    # float() copes with both integer and fractional confidence values
    if float(details['conf'][sequence_number]) >= CONFIDENCE:
        (x, y, w, h) = (details['left'][sequence_number], details['top'][sequence_number], details['width'][sequence_number], details['height'][sequence_number])
        threshold_img = cv2.rectangle(threshold_img, (x, y), (x + w, y + h), (0, 255, 0), 2)
# Display the image
cv2.imshow('captured text', threshold_img)
cv2.imwrite("before.png", threshold_img)
# Keep the output window open until the user presses a key
cv2.waitKey(0)
# Close all open windows
cv2.destroyAllWindows()
EDIT:
Here is the original image I want to extract the text from, with the goal of writing it to a matrix:
The desired matrix would take the following form:
content = [
    ["1C", "55", "55", "E9", "BD"],
    # ...
    ["1C", "1C", "55", "BD", "BD"]
]
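To make the goal concrete, here is a rough sketch of how such a matrix might be built from the dictionary returned by `pytesseract.image_to_data`. It groups words by the `block_num`, `par_num` and `line_num` fields and orders them by `left`. The function name `details_to_matrix` and the sample data are made up for illustration, and the sketch assumes Tesseract reports the line boundaries correctly, which is exactly what breaks for me with the whitelist.

```python
from collections import defaultdict


def details_to_matrix(details, min_conf=0):
    """Group words from an image_to_data dict into rows, one row per text line."""
    rows = defaultdict(list)
    for i, text in enumerate(details["text"]):
        word = text.strip()
        # Skip empty entries (page/block/line records) and low-confidence words
        if not word or float(details["conf"][i]) < min_conf:
            continue
        key = (details["block_num"][i], details["par_num"][i], details["line_num"][i])
        rows[key].append((details["left"][i], word))
    # Order rows by their (block, paragraph, line) key and words from left to right
    return [[w for _, w in sorted(words)] for _, words in sorted(rows.items())]


# Hand-made sample in the shape of image_to_data output (not real OCR results)
sample = {
    "text": ["", "1C", "55", "", "BD", "1C"],
    "conf": [-1, 96, 91, -1, 88, 95],
    "block_num": [1, 1, 1, 1, 1, 1],
    "par_num": [1, 1, 1, 1, 1, 1],
    "line_num": [1, 1, 1, 2, 2, 2],
    "left": [0, 10, 60, 0, 60, 10],
}
content = details_to_matrix(sample)
print(content)
```

Sorting by `left` within each line keeps the column order stable even if the words are reported out of order.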