
I want to do handwritten text recognition with the pytesseract library, reading a single numerical character from images that average 43 × 45 pixels. Here are some sample images:
image 1 image 2 image 3

expected result:

9
1
4

I want to get a single numerical character from the image.

I've tried the code below:

import pytesseract

# loop through images
print(pytesseract.image_to_string("text.jpg", config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789'))

In practice, though, I get less than 50% accuracy, sometimes much lower: some digits are read correctly, some images are read as two characters, and some are not read at all.
When I remove the -c tessedit_char_whitelist=0123456789 configuration, I get the characters 4, \, and the letter g.
How can I make pytesseract treat each image as a single numerical character, instead of relying on a whitelist that still reads the text as alphanumeric?

PS: I know that OCR can't be 100% accurate, but the accuracy should at least be improvable.

ircham

1 Answer


According to this GitHub issue, Tesseract 4.0 does not support character whitelists with the LSTM model. You can fix this by upgrading Tesseract to version 4.1, rather than falling back to the legacy model (which is selected with the --oem flag).
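As a rough sketch of how the version check could drive the flags: the helper names `digit_config` and `read_digit` below are hypothetical, not part of pytesseract. It assumes that `--oem 0` selects the legacy engine and that the installed traineddata includes a legacy model, which is not the case for every language pack.

```python
import re


def digit_config(version, psm=10):
    """Build a config string for single-digit OCR (hypothetical helper).

    version: (major, minor) tuple of the installed Tesseract.
    """
    whitelist = "-c tessedit_char_whitelist=0123456789"
    if tuple(version[:2]) >= (4, 1):
        oem = 3  # default engine; 4.1+ honours the whitelist with LSTM
    else:
        oem = 0  # legacy engine (assumes legacy traineddata is installed)
    return f"--psm {psm} --oem {oem} {whitelist}"


def read_digit(image_path):
    """Run OCR on one image using a version-appropriate config."""
    import pytesseract  # imported here so digit_config works without it

    # Parse "major.minor" from the version's string form.
    raw = str(pytesseract.get_tesseract_version())
    match = re.match(r"(\d+)\.(\d+)", raw)
    version = (int(match.group(1)), int(match.group(2))) if match else (0, 0)
    return pytesseract.image_to_string(image_path, config=digit_config(version)).strip()
```

Calling `read_digit("text.jpg")` would then use the same `--psm 10` whitelist setup as in the question, switching only the engine mode.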

Alternatively, you could try config='digits', as proposed by Robert Harris in this answer, to force pytesseract to return only digits.

This blog article proposes writing a Python function that uses a simple regex to extract all numbers from the OCR output, instead of juggling several flags and versions.
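The blog's function is not reproduced here, so the following is only a minimal sketch of the regex idea. The names `extract_digits` and `first_digit` are made up for illustration; the input is whatever string `image_to_string` returned.

```python
import re


def extract_digits(ocr_text):
    """Return every digit character found in the OCR output, in order."""
    return re.findall(r"\d", ocr_text)


def first_digit(ocr_text):
    """Return the first digit in the OCR output, or None if there is none."""
    match = re.search(r"\d", ocr_text)
    return match.group() if match else None
```

For the single-character images in the question, `first_digit` discards stray characters such as `\` or `g`, but it cannot recover a digit that Tesseract misread in the first place.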

D. S.
  • I think `config=digits` still doesn't treat images as numeric only; it just does the whitelisting in a different way – ircham Jul 02 '20 at 15:17