I want to do handwritten text recognition with the pytesseract library, reading a single numerical character from images that average 43 x 45 pixels. Sample images:
Expected result:
9
1
4
I want to get a single numerical character from the image.
I've tried the code below:
import pytesseract
# loop through images
print(pytesseract.image_to_string("text.jpg", config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789'))
However, the actual accuracy is below 50%, sometimes much lower. Some digits are read correctly, some images are read as 2 characters, and some are not read at all.
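To make those three outcomes concrete, here is a minimal sketch of how the raw output could be checked. The helper names single_digit and read_digit are hypothetical (they are not part of pytesseract), and this only filters the output; it does nothing to improve recognition itself. Non-digit characters are ignored, so it assumes stray symbols next to one digit are acceptable.

```python
import re
from typing import Optional


def single_digit(raw: str) -> Optional[str]:
    """Reduce raw OCR output to one digit, or None if that is not possible."""
    digits = re.findall(r"[0-9]", raw)
    # Accept the result only when exactly one digit was recognized;
    # empty output and multi-digit output both count as failures.
    return digits[0] if len(digits) == 1 else None


def read_digit(path: str) -> Optional[str]:
    """Run the same Tesseract call as above and keep a single digit only."""
    # Imported here so single_digit() can be used without Tesseract installed.
    import pytesseract

    config = "--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789"
    return single_digit(pytesseract.image_to_string(path, config=config))
```

With this, read_digit("text.jpg") returns a one-character string when exactly one digit is found and None otherwise, which makes it easy to count how many images fail.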
When I remove the -c tessedit_char_whitelist=0123456789 configuration, I get the characters 4, \, and the letter g.
How can I make pytesseract treat each image as a single numerical character, instead of relying on a whitelist while it still reads the text as alphanumeric?
PS: I know that OCR can't be 100% accurate, but I would at least like to improve the accuracy.