I am using tesseract for OCR, via the pytesseract bindings. Unfortunately, I encounter difficulties when trying to extract text that includes subscript-style numbers: the subscript number is interpreted as a letter instead.
For example, in the basic image:
I want to extract the text as "CH3", i.e. I am not concerned about knowing that the number 3 was a subscript in the image.
My attempt at this using tesseract is:
import cv2
import pytesseract
img = cv2.imread('test.jpeg')
# Note that I have reduced the region of interest to the known
# text portion of the image
text = pytesseract.image_to_string(
    img[200:300, 200:320], config='-l eng --oem 1 --psm 13'
)
print(text)
Unfortunately, this will incorrectly output
'CHs'
It's also possible to get 'CHa', depending on the psm parameter.
I suspect that this issue is related to the "baseline" of the text being inconsistent across the line, but I'm not certain.
How can I accurately extract the text from this type of image?
Update - 19th May 2020
After seeing Achintha Ihalage's answer, which doesn't provide any configuration options to tesseract, I explored the psm options.
Since the region of interest is known (in this case, I am using EAST detection to locate the bounding box of the text), the psm config option for tesseract, which in my original code treats the text as a single line, may not be necessary. Running image_to_string against the region of interest given by the bounding box above gives the output
CH
3
which can, of course, be easily processed to get CH3.
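For reference, a minimal sketch of that post-processing step. The helper name collapse_ocr_lines is hypothetical, and it assumes the region contains a single token (such as a chemical formula), so that all whitespace, including spaces between words, can safely be dropped:

```python
def collapse_ocr_lines(text: str) -> str:
    """Join the pieces of a multi-line OCR result into a single token.

    Hypothetical helper: str.split() with no arguments splits on any run
    of whitespace (spaces, newlines, form feeds) and discards empty
    pieces, so joining the parts removes every whitespace character.
    """
    return "".join(text.split())


# The subscript is returned on its own line, as in the output above.
raw = "CH\n3\n"
print(collapse_ocr_lines(raw))
```

This would not be appropriate for regions containing several words, since the spaces between them would be removed as well.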