PyTesseract not recognizing decimals

Question

This is not truly a duplicate of How to extract decimal in image with Pytesseract, as those answers did not solve my problem and my use case is different.

I'm using PyTesseract to recognise text in table cells. When it comes to drug doses with decimal points, the OCR fails to recognise the "." character, though it is accurate for everything else. I'm using tesseract v5.0.0-alpha.20200328 on Windows 10.

My pre-processing consists of upscaling by 400% using cubic interpolation, conversion to black and white, dilation and erosion, morphology, and blurring. I've tried a decent number of combinations of these (as well as each on its own), and none of them got the "." recognised.

I've tried various --psm values as well as a character whitelist. I believe the font is Segoe UI.

Before processing:

After processing:

PyTesseract output: 25mg »p

Processing code:

import cv2
import numpy as np
import pytesseract

# Load the cell image and upscale it 4x with cubic interpolation.
image = cv2.imread('01.png')
upscaled_image = cv2.resize(image, None, fx=4, fy=4, interpolation=cv2.INTER_CUBIC)

# Convert to grayscale (not yet binary, despite the variable name).
bw_image = cv2.cvtColor(upscaled_image, cv2.COLOR_BGR2GRAY)

# Dilate, then erode, with a 2x2 kernel.
kernel = np.ones((2, 2), np.uint8)
dilated_image = cv2.dilate(bw_image, kernel, iterations=1)
eroded_image = cv2.erode(dilated_image, kernel, iterations=1)

# Binarize with a fixed threshold, then apply a 3x3 morphological close.
thresh = cv2.threshold(eroded_image, 205, 255, cv2.THRESH_BINARY)[1]
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
morph_image = cv2.morphologyEx(thresh, cv2.MORPH_CLOSE, kernel)

# Smooth with a bilateral filter, then re-binarize using Otsu's method.
blur_image = cv2.threshold(cv2.bilateralFilter(morph_image, 5, 75, 75), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

# Run OCR on the processed image.
final_image = blur_image
text = pytesseract.image_to_string(final_image, lang='eng', config='--psm 10')

EasyOCR works, but it's VERY slow – Edge, Oct 05 '20 at 08:45

score 1 · Answer 1 · answered Oct 06 '20 at 10:10

If you haven't already, take a look at this thread:

https://groups.google.com/g/tesseract-ocr/c/Wdh_JJwnw94/m/xk2ErJnFBQAJ

One fix that resolves many problems is text height. I ran into a lot of issues without being able to work out why, but sending Tesseract an image whose letters are the right size seems to solve many of them. Instead of upscaling by an arbitrary percentage, try the scale factor that brings the letters in your image close to 30-40 px in height.
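As a rough sketch of that idea, the scale factor can be derived from the letter height instead of being fixed at 4x. The helper names (scale_for_text_height, resize_for_ocr) and the 35 px default are assumptions made for illustration, and the current letter height is something you would have to measure yourself, for example in an image editor.

```python
def scale_for_text_height(current_height_px, target_height_px=35.0):
    """Return the resize factor that brings letters to the target height.

    current_height_px is the measured height of a capital letter in the
    source image; 35 px is an assumed midpoint of the 30-40 px range.
    """
    if current_height_px <= 0:
        raise ValueError("current_height_px must be positive")
    return target_height_px / current_height_px


def resize_for_ocr(image, current_height_px, target_height_px=35.0):
    """Resize an OpenCV image so its letters land near the target height."""
    import cv2  # imported here so the helper above has no dependencies

    factor = scale_for_text_height(current_height_px, target_height_px)
    # Cubic interpolation when enlarging, area averaging when shrinking.
    interpolation = cv2.INTER_CUBIC if factor > 1 else cv2.INTER_AREA
    return cv2.resize(image, None, fx=factor, fy=factor,
                      interpolation=interpolation)
```

For letters that are 10 px tall in the original, scale_for_text_height(10) gives 3.5, so a fixed 4x upscale would overshoot the suggested range slightly.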

Also, if your preprocessing somehow turns the "." into a noise-like mark, it will be ignored as well.
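To illustrate how that can happen: on dark-on-light text, a dilation followed by an erosion (a morphological close) wipes out dark marks smaller than the kernel, and a decimal point is usually the smallest mark in the cell. The sketch below is a pure-Python stand-in for cv2.dilate and cv2.erode on a tiny synthetic image, using a centred 3x3 window. It is an illustration only, not a diagnosis of the image in the question, and all function names are made up for the example.

```python
def _filter(img, k, pick):
    """Apply a k x k max filter (dilate) or min filter (erode) to a 2D list."""
    h, w = len(img), len(img[0])
    r = k // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                img[yy][xx]
                for yy in range(max(0, y - r), min(h, y + r + 1))
                for xx in range(max(0, x - r), min(w, x + r + 1))
            ]
            out[y][x] = pick(window)
    return out


def close_bright(img, k=3):
    """Dilate then erode: grows bright areas, then shrinks them back."""
    return _filter(_filter(img, k, max), k, min)


def canvas_with_dark_square(size, side):
    """White (255) canvas with a centred dark (0) square of the given side."""
    img = [[255] * size for _ in range(size)]
    start = (size - side) // 2
    for y in range(start, start + side):
        for x in range(start, start + side):
            img[y][x] = 0
    return img


def dark_pixels(img):
    """Count the pixels that are still fully dark."""
    return sum(v == 0 for row in img for v in row)


small_dot = close_bright(canvas_with_dark_square(9, 2))   # dot-sized mark
large_blob = close_bright(canvas_with_dark_square(9, 4))  # stroke-sized mark
print(dark_pixels(small_dot), dark_pixels(large_blob))
```

The 2x2 mark disappears completely while the 4x4 one comes through intact, so it is worth saving the image after each preprocessing step and checking that the "." is still there.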

score 0 · Answer 2 · answered Dec 30 '20 at 18:55

I had a similar case and was able to increase the number of correctly recognised decimals by using image processing methods and upscaling the image. Still, a small share of the decimals was not recognised correctly.

The solution I found was to change the language setting for pytesseract:

I was using a non-English setting, but changing the config to lang='eng' fixed all remaining issues.

That might not help with the original question, though, as the setting is already eng.
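For reference, a minimal sketch of passing the language and other options to pytesseract explicitly. The helper names (build_tesseract_config, ocr_cell), the default of --psm 7 (single text line) and the example whitelist are assumptions for illustration; tessedit_char_whitelist is a Tesseract config variable, but whether it helps depends on the Tesseract version and engine in use.

```python
def build_tesseract_config(psm=7, whitelist=None):
    """Build a pytesseract config string from a page segmentation mode
    and an optional character whitelist (no spaces in the whitelist)."""
    parts = [f"--psm {psm}"]
    if whitelist:
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


def ocr_cell(image, lang="eng", psm=7, whitelist=None):
    """Run OCR on one table cell with an explicit language setting."""
    import pytesseract  # imported here so the helper above runs without it

    config = build_tesseract_config(psm, whitelist)
    return pytesseract.image_to_string(image, lang=lang, config=config).strip()


print(build_tesseract_config(7, "0123456789.mg"))
```

Calling ocr_cell(final_image, lang="eng", whitelist="0123456789.mg") would then restrict the output to digits, the decimal point and the unit letters.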
