I am attempting to collect data from a shop in a game (Starbase) in order to feed it to a website and display it as a candlestick chart.

So far I have been using Tesseract OCR 5.0.0, but I have run into issues: I cannot extract the values reliably.

I have seen that the images can be pre-processed to increase reliability, but I have hit a bottleneck, as I am not familiar enough with Tesseract and OpenCV to know what to do next.

Please note that since this is an in-game UI, the images are going to be very consistent, as there are no colour variations / light changes / font size changes / ... I technically only need to get it to work once and that's it.

Here are the steps I have taken so far and the results :

I have started by taking a screenshot of only the part of the UI I am interested in, to remove as much clutter as possible:

input

I have then applied a threshold as shown here (I will also be using the cropping part when doing the automation, but I am not there yet), set the language to English and the psm argument to 6, which gives me the following code:

import cv2
import pytesseract


def clean_text(text):
    ret = text.replace("\n\n", "\n")  # remove the blank lines
    return ret


pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract'
img = cv2.imread('screens/ressources_list_array_1.png', 0)
thresh = 255 - cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

print("======= Output")
print(clean_text(pytesseract.image_to_string(thresh, lang='eng', config='--psm 6')))

cv2.imshow('thresh', thresh)
cv2.waitKey()

Here is an example of the output I get :

======= Output
Aegisium Ore 4490 456
Ajatite Ore 600 332
Arkanium Ore 84999 53
Bastium Ore 2350 421
Charodium Ore 5 280 366
Corazium Ore 39 896 212
Exorium Ore 5 380 112
Ice 980 141
Karnite Crystal ele) 111
Kutonium Ore 14 000 215
Lukium Ore 31 000 158
Nhurgite Crystal 3144 64
Surtrite Crystal 4198 70
Valkite Ore 545 150
Vokarium Ore 1850 415
Ymrium Ore 69 899 60

There are two main issues :
1 - It is not reliable enough; you can see it confused 6 000 with ele)
2 - It is not properly detecting where the numbers start and end, making it difficult to tell the 2 columns apart

I think I can solve the second issue by further splitting the image into 3 columns, but I am unsure whether that would be a big hit on CPU / GPU usage, which I would preferably avoid.

I also found the OpenCV documentation that lists all of the possible image processing methods, but there are a lot of them and I am unsure which ones to use to further increase reliability.

Any help is much appreciated

2 Answers

Pytesseract, on its own, doesn't handle table detection very well - the table format isn't retained in the output, which can make it difficult to parse, as seen in your output.

So splitting the table into distinct columns, performing OCR on each, and then rejoining the columns will help. This is slower, but it is more accurate.

Dilation can also help: it adds white pixels to existing white areas (with the threshold and image you currently have, the text is white), which thickens the narrow strokes of the numbers.

In my experience, improving the accuracy generally means splitting the table up into different sections, as well as testing different thresholds and dilation settings.

import cv2
import numpy as np
import pandas as pd
import pytesseract


def clean_text(text):
    # same helper as in the question: remove the blank lines
    return text.replace("\n\n", "\n")


def read_img(path):
    '''
    Read an image and convert it to grayscale.
    '''
    img = cv2.imread(path)
    return cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)


img = read_img("img_path.png")
thresh = 255 - cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]  # your current threshold
dilated = cv2.dilate(thresh, np.ones((3, 1), np.uint8), iterations=1)  # dilate vertically (don't want to smudge the numbers together)

cols = []
# split the image into the three columns by array slicing
for v in [dilated[:, 0:200], thresh[:, 200:500], dilated[:, 800:900]]:
    # note that the middle column isn't dilated; when it is, a spurious decimal point is found
    cols.append(clean_text(pytesseract.image_to_string(v, lang='eng', config='--psm 6')).split('\n'))
pd.DataFrame(cols).T
                   0       1    2
0       Aegisium Ore    4490  456
1        Ajatite Ore     600  332
2       Arkanlum Ore   84999   53
3        Bastium Ore    2350  421
4      Charodium Ore   5 280  366
5       Corazium Ore  39 896  212
6        Exorlum Ore   5 380  112
7                Ice     980  141
8    Karnite Crystal   6 000  111
9       Kutonlum Ore  14 000  215
10        Lukium Ore  31 000  158
11  Nhurgite Crystal    3144   64
12  Surtrite Crystal    4198   70
13       Valkite Ore     545  150
14      Vokarlum Ore    1850  415
15        Ymrium Ore  69 899   60

np.ones provides the kernel for the dilation to use; see the OpenCV morphological operations documentation for details.

Lastly, depending on your use case, AWS Textract does a good job of parsing tables and numbers, and its documentation provides sample Python code for connecting to the API, which worked well for me at least. Hopefully some of this is helpful.

jacob
    Nice solution. The spaces in the second column are probably thousands separators and probably unwanted. You might consider changing `thresh[:,200:500]` to `thresh[:,200:500].replace(' ', '')` to remove spaces. Or instead of empty string, you could replace spaces with your preferred thousands separator – bfris Jan 03 '22 at 23:40
  • works like a charm thank you ^^ – Matthieu Raynaud de Fitte Jan 05 '22 at 19:38

Your code actually works quite well. To improve performance, you'll want to feed Tesseract the negative of your threshold image: Tesseract prefers black text on a white background. Don't know why.

Your threshold command is already making a "double negative" image: first with the 255 - and second with the cv2.THRESH_BINARY_INV. To fix, you can either:

  1. remove the 255 -, or
  2. change the cv2.THRESH_BINARY_INV argument to cv2.THRESH_BINARY.

After this change, your text will be perfectly detected:

======= Output
Aegisium Ore 4490 456
Ajatite Ore 600 332
Arkanium Ore 84999 53
Bastium Ore 2350 421
Charodium Ore 5 280 366
Corazium Ore 39 896 212
Exorium Ore 5 380 112
Ice 980 141
Karnite Crystal 6 000 111
Kutonium Ore 14 000 215
Lukium Ore 31 000 158
Nhurgite Crystal 3144 64
Surtrite Crystal 4198 70
Valkite Ore 545 150
Vokarium Ore 1850 415
Ymrium Ore 69 899 60
♀

Well, there is that extra funky character at the end.

Regarding not being able to distinguish between the columns: it shouldn't be too computationally expensive to split the image into three columns, and it may lead to easier-to-write code.

If you'd prefer to solve the column problem a different way: if the third-column values are always less than 1000, then you can accurately infer which column each number belongs to.
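A hypothetical sketch of that idea (`parse_row` is my name, not from the answer): take the last token as the third column, then sweep left collecting digit tokens for the second column, since OCR renders its thousands groups with spaces.

```python
def parse_row(line):
    """Split an OCR'd row into (name, price, quantity), assuming the
    last column is always < 1000 and the middle column may contain
    space-separated thousands groups (e.g. "69 899")."""
    tokens = line.split()
    qty = int(tokens[-1])  # last token: the third column
    i = len(tokens) - 2
    groups = []
    while i >= 0 and tokens[i].isdigit():  # collect the price's digit groups
        groups.append(tokens[i])
        i -= 1
    price = int(''.join(reversed(groups)))
    name = ' '.join(tokens[:i + 1])
    return name, price, qty

print(parse_row("Ymrium Ore 69 899 60"))  # ('Ymrium Ore', 69899, 60)
print(parse_row("Ice 980 141"))           # ('Ice', 980, 141)
```

This only works if item names never end in a bare digit token, which seems to hold for the ore/crystal names in your output.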

bfris
  • The extra funky character is the "end of page" character, the ASCII "Form Feed" (see e.g. https://en.wikipedia.org/wiki/Page_break), which can be avoided with the Tesseract config variable `-c page_separator=""` or in Python with `.replace('\\n\\f', '')` – user898678 Jan 04 '22 at 08:40