Python - Improving Tesseract OCR to recognize list of names

Question

I'm working on a project that will recognize teams in a game (Overwatch) and record which players were on which team. It has a predefined list of who is playing, so it only needs to recognize which image each player appears in. So far I have had success capturing the images for each team and getting a rough name for each player; however, it is getting several letters confused.

My input images:

And the output I get from OCR:

W THEMIGHTVMRT
ERSVZENVRTTR
ERSVLUCID
ERSVZRRVR
ERSVMEI
EFISVSDMBRR

ERSV RNR
ERSVZENVRTTR
EFISVZHRVR
ERSVMCCREE
ERSVMEI
EHSVRDRDHDG

From this, you can see that the OCR confuses "A" with "R" and "Y" with "V". I was able to get the font file that Overwatch uses and generate a .traineddata file using Train Your Tesseract - I'm aware that there is probably a better way of generating this file, though I'm not sure how.

My code:

    import pytesseract
    import pyscreenshot

    # Path to the Tesseract executable, and the folder holding the custom 'owf' traineddata
    pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
    tessdata_dir_config = '--tessdata-dir "C:\\Program Files (x86)\\Tesseract-OCR\\tessdata"'

    # Grab each team's region of the screen; bbox is (X1, Y1, X2, Y2)
    team1 = pyscreenshot.grab(bbox=(50, 450, 530, 810))
    team1.save("team1screenshot.png")
    team1text = pytesseract.image_to_string(team1, config=tessdata_dir_config, lang='owf')

    team2 = pyscreenshot.grab(bbox=(800, 450, 1280, 810))
    team2.save("team2screenshot.png")
    team2text = pytesseract.image_to_string(team2, config=tessdata_dir_config, lang='owf')

    print(team1text)
    print("------------------")
    print(team2text)

How should I improve the recognition of these characters? Do I need a better .traineddata file, or is it a matter of better image processing?

Thanks for any help!

Do you need to solve this problem at the OCR stage? Since you have the list of correct names you could simply match each name recognized by the OCR to the closest correct name using the [edit distance](https://en.wikipedia.org/wiki/Edit_distance). — Florian Brucker, Jul 13 '17 at 10:49
Doing that could be a possibility, I'll try it and see if it works... — Matthew Winfield, Jul 13 '17 at 10:56
@FlorianBrucker I've attempted it using the algorithm from [this](https://stackoverflow.com/questions/2460177/edit-distance-in-python) post. It sort of works, however, I get several with matching scores (for example EHSVRDRDHDG matches Ana, Lucio, Zarya, Mei, Roadhog and McCree all with a score of 10...). Is there any way of improving this algorithm? — Matthew Winfield, Jul 13 '17 at 11:21
After you've calculated the distance from each OCR-string to each known name you can calculate the optimal mapping using [minimum cost bipartite matching](https://stackoverflow.com/q/4426131/857390). That will minimize the total error. — Florian Brucker, Jul 13 '17 at 12:48

score 0 · Answer 1 · answered Jul 13 '17 at 11:51

As @FlorianBrucker mentioned, running a similarity test on the strings makes it possible (with some fine-tuning) to recover the correct name after the OCR stage.
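A minimal sketch of that post-OCR matching step, assuming the known names are available as a plain list. The function names (`edit_distance`, `normalize`, `match_names`) and the three-name roster are illustrative and not part of any library. The sketch compares strings in a single case, then picks the one-to-one assignment with the lowest total edit distance, along the lines suggested in the comments. The brute-force search over permutations is a stand-in for a proper minimum cost bipartite matching and is only reasonable for small rosters such as six players per team.

```python
from itertools import permutations


def edit_distance(a, b):
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def normalize(text):
    """Upper-case and keep letters only, so case and stray spaces cost nothing."""
    return "".join(ch for ch in text.upper() if ch.isalpha())


def match_names(ocr_lines, known_names):
    """Pair each OCR line with a distinct known name, minimizing total distance."""
    if len(known_names) < len(ocr_lines):
        raise ValueError("need at least as many known names as OCR lines")
    cost = {
        (line, name): edit_distance(normalize(line), normalize(name))
        for line in ocr_lines
        for name in known_names
    }
    # Try every one-to-one assignment and keep the cheapest (small inputs only).
    best = min(
        permutations(known_names, len(ocr_lines)),
        key=lambda names: sum(cost[pair] for pair in zip(ocr_lines, names)),
    )
    return list(zip(ocr_lines, best))


# Hypothetical roster; the OCR lines are taken from the output in the question.
ocr_lines = ["ERSVMEI", "ERSVMCCREE", "ERSV RNR"]
known_names = ["Ana", "Mei", "McCree"]
for line, name in match_names(ocr_lines, known_names):
    print(f"{line!r} -> {name}")
```

Forcing each name to be used at most once is what breaks ties between candidates that score the same for a single line. For larger rosters, a dedicated assignment solver would scale better than enumerating permutations.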

Matthew Winfield


score 0 · Answer 2 · answered Jan 15 '21 at 10:55

You could try a custom OCR config that does a sparse text search ("Find as much text as possible in no particular order"). To do that, set the page segmentation mode (psm) to 11 in the Tesseract config:

    tessdata_dir_config = "--oem 3 --psm 11"

To see a complete list of supported page segmentation modes (psm), use tesseract -h. Here's the list as of 3.21:

0. Orientation and script detection (OSD) only.
1. Automatic page segmentation with OSD.
2. Automatic page segmentation, but no OSD, or OCR.
3. Fully automatic page segmentation, but no OSD. (Default)
4. Assume a single column of text of variable sizes.
5. Assume a single uniform block of vertically aligned text.
6. Assume a single uniform block of text.
7. Treat the image as a single text line.
8. Treat the image as a single word.
9. Treat the image as a single word in a circle.
10. Treat the image as a single character.
11. Sparse text. Find as much text as possible in no particular order.
12. Sparse text with OSD.
13. Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.

I'm using the Python wrapper for Tesseract, pytesseract: https://github.com/madmaze/pytesseract

With it, you can pass the configuration like this (the example uses --psm 6; substitute 11 for sparse text):

    import pytesseract

    custom_oem_psm_config = r'--oem 3 --psm 6'
    pytesseract.image_to_string(image, config=custom_oem_psm_config)
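Note that the config strings in this answer contain only `--oem` and `--psm`, so reusing the name `tessdata_dir_config` would drop the `--tessdata-dir` option from the question. A small sketch of one way to keep both, assuming the flags are simply joined into a single string; `build_tesseract_config` and `ocr_with_modes` are hypothetical helper names, not part of pytesseract:

```python
def build_tesseract_config(psm, oem=3, tessdata_dir=None):
    """Join Tesseract command-line flags into one config string."""
    parts = []
    if tessdata_dir is not None:
        # Quoted the same way as in the question, since the path contains spaces.
        parts.append(f'--tessdata-dir "{tessdata_dir}"')
    parts.append(f"--oem {oem}")
    parts.append(f"--psm {psm}")
    return " ".join(parts)


def ocr_with_modes(image, modes=(6, 11), lang="owf", tessdata_dir=None):
    """Run OCR once per page segmentation mode so the outputs can be compared."""
    import pytesseract  # imported here so the helper above works without it

    results = {}
    for psm in modes:
        config = build_tesseract_config(psm, tessdata_dir=tessdata_dir)
        results[psm] = pytesseract.image_to_string(image, lang=lang, config=config)
    return results


print(build_tesseract_config(11))
```

Calling `ocr_with_modes(team1, tessdata_dir=...)` with the screenshot from the question would return one OCR result per mode, which makes it easier to judge whether sparse text (11) or a uniform block (6) suits these images better.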
