How to restrict the recognized characters in tesserocr?

Question

When using tesserocr, how do I limit the set of characters that Tesseract will recognize to just the digits?

I know from this that if I were using C++ I could set a tessedit_char_whitelist in the config file, but I don't know the analogous approach in tesserocr within Python.

In general, the tesserocr documentation gives help that works if the reader already knows the Tesseract API for C++. As I am not fluent in C++, I am hoping to avoid having to read the C++ source code in order to use tesserocr.

If anyone can give me what I actually need to write in Python, or a general rule for going from config settings to Python code, that would be great. Thanks in advance.

sinecode · Accepted Answer · 2018-04-17T19:40:55.337

3

Tesserocr works like the C++ API: you can set a whitelist with the SetVariable function.

An example:

from tesserocr import PyTessBaseAPI
from string import digits

with PyTessBaseAPI() as api:
    # Restrict recognition to the characters 0-9
    api.SetVariable('tessedit_char_whitelist', digits)
    api.SetImageFile('image.png')
    print(api.GetUTF8Text())  # it will print only digits

If you want another approach that is more straightforward and independent of the C++ API, try the pytesseract module.

An example with pytesseract:

import pytesseract
from PIL import Image
from string import digits

image = Image.open('image.png')
# '-c name=value' passes a config variable to the tesseract command line
print(pytesseract.image_to_string(
    image, config='-c tessedit_char_whitelist=' + digits))
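
As for a general rule for going from config settings to Python code, the two examples above suggest one: each `name value` entry from a Tesseract config file becomes either a `SetVariable(name, value)` call in tesserocr or a `-c name=value` option in the pytesseract `config` string. A minimal sketch of that mapping follows. The helper names `to_pytesseract_config` and `apply_variables` are hypothetical and belong to neither library, and the sketch assumes the values contain no spaces.

```python
from string import digits


def to_pytesseract_config(variables):
    """Build a pytesseract `config` string from a dict of Tesseract variables.

    Hypothetical helper: each entry becomes one '-c name=value' option.
    Assumes the values contain no whitespace.
    """
    return ' '.join(
        '-c {}={}'.format(name, value) for name, value in variables.items()
    )


def apply_variables(api, variables):
    """Call api.SetVariable(name, value) for every entry.

    `api` is expected to be a tesserocr PyTessBaseAPI instance (or any object
    with a compatible SetVariable method). Returns the names for which
    SetVariable reported failure, e.g. because the name is unknown.
    """
    rejected = []
    for name, value in variables.items():
        if not api.SetVariable(name, str(value)):
            rejected.append(name)
    return rejected


settings = {'tessedit_char_whitelist': digits}
config_string = to_pytesseract_config(settings)
```

With tesserocr this would be used as `apply_variables(api, settings)` inside the `with PyTessBaseAPI() as api:` block, and with pytesseract as `pytesseract.image_to_string(image, config=config_string)`.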

edited Apr 17 '18 at 19:40

answered Apr 17 '18 at 19:29


This seems to completely answer the question. Thanks. Are you aware of a source that would provide an introduction to the range of things Tesseract can do? Most of what I have found is either about installing or refers one to the C++ source code or man page. – WesR Apr 17 '18 at 22:04
Have you looked at [this](https://www.pyimagesearch.com/2017/07/10/using-tesseract-ocr-python/)? It seems a good introduction to Tesseract with Python. – sinecode Apr 17 '18 at 22:12
For some reason, we're not supposed to say T.H.A.N.K.S on SO. If I could, however, I would definitely do so. :-) – WesR Apr 17 '18 at 23:06
@ceccoemi Does your advice still hold for tesseract 4.0 with oem 3? – SKR Oct 21 '18 at 06:26
@SKR It does not. The LSTM they are using does not allow simply removing letters from its output. – Belval Mar 01 '19 at 16:18
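
If the whitelist is ignored by the LSTM engine, as the last comment describes, one possible workaround is to filter the recognized text afterwards. This is only a sketch and not part of the answer above: the function name `keep_only` is hypothetical, and filtering merely discards unwanted characters after recognition, so it does not steer Tesseract toward digits the way a whitelist does.

```python
from string import digits


def keep_only(text, allowed=digits, keep_whitespace=True):
    """Drop every character of `text` that is not in `allowed`.

    Hypothetical post-processing helper for OCR output. Whitespace is kept by
    default so that line and word boundaries survive the filtering.
    """
    return ''.join(
        ch for ch in text
        if ch in allowed or (keep_whitespace and ch.isspace())
    )


# Stand-in for the string returned by api.GetUTF8Text() or
# pytesseract.image_to_string(image)
ocr_output = 'Total: 42 EUR\n'
filtered = keep_only(ocr_output)
```

The same call works on the result of either library, for example `keep_only(api.GetUTF8Text())`.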
