
When using tesserocr, how do I limit the set of characters that Tesseract will recognize to just the digits?

I know from this that if I were using C++ I could set tessedit_char_whitelist in the config file, but I don't know the analogous approach for tesserocr in Python.

In general, the tesserocr documentation is helpful only if the reader already knows the Tesseract C++ API. As I am not fluent in C++, I am hoping to avoid having to read the C++ source code in order to use tesserocr.

If anyone can show me what I actually need to write in Python, or give a general rule for going from config settings to Python code, that would be great. Thanks in advance.

Vadim Kotov
WesR

1 Answer


tesserocr mirrors the C++ API, so you can set a whitelist with the SetVariable method.

An example:

from tesserocr import PyTessBaseAPI
from string import digits

with PyTessBaseAPI() as api:
    api.SetVariable('tessedit_char_whitelist', digits)
    api.SetImageFile('image.png')
    print(api.GetUTF8Text())  # prints only digits
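
As for a general rule: a Tesseract config file is a list of "name value" lines, and each such line corresponds to one SetVariable(name, value) call. The sketch below illustrates that mapping. The helper name apply_config is hypothetical (it is not part of tesserocr), and it assumes that SetVariable returns a false value when the variable name is not recognized.

```python
def apply_config(api, config_text):
    """Apply config-file style lines ("name value") through api.SetVariable.

    `api` is expected to be a tesserocr.PyTessBaseAPI instance, or any object
    with a compatible SetVariable(name, value) method.
    Returns the list of variable names that were rejected.
    """
    rejected = []
    for line in config_text.splitlines():
        line = line.strip()
        # Skip blank lines and '#' comment lines.
        if not line or line.startswith('#'):
            continue
        # The first whitespace-separated token is the variable name,
        # the rest of the line is its value (possibly empty).
        parts = line.split(None, 1)
        name = parts[0]
        value = parts[1] if len(parts) > 1 else ''
        # Assumption: SetVariable returns a false value for unknown names.
        if not api.SetVariable(name, value):
            rejected.append(name)
    return rejected
```

With a real API object this would be used as apply_config(api, 'tessedit_char_whitelist 0123456789') inside the with PyTessBaseAPI() as api: block, before reading the text.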

If you want a more straightforward approach that does not depend on knowing the C++ API, try the pytesseract module.

An example with pytesseract:

import pytesseract
from PIL import Image
from string import digits

image = Image.open('image.png')
print(pytesseract.image_to_string(
    image, config='-c tessedit_char_whitelist=' + digits))
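
The same idea generalizes for pytesseract: each config variable becomes one "-c name=value" pair in the config string. A minimal sketch follows. The helper name build_config is hypothetical (it is not part of pytesseract), and it assumes that the values contain no whitespace.

```python
from string import digits


def build_config(variables):
    """Turn a {name: value} dict into a pytesseract config string.

    Each variable becomes one "-c name=value" pair.
    Assumption: values contain no whitespace (no quoting is attempted).
    """
    return ' '.join('-c {}={}'.format(name, value)
                    for name, value in variables.items())


config = build_config({'tessedit_char_whitelist': digits})
```

The resulting string can then be passed as the config argument, as in pytesseract.image_to_string(image, config=config).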
sinecode
  • This seems to completely answer the question. Thanks. Are you aware of a source that would provide an introduction to the range of things Tesseract can do? Most of what I have found is either about installing or refers one to the C++ source code or man page. – WesR Apr 17 '18 at 22:04
  • Have you looked at [this](https://www.pyimagesearch.com/2017/07/10/using-tesseract-ocr-python/)? It seems a good introduction to Tesseract with Python. – sinecode Apr 17 '18 at 22:12
  • For some reason, we're not supposed to say T.H.A.N.K.S on SO. If I could, however, I would definitely do so. :-) – WesR Apr 17 '18 at 23:06
  • @ceccoemi Does your advice still hold for tesseract 4.0 with oem 3? – SKR Oct 21 '18 at 06:26
  • @SKR It does not. The LSTM they are using does not allow simply removing letters from its output. – Belval Mar 01 '19 at 16:18
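
Following up on the last two comments: where the engine ignores the whitelist, one possible workaround is to filter the recognized text afterwards. This is only a sketch, and it is not equivalent to a whitelist, because it drops non-digit characters after recognition instead of steering the recognizer toward digits. The helper name keep_digits is hypothetical.

```python
from string import digits


def keep_digits(text):
    """Keep only ASCII digits and whitespace from OCR output."""
    return ''.join(ch for ch in text if ch in digits or ch.isspace())
```

It would be applied to whatever api.GetUTF8Text() or pytesseract.image_to_string(image) returns.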