pytesseract using tesseract 4.0 numbers only not working

Question

Has anyone managed to get digits-only output from the latest version of tesseract (4.0) when calling it from Python?

The code below worked in 3.05 but still returns non-digit characters in 4.0. I tried removing all config files except the digits file, and it still didn't work. Any help would be great.

im is an image of a date, black text on a white background:

import pytesseract
im = imageOfDate  # the image of the date to recognize
text = pytesseract.image_to_string(im, config='outputbase digits')
print(text)

Add image to the question for answerers to see your problem. — thewaywewere, Oct 08 '17 at 13:03
I went with https://stackoverflow.com/questions/9413216/simple-digit-recognition-ocr-in-opencv-python/9620295#9620295 instead. — Cees Timmerman, Jun 07 '19 at 10:01
@CuriousGeorge: Did you find a solution to your upgrade problem? — Jarl, Sep 30 '19 at 15:52
Upgrading to v4.1.1 was not enough on its own for me. I also had to download the `tessdata_fast` version of the `traineddata` files. I am attaching a detailed [shell script](https://gist.github.com/ariG23498/b3e46c6e4eaf4da8301e4cae3138987c) to install 4.1.1 from source. — Aritra Roy Gosthipaty, Jun 16 '21 at 13:11

thewaywewere · Answer 1 · 2018-06-27T16:24:45.617

16

You can restrict the output to digits by passing tessedit_char_whitelist as a config option, as below.

# boxes=False was the default and is omitted; newer pytesseract releases no longer accept it.
ocr_result = pytesseract.image_to_string(
    image, lang='eng',
    config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')

Hope this helps.
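A minimal sketch of how the option string above could be assembled, in case you want to vary the page segmentation mode or the whitelist. The helper names `build_digit_config` and `read_digits` are made up for illustration and are not part of pytesseract; only `pytesseract.image_to_string` and its `lang`/`config` arguments come from the answer.

```python
def build_digit_config(psm=10, oem=3, whitelist="0123456789"):
    """Assemble the Tesseract option string used in the answer above."""
    return "--psm {} --oem {} -c tessedit_char_whitelist={}".format(
        psm, oem, whitelist
    )


def read_digits(image, **kwargs):
    """Hypothetical wrapper: OCR `image` with a digits-only whitelist."""
    import pytesseract  # imported lazily so the helper above works without it

    return pytesseract.image_to_string(
        image, lang="eng", config=build_digit_config(**kwargs)
    )


# Show the option string that would be handed to Tesseract.
print(build_digit_config())
```

Note that `--psm 10` tells Tesseract to treat the image as a single character. For a whole date on one line, `build_digit_config(psm=7)` (single text line) may be a better starting point. As the comments point out, the whitelist may still be ignored on tesseract 4.0.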

edited Jun 27 '18 at 16:24

answered Oct 05 '17 at 15:38

thewaywewere

"oem" in config argument is mistyped as "eom" – Jakub Mendyk Jun 27 '18 at 13:56
This solution doesn't work for tesseract 4.0+. There's an open issue related to this on GitHub: https://github.com/tesseract-ocr/tesseract/issues/751. – Jakub Mendyk Jun 27 '18 at 13:57
I tried to fix the typo in May, but somehow it still showed `--eom`. Anyway, I have re-fixed it. – thewaywewere Jun 27 '18 at 16:25
As Jakub mentioned it won't work with 4.0. Instead there is a separate tessdata file for digits – Dmitrii Z. Jun 27 '18 at 21:12
I'm looking for OCR for recognizing time. E.g. **11:25** . Adding a colon (:) to the whitelist didn't work. Any ideas? – Alaa M. Aug 07 '19 at 07:21

score 11 · Answer 2 · answered Mar 06 '19 at 19:31

Using tessedit_char_whitelist flags with pytesseract did not work for me. However, one workaround is to use a flag that works, which is config='digits':

import pytesseract
text = pytesseract.image_to_string(pixels, config='digits')

where pixels is a numpy array of your image (a PIL image should also work). This should force pytesseract to return only digits. To customize what it returns, find your digits configuration file. On Windows, mine was located here:

C:\Program Files (x86)\Tesseract-OCR\tessdata\configs

Open the digits file and add whatever characters you want. After saving and running pytesseract, it should return only those customized characters.
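As a rough sketch of the same idea without editing the stock digits file, you could write a separate config file and hand that to Tesseract instead. The names `write_whitelist_config`, `ocr_with_config_file`, and `custom_digits` are hypothetical. The sketch assumes that config files hold one "variable value" pair per line and that Tesseract will accept a path to a config file outside tessdata\configs; if that does not hold on your install, copy the file into the configs folder shown above and pass just its name.

```python
import os
import tempfile


def write_whitelist_config(chars, directory, name="custom_digits"):
    """Write a Tesseract config file that whitelists `chars`; return its path."""
    path = os.path.join(directory, name)
    with open(path, "w", encoding="ascii") as handle:
        # Assumed format: one "variable value" pair per line.
        handle.write("tessedit_char_whitelist {}\n".format(chars))
    return path


def ocr_with_config_file(image, config_path):
    """Hypothetical wrapper: pass the config file to Tesseract via pytesseract."""
    import pytesseract  # imported lazily so the file helper works without it

    # Assumption: a path is accepted here; paths with spaces may need quoting.
    return pytesseract.image_to_string(image, config=config_path)


# Create a config that allows digits and a slash, e.g. for dates.
config_dir = tempfile.mkdtemp()
config_path = write_whitelist_config("0123456789/", config_dir)
with open(config_path) as handle:
    print(handle.read().strip())
```

Keeping a separate file means the stock digits config stays untouched, so `config='digits'` keeps its original behaviour for other scripts.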

Robert Harris

what if I need text and digits ? – Yaroslav Dukal Jul 12 '19 at 23:26
you can put both text and digits in the digits config file. For example, you could put '1234567890abcdefg...' and it will only return those alphanumeric characters. – Robert Harris Oct 29 '19 at 15:43
Which version are you using? The method `config='digits'` doesn't work for me. I'm using pytesseract==0.3.0. – Ganesh Kharad Feb 27 '20 at 12:59
Works with the latest tesseract as of 2020 – Ammar H Sufyan May 25 '20 at 10:40
`config=digits` only whitelists the numeric characters from alphanumeric input. How can I treat an image as purely numeric instead of alphanumeric? For example, treat `l` as `one` instead of `L`. Any ideas? – ircham Jul 02 '20 at 14:47

score 5 · Answer 3 · edited Jun 02 '20 at 22:34

You can specify the numbers in the tessedit_char_whitelist as below as a config option.

ocr_result = pytesseract.image_to_string(image, lang='eng', config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')

edited Jun 02 '20 at 22:34

Jason Aller

answered Jun 02 '20 at 21:35

Tejesh Teju

score 2 · Answer 4 · answered Mar 29 '20 at 21:24

As you can see in this GitHub issue, the blacklist and whitelist don't work with tesseract version 4.0.

There are 3 possible solutions for this problem, as I described in this blog article:

1. Update tesseract to version > 4.1
2. Use the legacy mode as described in the answer from @thewaywewere
3. Create a python function which uses a simple regex to extract all numbers:

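import re
import pytesseract

def replace_chars(text):
    # Collect every run of digits and join the runs into one string.
    list_of_numbers = re.findall(r'\d+', text)
    result_number = ''.join(list_of_numbers)
    return result_number

# Run OCR first, then drop everything that is not a digit.
result_number = replace_chars(pytesseract.image_to_string(im))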
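Since the first option depends on which Tesseract version is installed, here is a small sketch that decides whether to trust the whitelist or to fall back to filtering afterwards. The rule it encodes (only 4.0.x is affected) is an assumption drawn from this thread, not from the Tesseract documentation, and the function names are made up for illustration.

```python
import re


def whitelist_supported(version_string):
    """Return False for Tesseract 4.0.x, where the whitelist is reported ignored."""
    # Take the first "major.minor" pair found in the string.
    match = re.match(r"\D*(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError("unrecognised version string: {!r}".format(version_string))
    major, minor = int(match.group(1)), int(match.group(2))
    # Assumption from this thread: 3.05 and 4.1+ honour the whitelist, 4.0 does not.
    return (major, minor) != (4, 0)


def digits_only(text):
    """Fallback: keep only the digit characters of an OCR result."""
    return "".join(re.findall(r"\d", text))


for version in ("3.05.02", "4.0.0", "4.1.1"):
    print(version, whitelist_supported(version))
print(digits_only("O8/1O/2017"))
```

With pytesseract installed, the version string could come from `str(pytesseract.get_tesseract_version())`. Keep in mind that the fallback only discards misread characters (the letter O in the example is dropped, not turned into a zero), so it is weaker than a working whitelist.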

mhellmeier

Thanks! Updating to version 4.1.1 from source has solved the problem. https://github.com/tesseract-ocr/tesseract/releases – Doğuş Sep 16 '20 at 17:14
Bad solution. This filters out text after it has been detected, which is totally the wrong way. – Phyo Arkar Lwin Jun 28 '22 at 08:41
