Limit characters tesseract is looking for

Question

Is it possible to limit the set of characters that tesseract is looking for (e.g. search only for letters a-z)? That would improve my results greatly.

score 91 · Accepted Answer · edited Feb 19 '21 at 16:25

Create a config file (e.g. "letters") in the tessdata/configs directory, usually /usr/share/tesseract/tessdata/configs or /usr/share/tesseract-ocr/tessdata/configs.

Add this line to the config file:

tessedit_char_whitelist abcdefghijklmnopqrstuvwxyz

A range such as [a-z] does not work (per the comments, it is interpreted literally), so list every character. Then call tesseract like this:

tesseract input.tif output nobatch letters

That limits tesseract to recognizing only the listed characters.
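If you would rather generate the config file from a script, a minimal Python sketch could look like the following. The function name `write_whitelist_config` is hypothetical, and the choice of ASCII encoding with Unix line endings is an assumption based on a comment below reporting that a UTF-8 / DOS-format file did not work on Windows.

```python
from pathlib import Path


def write_whitelist_config(configs_dir, name, chars):
    """Write a Tesseract config file that sets tessedit_char_whitelist.

    Hypothetical helper: configs_dir should be your tessdata/configs path.
    """
    # Spell out every character; a range such as [a-z] is taken literally.
    line = f"tessedit_char_whitelist {chars}\n"
    path = Path(configs_dir) / name
    # newline="\n" keeps Unix line endings on every platform; ASCII encoding
    # writes no byte-order mark and raises an error on non-ASCII characters.
    with open(path, "w", encoding="ascii", newline="\n") as handle:
        handle.write(line)
    return path
```

Calling `write_whitelist_config("/usr/share/tesseract-ocr/tessdata/configs", "letters", "abcdefghijklmnopqrstuvwxyz")` would produce the file described above, assuming that directory exists and is writable.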

edited Feb 19 '21 at 16:25

peterh

answered Jun 06 '10 at 06:08

Blomman

Sorry for the late answer - this helped. Thank you :) By the way, the regex did not work. It was probably interpreted literally. – Danilo Bargen Jul 11 '10 at 09:09
tessedit_char_whitelist 0123456789, i did this to fetch numbers from an image but out of 20 digits only 4 were correct.Any help would be greatly appreciated!!thank u – Swati Oct 01 '10 at 10:50
SWATI: what kind of image is it? try cleaning up the source image. for example using imagemagick. – Danilo Bargen Oct 21 '10 at 12:27
Hugely helpful! I would say the Tesseract documentation is terrible, but really the word I'm looking for is "non-existent" Thanks! – zorlack Feb 08 '12 at 16:39
@DaniloBargen what do you mean by cleaning up the source image? – Apr 08 '14 at 13:46
`-1` Does not work. No such a file... if i create custom config and use it, it does not take any effect. – Flash Thunder Jan 09 '15 at 15:58
I'm on windows and my initial stab at creating this file did not work until I brought it up in my 'TextPad' editor and changed it from UTF-8 to ANSI (I also changed it from IBM PC to UNIX format) – bkwdesign Oct 29 '20 at 15:12

score 31 · Answer 2 · edited Mar 04 '20 at 01:03

To use a whitelist, whether in a config file or through the -c tessedit_char_whitelist=... command-line switch, in version 4.0 you have to set the OCR Engine mode to "Original Tesseract only". This is because the new "Neural nets LSTM" mode does not respect the whitelist setting. Example of a proper command line for version 4.0:

tesseract input_file output_file --oem 0 -c tessedit_char_whitelist=abc123

UPDATE: In newer 4.0 builds, a corrupted eng.traineddata file is installed by default by the Windows installer and some Linux installers. A temporary solution is to replace the tessdata\eng.traineddata file with one from an older version. This file should be about 30MB. Otherwise you will get the error "Tesseract couldn't load any languages!" or similar.

Update from tesseract 4.1.1

In tesseract 4.1.1 the above bug is fixed, and the following works like a charm:

tesseract my_image.jpg stdout -l mylang configfile myconfig

where "myconfig" is a plain-text file located in TESSDATA/configs that contains:

load_system_dawg false
load_freq_dawg false
tessedit_char_whitelist ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789
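When the command is assembled from a script, argument order is easy to get wrong; the man-page note quoted in the comments says that -l lang and --psm N must occur before any config file. The following Python sketch builds the argument list in that order. The helper name `build_tesseract_command` and its parameters are hypothetical; only the flags themselves (-l, --oem, --psm, -c) come from the commands shown in this thread.

```python
def build_tesseract_command(image, output_base, whitelist=None,
                            lang=None, oem=None, psm=None, configs=()):
    """Return an argument list suitable for subprocess.run().

    Hypothetical helper; it only assembles arguments and runs nothing.
    """
    command = ["tesseract", str(image), str(output_base)]
    if lang is not None:
        command += ["-l", lang]
    if oem is not None:
        command += ["--oem", str(oem)]
    if psm is not None:
        command += ["--psm", str(psm)]
    if whitelist is not None:
        # One list element, so a whitelist containing a space needs no
        # shell quoting when passed to subprocess.run().
        command += ["-c", f"tessedit_char_whitelist={whitelist}"]
    # Config file names go last, after every option.
    command += list(configs)
    return command
```

For example, `build_tesseract_command("input_file", "output_file", whitelist="abc123", oem=0)` yields the 4.0 command above as a list, which can be passed to `subprocess.run(command, check=True)`.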

edited Mar 04 '20 at 01:03

Hasan

answered Feb 28 '18 at 13:39

Bartłomiej Uliasz

I am using pytesseract as pyt and getting following errors when followed the above advice `pyt.image_to_data(im_gray_res, config='-c tessedit_char_whitelist=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ --psm 11 --oem 0')` as `pytesseract.pytesseract.TesseractError: (1, "Failed loading language 'eng' Tesseract couldn't load any languages! Could not initialize tesseract.")`. Any ideas on how to accomplish detection with only required set of chars? – SKR Oct 21 '18 at 05:45
Have you tried using some simple command like `pytesseract.image_to_data(Image.open('test.png'))` without additional arguments? Because the error doesn't seem to be related to the character whitelist itself. – Bartłomiej Uliasz Oct 29 '18 at 23:25
Yes, I tried everything, in fact CLI for tesseract too but I read somewhere that character whitelist is not respected with tesseract 4.0. So I tried giving option oem 0 but then it doesn't even execute. Can you check with --oem 0 option on your side please? – SKR Oct 30 '18 at 02:36
Just tried it and it works fine here. I'm using `tesseract 4.0.0-alpha.20180109`. The command I've used: `tesseract test.png stdout --oem 0 -c tessedit_char_whitelist=abc123` – Bartłomiej Uliasz Oct 31 '18 at 20:50
Oh really, when I installed tesseract on Ubuntu through CLI it installed by default `tesseract 4.0.0-beta.4-138-g2093`. Why did you try on alpha version? Also, do you see in the output only the characters in the whitelist? Did you try with `--oem 1/2/3`? – SKR Nov 01 '18 at 03:37
Yes, you're right. In newer version there's corrupted `eng.traineddata` file. I've tried the newest 4.0 version and got the same error. Temporary solution is to replace `tessdata\eng.traineddata` file with one from older version. This file should be about 30MB (not 4MB like the one installed on 4.0 version). – Bartłomiej Uliasz Nov 13 '18 at 00:29
That was a nice insight about observing the size of the traineddata !!! Thx. I will do like that you have suggested and let us see whether it respects the characters but I was wondering if the size is only 4MB then how tesseract 4.0 is actually recognizing characters? I guess replacing `tessdata\eng.traineddata` from older version should be quick and straightforward? – SKR Nov 14 '18 at 17:46
Yes, I have just tried the newest version of this file from the GitHub project [link](https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata) and replaced the one I had in `tessdata/eng.traineddata` with the downloaded one, and all worked flawlessly on 4.0 version. – Bartłomiej Uliasz Nov 21 '18 at 04:13
@SKR: The order of options matter in the `config` parameter. `tesseract`'s man-page mentions "Nota Bene: The options -l lang and --psm N must occur before any configfile." – munikarmanish Dec 04 '18 at 10:18
Tested and works with eng.traineddata (best - [link here](https://github.com/tesseract-ocr/tesseract/wiki/Data-Files)) and Tesseract Open Source OCR Engine 5.0.0-alpha with Leptonica. Command I used was: `tesseract.exe clean.png cleanOut --dpi 160 -c tessedit_char_whitelist="abcdefghijklmnopqrstuvwxyz_1234567890ABCDEFGHIJKLMNOPQRSTUVWXYZ,. "` please note that I provided space at the end as white character, hence quote escaping was needed. Dpi is optional. – Zenon Dec 27 '19 at 15:31

score 26 · Answer 3 · edited Aug 23 '21 at 15:30

Besides the config file, you can use the -c flag:

tesseract stdin stdout -c tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyz -psm 6

Update: confirmed working on version 4.1.1.

edited Aug 23 '21 at 15:30

answered Sep 08 '16 at 09:34

jmunsch

Even when I set that to plain vanilla letters I see messages 'Detected 31 diacritics'. That's odd, since I haven't included any diacritics or accented characters in the whitelist. – Ed Avis Apr 25 '17 at 09:11
@EdAvis See: https://github.com/tesseract-ocr/tesseract/wiki/FAQ#diacritics-above-and-below-the-glyph-are-ignoredcause-garbage-output might have to do with version numbering. It would require more research on the version numbering for me to fully understand, but bumping the version, researching the shell version, and unicode handling, or utf*, might show some clues. Sorry i dont have a complete answer. – jmunsch Oct 30 '17 at 06:23
whitelist not supported in tesseract 4 – wolfgang Feb 06 '19 at 19:31
I can confirm this works well with Tesseract 4.1.1 on Linux. – Kingsley Oct 23 '20 at 00:47
this works on Mac via Homebrew which uses Tesseract 4.1.1 – kcdragon Aug 23 '21 at 02:31

score 10 · Answer 4 · answered Mar 21 '17 at 13:03

Just adding this for anyone using tesseract on Android. In your readOCR function, where you set the language etc., add the following line:

tesseract.setVariable("tessedit_char_whitelist","ABCDEFGHIJKLMNOPQRSTUVWXYZ");

You can also set tessedit_char_blacklist in the same way to exclude characters.

answered Mar 21 '17 at 13:03

user3244591

For those using tess4j (the Java wrapper) use `tesseract.setTessVariable()` – Pranav Nov 16 '18 at 17:28

score 2 · Answer 5 · answered Apr 26 '19 at 04:24

In Tesseract version 4.00, this can't be done. You can only fine-tune your model or use a regex to remove extra characters from the prediction.
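The regex workaround can be sketched in Python as follows. The function name `keep_only` is hypothetical, and the sketch assumes the OCR text is already available as a string.

```python
import re


def keep_only(text, allowed):
    """Drop every character of text that is not in allowed."""
    if not allowed:
        # An empty allowed set would produce an invalid pattern.
        return ""
    # re.escape makes characters such as ']' or '-' safe inside the class.
    pattern = re.compile(f"[^{re.escape(allowed)}]")
    return pattern.sub("", text)
```

Note that this only deletes unwanted characters after recognition; unlike a real whitelist, it cannot steer the engine toward a different reading of an ambiguous glyph, so the results may differ from a whitelisted run.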

answered Apr 26 '19 at 04:24

Andrew Ravus

No longer true? – jtlz2 Sep 04 '19 at 13:35
Is it impossible to use a whitelist using tesseract 4.0.0 running in Legacy OEM? is this a bug in tesseract 4.0.0 that is fixed in 4.1.1? Can you confirm? – Yep Nov 13 '20 at 03:49

score 2 · Answer 6 · answered Apr 11 '20 at 06:15

I am using Ubuntu 18.04.4 LTS, where the default tesseract is version 4, and I could not use a whitelist with it. After I upgraded to version 5, the whitelist worked. Compare the output without and with it:

tesseract sample.jpg stdout -l eng --oem 3 --psm 7
Warning: Invalid resolution 0 dpi. Using 70 instead.
LL £036 GL)

tesseract sample.jpg stdout -l eng --oem 3 --psm 7 -c tessedit_char_whitelist="ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
Warning: Invalid resolution 0 dpi. Using 70 instead.
L4036GL

score 0 · Answer 7 · answered Oct 29 '20 at 15:27

My answer is derived wholly from the accepted answer, and is added here to benefit any .NET Windows developers using the Tesseract NuGet package. However, take note of bullet 2, which applies to anybody using any kind of Tesseract on Windows.

1. Create a configs folder inside your tessdata folder, where the other training data is located.
2. Add a letters file inside the configs folder. Use an editor like TextPad that will help you save it in UNIX format with ANSI encoding (I had initially tried UTF-8 / IBM PC, and tesseract was puking an error into my Tests output).
3. Just like your training files, ensure the letters file has its Build Action set to Content in the Properties panel and is marked to copy to the output directory.
4. Invoke your tesseract engine class thusly:

 var ocrEng = new TesseractEngine("./tessdata", "eng", EngineMode.Default, "letters");
