
Does anyone know how to set the character whitelist for Pytesseract? I want it to only output A-z and 0-9. Is this possible? I have the following:

from PIL import Image
import pytesseract
img = Image.open('test.jpg')
result = pytesseract.image_to_string(img, config='-psm 6')

I'm getting other characters like / for a 1 so I would like to limit the options of possible characters.

Antoine Dubuis
Minato10

1 Answer


You can accomplish that with the line below. Alternatively, you can set up a config file for Tesseract to do the same thing; see "Limit characters tesseract is looking for".

pytesseract.image_to_string(question_img, config="-c tessedit_char_whitelist=0123456789abcdefghijklmnopqrstuvwxyz -psm 6")

I am sure there are other ways to get it to work, but this is what worked for me.
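To avoid typing the character set by hand, the config string can be assembled from Python's `string` constants. This is only a minimal sketch: `build_whitelist_config`, `ALNUM`, and `ocr_alnum` are hypothetical names introduced here, and it uses `--psm` rather than `-psm` because newer Tesseract releases expect the double dash, as a comment in this thread also suggests. The helper rejects whitespace in the whitelist to keep the sketch simple.

```python
import string

# Digits plus upper- and lowercase ASCII letters; the whitelist is case sensitive.
ALNUM = string.digits + string.ascii_uppercase + string.ascii_lowercase


def build_whitelist_config(chars: str, psm: int = 6) -> str:
    """Build a Tesseract config string that restricts output to `chars`.

    Hypothetical helper, not part of pytesseract.
    """
    if any(ch.isspace() for ch in chars):
        # Whitespace would need quoting inside the config string; not handled here.
        raise ValueError("whitespace in the whitelist is not supported by this sketch")
    return f"-c tessedit_char_whitelist={chars} --psm {psm}"


def ocr_alnum(image_path: str) -> str:
    """Run OCR on an image file, asking Tesseract for alphanumerics only."""
    # Imported lazily so the config helper works without these packages installed.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, config=build_whitelist_config(ALNUM))
```

Calling `ocr_alnum('test.jpg')` would then behave like the one-liner above, with uppercase letters included in the whitelist.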

Community
James Vaughn
  • For future reference: the value of `tessedit_char_whitelist` is case sensitive, so to capture `aA-zZ0-9` you would need the full `0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz` – Cole Mar 20 '18 at 19:51
  • @Cole Is the above answer still valid? I tried `pyt.image_to_data(im_gray_res, lang='eng', config='-c tessedit_char_whitelist=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ --psm 11 --oem 3')` but the results still contain `|` for `I` and `l` for `I`. – SKR Oct 21 '18 at 05:33
  • @SKR You included `I` and `1` in your `tessedit_char_whitelist`, so that is expected. Might you be confusing a whitelist with a blacklist? – Cole Oct 21 '18 at 16:21
  • @Cole I think a whitelist directs Tesseract to return only characters from the given set, right? The results contain the pipe character `|` and a lowercase L `l`. That is what I meant: I still get characters from outside the set. – SKR Oct 21 '18 at 20:54
  • @SKR Oh, I get what you mean now. Sadly, I don’t know why that’s happening. I recommend opening a new question on here, someone likely knows :-) – Cole Oct 21 '18 at 22:30
  • @RexLow No, not yet. I tried a lot, but there seems to be some problem with pytesseract itself in filtering out unwanted characters. I intend to come back to this problem, and if I find anything I will post it here. – SKR Apr 15 '19 at 23:41
  • It does not work with Tesseract 4.0.0, but it works OK with Tesseract 4.1.0. – bedna Dec 27 '19 at 08:18
  • Try '--psm' instead of '-psm'. – spiralmoon Jul 21 '20 at 18:29
  • How can I include a space in the whitelist? – Rishabh Gupta Dec 22 '20 at 08:53
  • How can I include a space in the blacklist? – Muhammad Uzair Oct 07 '22 at 19:11
  • By writing a space character and putting the whole whitelist between quotes: `-c tessedit_char_whitelist='ABCDEFGHIJKLMNOPQRSTUVWXYZ '` – Tyrannas Dec 13 '22 at 16:55
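
Where the installed Tesseract version does not honour the whitelist, as the comments above report for 4.0.0, one possible workaround is to filter the OCR output in Python afterwards. This is a different technique from the whitelist itself and is not proposed anywhere in the thread: it only drops unwanted characters and does not correct misreads such as `|` for `I`. The function name `filter_to_whitelist` is hypothetical.

```python
def filter_to_whitelist(text: str, allowed: str, keep_whitespace: bool = True) -> str:
    """Drop every character of `text` that is not in `allowed`.

    Hypothetical post-processing helper; whitespace is kept by default so
    that line and word boundaries in the OCR output survive.
    """
    allowed_set = set(allowed)
    return "".join(
        ch for ch in text
        if ch in allowed_set or (keep_whitespace and ch.isspace())
    )
```

It would be applied to the string returned by `pytesseract.image_to_string`, with the same character set that was passed as the whitelist.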