Pytesseract OCR multiple config options

Question

I am having some problems with pytesseract. I need to configure Tesseract so that it accepts single digits and recognizes only numbers, because the digit zero is often confused with the letter 'O'.

Like this:

target = pytesseract.image_to_string(im,config='-psm 7',config='outputbase digits')

score 170 · Accepted Answer · edited Dec 23 '22 at 23:42

tesseract-4.0.0a supports the page segmentation modes (psm) listed below. For single-character recognition, set psm = 10. If your text consists of digits only, you can also set tessedit_char_whitelist=0123456789.

Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR.
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
                        bypassing hacks that are Tesseract-specific.

Here is a sample usage of image_to_string with multiple parameters.

target = pytesseract.image_to_string(image, lang='eng',
        config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')
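The key point of the sample above is that all options travel in one `config` string rather than in several `config=` arguments. A minimal sketch of assembling that string follows; `build_config` and `read_single_digit` are hypothetical helper names, not part of pytesseract, and the wrapper assumes pytesseract and a Tesseract binary are installed.

```python
def build_config(psm, oem=None, whitelist=None):
    """Join several Tesseract options into the single string pytesseract expects."""
    parts = ["--psm", str(psm)]
    if oem is not None:
        parts += ["--oem", str(oem)]
    if whitelist:
        # -c sets a Tesseract config variable as key=value
        parts += ["-c", f"tessedit_char_whitelist={whitelist}"]
    return " ".join(parts)


def read_single_digit(image):
    """Hypothetical wrapper: OCR one digit from an image."""
    # Imported here so build_config can be used without pytesseract installed.
    import pytesseract

    config = build_config(psm=10, oem=3, whitelist="0123456789")
    return pytesseract.image_to_string(image, lang="eng", config=config).strip()


print(build_config(psm=10, oem=3, whitelist="0123456789"))
# --psm 10 --oem 3 -c tessedit_char_whitelist=0123456789
```

For the single-line case from the question, `build_config(psm=7, whitelist="0123456789")` would produce the equivalent string with `--psm 7`.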

edited Dec 23 '22 at 23:42

starball


answered Jun 19 '17 at 14:05

thewaywewere


It is not a new question. It is a follow up of your solution which has direct inference from what you provided. It would help if you care to mention what version of tesseract you used to use the parameter for whitelist. Please read my comment again, you will understand. – SKR Oct 23 '18 at 02:22
For anyone who wants to know what oem means, click here https://wilsonmar.github.io/tesseract/ – X.C. Mar 05 '20 at 03:41
May I ask you to have a look at a Tesseract-related question here: https://stackoverflow.com/questions/66946835/improving-accuracy-in-python-tesseract-ocr – Istiaque Ahmed Apr 05 '21 at 08:47
Could also use `string.digits` (from the Python `string` module) rather than hardcoding the digits. https://docs.python.org/3/library/string.html – Raleigh L. Oct 28 '22 at 04:40
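The `string.digits` suggestion from the comments could look roughly like this. It is only a sketch: the variable names are illustrative, and the psm/oem values are simply the ones from the accepted answer.

```python
import string

# string.digits is the constant "0123456789" from the standard library.
whitelist = string.digits
config = f"--psm 10 --oem 3 -c tessedit_char_whitelist={whitelist}"
print(config)

# The same idea extends to other character sets, e.g. uppercase hex digits.
hex_whitelist = string.digits + "ABCDEF"
hex_config = f"--psm 10 --oem 3 -c tessedit_char_whitelist={hex_whitelist}"
print(hex_config)
```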

score 14 · Answer 2 · answered Jul 05 '21 at 16:02

Page segmentation modes:

  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR. (not implemented)
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.

OCR Engine modes:

  0    Legacy engine only.
  1    Neural nets LSTM engine only.
  2    Legacy + LSTM engines.
  3    Default, based on what is available.
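Since both mode lists are plain integers (psm 0 to 13, oem 0 to 3), one way to avoid magic numbers is to name them. The sketch below uses made-up member names, chosen here for readability and not taken from Tesseract or pytesseract; `mode_flags` is likewise a hypothetical helper.

```python
from enum import IntEnum


class PSM(IntEnum):
    """Page segmentation modes; member names are illustrative labels."""
    OSD_ONLY = 0
    AUTO_WITH_OSD = 1
    AUTO_NO_OSD_NO_OCR = 2
    AUTO = 3
    SINGLE_COLUMN = 4
    SINGLE_BLOCK_VERTICAL = 5
    SINGLE_BLOCK = 6
    SINGLE_LINE = 7
    SINGLE_WORD = 8
    CIRCLE_WORD = 9
    SINGLE_CHAR = 10
    SPARSE_TEXT = 11
    SPARSE_TEXT_WITH_OSD = 12
    RAW_LINE = 13


class OEM(IntEnum):
    """OCR engine modes; member names are illustrative labels."""
    LEGACY_ONLY = 0
    LSTM_ONLY = 1
    LEGACY_AND_LSTM = 2
    DEFAULT = 3


def mode_flags(psm, oem):
    """Validate both modes and return the matching command-line flags."""
    # PSM(...) and OEM(...) raise ValueError for numbers outside the lists.
    return f"--psm {int(PSM(psm))} --oem {int(OEM(oem))}"


print(mode_flags(PSM.SINGLE_CHAR, OEM.DEFAULT))  # --psm 10 --oem 3
```

Passing a number that is not in the lists, such as `mode_flags(14, 3)`, raises `ValueError` before Tesseract is ever invoked.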

score 4 · Answer 3 · answered Feb 09 '19 at 22:40

The reason you are having trouble is that character restriction does not work in version 4.0. You have to force legacy mode (oem 0) to make it limit the characters it finds. This is a bug in Tesseract that the team has not yet addressed.

RALPH BURLESON


I've tried this with oem=0, and it doesn't work either. However, there are three options:
  tessedit_char_blacklist      Blacklist of chars not to recognize
  tessedit_char_whitelist      Whitelist of chars to recognize
  tessedit_char_unblacklist    List of chars to override tessedit_char_blacklist
– Vilq Aug 30 '19 at 13:40
Fixed in 4.1 I think? – jtlz2 Sep 04 '19 at 09:41
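If the installed Tesseract build does not honor the whitelist, as this answer and its comments report for 4.0, one possible workaround is to filter the OCR output in Python afterwards. The sketch below is not something the thread proposes; `digits_only` and the look-alike table are assumptions, and mapping 'O' to '0' only makes sense when the image is known to contain digits only.

```python
import string

# Assumption: these letters are misread zeros. Extend or drop this table as needed.
LOOKALIKES = str.maketrans({"O": "0", "o": "0"})


def digits_only(text):
    """Map look-alike letters to digits, then drop everything that is not 0-9."""
    mapped = text.translate(LOOKALIKES)
    return "".join(ch for ch in mapped if ch in string.digits)


raw = "1O\n"  # made-up OCR output, not a real Tesseract result
print(digits_only(raw))  # 10
```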

Matt · Answer 4 · 2021-03-19T09:53:08.217

With Tesseract version 5.0.0-alpha, the following call works (use psm=13 and oem=1 or 3):

pytesseract.image_to_string(export_image, lang='eng', config='--psm 13 --oem 1 -c tessedit_char_whitelist=ABCDEFG0123456789')

Note that the eng trained data is taken from: https://github.com/tesseract-ocr/tessdata_fast/blob/master/eng.traineddata

Note: Tested on binary input images of roughly 60x60 px containing a single character.

edited Mar 19 '21 at 09:53

answered Mar 19 '21 at 09:43

