
Does anyone know how to use user patterns (user_patterns_suffix) in Tesseract? Could you advise me on how to set them up and how to test that they are working? I tried to follow the Tesseract guide (Tesseract user-patterns), but it did not affect the result at all.

Thanks.

Nicolas Gervais
kha nguyen
  • Did you try to append the `bazaar` config file? See [tesseract(1)](http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html#_config_files_and_augmenting_with_user_data) – pvorb Sep 05 '14 at 09:47

1 Answer


Tesseract treats a pattern as a sort of limited "regular expression". It can be used if, let's say, you were scanning a book with data that was all in the same format. A pattern tells Tesseract what formats to expect, much like how it expects the words listed in user-words. Below is how Tesseract describes how to use patterns:

Each pattern can contain any non-whitespace characters, however only the patterns that contain characters from the unicharset of the corresponding language will be useful.

The only meta character is \. To be used in a pattern as an ordinary string it should be escaped with \ (e.g. string C:\Documents should be written in the patterns file as C:\\Documents).

This function supports a very limited regular expression syntax. One can express a character, a certain character class and a number of times the entity should be repeated in the pattern.

To denote a character class use one of:

  • \c - unichar for which UNICHARSET::get_isalpha() is true (character)
  • \d - unichar for which UNICHARSET::get_isdigit() is true
  • \n - unichar for which UNICHARSET::get_isdigit() or UNICHARSET::get_isalpha() is true (alphanumeric)
  • \p - unichar for which UNICHARSET::get_ispunct() is true
  • \a - unichar for which UNICHARSET::get_islower() is true
  • \A - unichar for which UNICHARSET::get_isupper() is true

\* could be specified after each character or pattern to indicate that the character/pattern can be repeated any number of times before the next character/pattern occurs.

Examples:

1-8\d\d-GOOG-411 will be expanded to strings: 1-800-GOOG-411, 1-801-GOOG-411, ... 1-899-GOOG-411.

"ww.\n\*.com" will be expanded to strings like: "ww.a.com" "ww.a123.com" ... "ww.ABCDefgHIJKLMNop.com"
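To make the expansion rules above concrete, here is a minimal Python sketch that translates a pattern line into an ordinary `re` regex. This is not how Tesseract implements it (Tesseract builds a dawg from the patterns); `pattern_to_regex` and `CLASSES` are hypothetical names, the character classes are ASCII approximations of the unicharset checks, and I am assuming `\*` means "zero or more".

```python
import re

# ASCII approximations of the UNICHARSET character classes (an assumption;
# Tesseract decides membership from the language's unicharset).
CLASSES = {
    "c": "[A-Za-z]",          # get_isalpha()
    "d": "[0-9]",             # get_isdigit()
    "n": "[A-Za-z0-9]",       # digit or alpha
    "p": r"[!-/:-@\[-`{-~]",  # get_ispunct(), ASCII punctuation only
    "a": "[a-z]",             # get_islower()
    "A": "[A-Z]",             # get_isupper()
}


def pattern_to_regex(pattern):
    """Translate one user-patterns line into a compiled Python regex."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            if i + 1 >= len(pattern):
                raise ValueError("dangling backslash")
            nxt = pattern[i + 1]
            if nxt == "*":
                # Repeat the previous character or class (assumed: zero or more).
                if not parts:
                    raise ValueError("\\* must follow a character or class")
                parts[-1] += "*"
            elif nxt == "\\":
                # An escaped backslash is a literal backslash.
                parts.append(re.escape("\\"))
            elif nxt in CLASSES:
                parts.append(CLASSES[nxt])
            else:
                raise ValueError(f"unknown escape \\{nxt}")
            i += 2
        else:
            # Every other character stands for itself.
            parts.append(re.escape(ch))
            i += 1
    return re.compile("".join(parts))


phone = pattern_to_regex(r"1-8\d\d-GOOG-411")
site = pattern_to_regex(r"ww.\n\*.com")
path = pattern_to_regex(r"C:\\Documents")
```

With this, `phone.fullmatch("1-800-GOOG-411")` succeeds while `phone.fullmatch("1-900-GOOG-411")` does not, which is a quick way to sanity-check a pattern before handing it to Tesseract.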

Note: In choosing which patterns to include, please be aware of the fact that providing very generic patterns will make Tesseract run slower. For example, \n\* at the beginning of the pattern will make Tesseract consider all the combinations of proposed character choices for each of the segmentations, which will be unacceptably slow. Because of potential problems with speed that could be difficult to identify, each user pattern has to have at least kSaneNumConcreteChars concrete characters from the unicharset at the beginning.
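As for actually applying patterns: they go into a plain text file, one pattern per line, and that file is handed to Tesseract. Below is a rough Python sketch that writes such a file and builds a command line using the `--user-patterns` option of the Tesseract 4+ command-line tool. The file names and the helper functions are hypothetical. A comment on this answer reports the feature not working with the 4.0 LSTM engine, so the sketch can also add `--oem 0` to select the legacy engine, which assumes your traineddata includes the legacy model.

```python
import subprocess
from pathlib import Path


def patterns_file_text(patterns):
    """Return the file contents: one pattern per line, newline-terminated."""
    return "\n".join(patterns) + "\n"


def build_command(image, output_base, patterns_file, lang="eng", legacy=False):
    """Build the tesseract command line as an argument list."""
    cmd = [
        "tesseract", str(image), str(output_base),
        "-l", lang,
        "--user-patterns", str(patterns_file),
    ]
    if legacy:
        # Assumption: the installed traineddata contains the legacy engine.
        cmd += ["--oem", "0"]
    return cmd


def run_ocr(image, output_base, patterns, patterns_file="my.user-patterns",
            lang="eng", legacy=False):
    """Write the patterns file, run tesseract, and return the recognized text."""
    Path(patterns_file).write_text(patterns_file_text(patterns), encoding="utf-8")
    cmd = build_command(image, output_base, patterns_file, lang, legacy)
    subprocess.run(cmd, check=True)
    # Tesseract writes plain-text output to <output_base>.txt by default.
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")


example_cmd = build_command("scan.png", "out", "my.user-patterns", legacy=True)
```

To test whether the patterns have any effect, one option is to run the same image once with and once without the `--user-patterns` arguments and diff the two output files; if they are identical, the patterns are probably not being picked up by the engine you are using.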

David Röthlisberger
Stuart N. Thomas
  • @Federinik Thanks a lot! Spent a lot of time looking for this, finally found it! – Shivam Gaur Mar 07 '17 at 05:41
  • I'm trying to use Tesseract to read name, address, and DOB from drivers licenses. Just running it against the image is not getting me good results, as the text is all run together, without even line breaks to separate things. It seems like patterns would help me since it can look in the same place every time for the DOB or the name. Can anyone help me apply this? – Tanoshimi Jun 13 '17 at 12:48
  • It looks like this feature is currently broken in Tesseract 4.0 alpha (LSTM) (also char-whitelist seems to be broken) https://github.com/tesseract-ocr/tesseract/issues/960 – NightFury13 Apr 18 '18 at 07:52
  • This does not specifically explain *how* to use patterns – Michel Apr 26 '19 at 16:04
  • Thanks for pasting from the documentation, but it doesn't explain how to use it. It is just a definition. – Muhammad Uzair Oct 08 '22 at 16:54