I'm using Tesseract and I want to develop an app that can recognize a sequence of characters. My results so far are good, but not excellent.

The character sequence I want to read always follows a specific pattern, let's say:

number number number char char (e.g. 123AB)

Is there a way to "tell" the OCR engine that the structure is always fixed, in order to improve the recognition results?

Thank you in advance.

stei2348
  • This post ["Limit characters tesseract is looking for"](http://stackoverflow.com/questions/2363490/limit-characters-tesseract-is-looking-for) may be of some use to you – DMK Feb 13 '13 at 16:48
  • Thank you, I had a look, but it didn't help. The point is that in my sequence I can have every possible char [A-Z] and numbers [0-9], so I cannot use any limitation. The only information I have is that the first 3 characters are numbers, while the last 2 are characters. – stei2348 Feb 14 '13 at 09:27
  • @stei2348: you can do some post-processing of the resulting string, for example converting I to 1 and vice versa. Or preprocess the source image. – Karol S Dec 09 '13 at 23:52
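The post-processing idea from the last comment can be sketched as follows. This is only an illustration: the function name `fix_by_position` and the confusion tables are assumptions made up for this example, not part of Tesseract, and the tables would need tuning against the mistakes actually seen in your images.

```python
import string

# Illustrative confusion tables (assumed, not exhaustive): characters that
# OCR commonly mixes up between digits and uppercase letters.
TO_DIGIT = {"O": "0", "Q": "0", "D": "0", "I": "1", "L": "1",
            "Z": "2", "S": "5", "G": "6", "B": "8"}
TO_LETTER = {"0": "O", "1": "I", "2": "Z", "5": "S", "6": "G", "8": "B"}

ALLOWED = set(string.digits + string.ascii_uppercase)


def fix_by_position(text, n_digits=3, n_letters=2):
    """Force the first n_digits characters to digits and the rest to letters.

    Returns the corrected string, or None when the OCR output cannot be
    mapped onto the expected pattern.
    """
    # Drop whitespace, punctuation and anything outside [0-9A-Z].
    chars = [c for c in text.upper() if c in ALLOWED]
    if len(chars) != n_digits + n_letters:
        return None
    head = "".join(TO_DIGIT.get(c, c) for c in chars[:n_digits])
    tail = "".join(TO_LETTER.get(c, c) for c in chars[n_digits:])
    # Reject results that still violate the pattern after substitution.
    if all(c in string.digits for c in head) and \
            all(c in string.ascii_uppercase for c in tail):
        return head + tail
    return None
```

Returning `None` instead of a best guess lets the caller decide whether to retry with a different preprocessing step or flag the image for manual review.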

2 Answers

Try the pattern matching (user patterns) enabled by the "bazaar" config in Tesseract:

\d\d\d\c\c
jtlz2
nguyenq
  • According to the doc: "Note: In choosing which patterns to include please be aware of the fact providing very generic patterns will make tesseract run slower... Because of potential problems with speed that could be difficult to identify, each user pattern has to have at least kSaneNumConcreteChars concrete characters from the unicharset at the beginning." This means the pattern will be ignored because it has fewer than 4 concrete characters; 4 is the current hardcoded value of kSaneNumConcreteChars. – mschmoock Jul 29 '15 at 08:58
  • I just had a look at GitHub, and it seems that kSaneNumConcreteChars is now hardcoded as "0". See for yourself here: https://github.com/tesseract-ocr/tesseract/blob/master/dict/trie.h – katzenhut Jan 27 '17 at 15:29
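A minimal sketch of how the pattern above might be handed to the Tesseract command line. It assumes a Tesseract build that accepts the `--user-patterns` and `--psm` options; the helper name `build_pattern_command` and the image name are hypothetical. Whether the pattern is actually honored depends on the Tesseract version and engine, as the comments above about kSaneNumConcreteChars suggest, so treat this as something to experiment with rather than a guaranteed fix.

```python
import os
import tempfile


def build_pattern_command(image_path, pattern=r"\d\d\d\c\c", psm=7):
    """Write a one-line user-patterns file and build a tesseract command.

    Returns (command, patterns_path). The command is only constructed
    here, not executed.
    """
    # One pattern per line; \d is a digit and \c an alphabetic character
    # in Tesseract's user-pattern syntax.
    fd, patterns_path = tempfile.mkstemp(suffix=".user-patterns", text=True)
    with os.fdopen(fd, "w") as handle:
        handle.write(pattern + "\n")
    command = [
        "tesseract", image_path, "stdout",
        "--user-patterns", patterns_path,
        "--psm", str(psm),  # 7 = treat the image as a single text line
    ]
    return command, patterns_path
```

With Tesseract installed, the command can be run with `subprocess.run(command, capture_output=True, text=True)` and the recognized text read from `stdout`.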
You can use the "tessedit_char_whitelist" parameter to restrict the characters Tesseract is allowed to output.
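As a rough sketch, the whitelist can be passed on the command line with `-c`, and the result checked against the fixed pattern afterwards. The helper names `whitelist_args` and `matches_pattern` are made up for this example. Note that a whitelist of [0-9A-Z] only removes lowercase letters and punctuation from consideration; it does not by itself enforce "three digits, then two letters", which is why the regular-expression check is added.

```python
import re
import string


def whitelist_args(allowed=string.digits + string.ascii_uppercase):
    """Return command-line arguments that set tessedit_char_whitelist."""
    return ["-c", "tessedit_char_whitelist=" + allowed]


def matches_pattern(text):
    """True if text (ignoring surrounding whitespace) is 3 digits + 2 letters."""
    return re.fullmatch(r"[0-9]{3}[A-Z]{2}", text.strip()) is not None
```

The arguments would be appended to a call such as `["tesseract", "image.png", "stdout"] + whitelist_args()`, and any output that fails `matches_pattern` can be rejected or sent to further correction.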