Tesseract OCR force pattern

Question

I want to read a specific character sequence with Tesseract, as in this post: Tesseract OCR: is it possible to force a specific pattern?

I have tried the bazaar pattern matching in Tesseract with the pattern `\A\A\d\d\d`, but the OCR still recognizes other words that don't match.

I have tried to use the "tessedit_char_whitelist" parameter but I can't choose the position of the characters with that.

I run the command `tesseract image.jpg result -l eng bazaar` and I get this message:

Please provide at least 4 concrete characters at the beginning of the pattern

Invalid user pattern \A\A\d\d\d

Tesseract Open Source OCR Engine v3.01 with Leptonica

image.jpg :

The result :

  AB123
  ABC12
  A1234
  12345
  ABCD1

So it is wrong, I just wanted to catch the sequence "AB123".

Can somebody tell me why the regular expression in my user-patterns file has no effect? For the configuration, I have strictly followed the bazaar tutorial.

I believe this error, _Please provide at least 4 concrete characters at the beginning of the pattern_, pretty much explains itself. This is probably a limitation of whatever you are using. Also try `\w\w\d\d\d`; `\A` is not what you want for all "characters". Try it [here](https://regex101.com/r/uQ3oQ9/1). — Asunez, Aug 07 '15 at 09:42
I tried `\w\w\d\d\d` and I get the same error: _Please provide at least 4 concrete characters at the beginning of the pattern_, _Invalid user pattern \w\w\d\d\d_. — leoden, Aug 07 '15 at 09:52
I have added 4 concrete characters to my pattern, `TEST\w\w\d\d\d`, and tested with the words `TESTAB123 TESTABC12` etc. I no longer get the error _Please provide at least 4 concrete characters at the beginning of the pattern_, but I still get _Invalid user pattern TEST\w\w\d\d\d_. I don't understand why it is invalid. — leoden, Aug 07 '15 at 10:11
It is because `\w\w` is not recognized by Tesseract. I tried `\c\c` instead and the error message is gone. But the result is still wrong; it is as if Tesseract totally ignores the regex... — leoden, Aug 07 '15 at 10:14
Did you try `[A-Z][A-Z][0-9][0-9][0-9]`? Did you define it in `/path/to/eng.user-patterns`? Does */path/to/configs/bazaar* contain `user_patterns_suffix user-patterns`? Just guessing... — Wiktor Stribiżew, Aug 07 '15 at 12:33
Yes and yes. The result is the same. There is no error, it just does nothing. I'm on Windows 8, by the way, and I am editing the file with Unix line endings in Notepad2. — leoden, Aug 07 '15 at 13:00
This feature most probably doesn't work anymore. https://github.com/tesseract-ocr/tesseract/issues/960 — NightFury13, Apr 19 '18 at 09:32

hashtagjet · Answer 1 · 2019-08-12T03:42:13.950

-1

Try using this pattern with quantifiers instead.

[a-zA-Z]{2}\d{3}

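This should match exactly 2 alphabetical characters followed by 3 digits.

Your earlier attempt matched everything because \w matches any alphanumeric character (letters, digits and underscore), so it does not restrict the first two positions to letters.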
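
If the user-patterns file keeps being ignored, one workaround is to filter the OCR output afterwards. Here is a minimal sketch in Python, assuming the recognized text is already available as a string (the function name `filter_ocr_lines` is made up for illustration). The expression is standard regex syntax, and the comments above suggest Tesseract's pattern file does not accept `\w`, so it is applied here to the output rather than passed to Tesseract:

```python
import re
from typing import List

# Standard regex: exactly two ASCII letters followed by exactly three digits.
PATTERN = re.compile(r"[a-zA-Z]{2}\d{3}")


def filter_ocr_lines(ocr_text: str) -> List[str]:
    """Return the stripped lines of ocr_text that match PATTERN in full.

    Hypothetical helper: it only post-processes text, it does not call Tesseract.
    """
    matches = []
    for line in ocr_text.splitlines():
        candidate = line.strip()
        # fullmatch rejects longer words such as "ABC12" or "ABCD1".
        if PATTERN.fullmatch(candidate):
            matches.append(candidate)
    return matches


# The text Tesseract returned in the question.
ocr_text = """\
  AB123
  ABC12
  A1234
  12345
  ABCD1
"""

print(filter_ocr_lines(ocr_text))  # ['AB123']
```

Using `fullmatch` instead of `search` is what keeps words like `TESTAB123` out, since the whole word has to fit the pattern.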

edited Aug 12 '19 at 03:42

answered Aug 11 '19 at 10:20
