
I want to read a specific character sequence with Tesseract, as in this post: Tesseract OCR: is it possible to force a specific pattern?

I have tried the bazaar pattern matching in Tesseract with the pattern \d\d\d\A\A, but the OCR still recognizes other words that don't match.

I have also tried the "tessedit_char_whitelist" parameter, but it doesn't let me choose which characters are allowed at which position.

  • I run the command `tesseract image.jpg result -l eng bazaar` and I get this message:

Please provide at least 4 concrete characters at the beginning of the pattern

Invalid user pattern \A\A\d\d\d

Tesseract Open Source OCR Engine v3.01 with Leptonica

  • image.jpg :

enter image description here

  • The result :

      AB123
      ABC12
      A1234
      12345
      ABCD1
    

So the result is wrong: I only wanted to capture the sequence "AB123".

Can somebody tell me why the regular expression in my user-patterns file has no effect? For the configuration, I followed the bazaar tutorial exactly.

CinCout
leoden
  • I believe this error, _Please provide at least 4 concrete characters at the beginning of the pattern_, pretty much explains itself. This is probably a limitation of whatever you are using. Also try `\w\w\d\d\d`; `\A` is not what you want for all "characters". Try it [here](https://regex101.com/r/uQ3oQ9/1). – Asunez Aug 07 '15 at 09:42
  • I tried `\w\w\d\d\d` and I get the same error: _Please provide at least 4 concrete characters at the beginning of the pattern_, followed by _Invalid user pattern \w\w\d\d\d_. – leoden Aug 07 '15 at 09:52
  • I have added 4 concrete characters to my pattern, `TEST\w\w\d\d\d`, and tested it with the words `TESTAB123 TESTABC12` etc. I no longer get the error _Please provide at least 4 concrete characters at the beginning of the pattern_, but I still get _Invalid user pattern TEST\w\w\d\d\d_. I don't understand why it is invalid. – leoden Aug 07 '15 at 10:11
  • Because `\w\w` is not recognized by Tesseract. I tried `\c\c` instead and there is no more error message. But the result is still wrong; it's as if Tesseract totally ignores the regex... – leoden Aug 07 '15 at 10:14
  • Did you try `[A-Z][A-Z][0-9][0-9][0-9]`? Did you define it in `/path/to/eng.user-patterns`? Does */path/to/configs/bazaar* contain `user_patterns_suffix user-patterns`? Just guessing... – Wiktor Stribiżew Aug 07 '15 at 12:33
  • Yes and yes. The result is the same. There is no error; it just does nothing. I'm on Windows 8, by the way, and I'm editing the file with Unix line endings in Notepad2. – leoden Aug 07 '15 at 13:00
  • This feature most probably doesn't work anymore. https://github.com/tesseract-ocr/tesseract/issues/960 – NightFury13 Apr 19 '18 at 09:32

1 Answer


Try using this pattern, which uses quantifiers, instead:

[a-zA-Z]{2}\d{3}

This should match exactly 2 alphabetic characters followed by 3 digits.

The reason you were matching everything before is that \w matches any alphanumeric character (and the underscore), so it accepts digits as well as letters.
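To see the difference between the two patterns, here is a minimal Python sketch that applies them to the five lines from the question. It only filters the OCR output after the fact with Python's `re` module; it does not configure Tesseract, and whether Tesseract's user-patterns file accepts this regex syntax is not confirmed here. The names `STRICT`, `LOOSE` and `filter_matches` are made up for illustration.

```python
import re

# Two ASCII letters followed by three digits, as suggested above.
STRICT = re.compile(r"[a-zA-Z]{2}\d{3}")
# The \w variant from the comments: \w also accepts digits and "_".
LOOSE = re.compile(r"\w\w\d\d\d")

# The five lines Tesseract returned in the question.
ocr_lines = ["AB123", "ABC12", "A1234", "12345", "ABCD1"]


def filter_matches(pattern, lines):
    """Keep only the lines that match the pattern from start to end."""
    return [line for line in lines if pattern.fullmatch(line.strip())]


strict_matches = filter_matches(STRICT, ocr_lines)
loose_matches = filter_matches(LOOSE, ocr_lines)

print(strict_matches)  # ['AB123']
print(loose_matches)   # ['AB123', 'A1234', '12345']
```

`fullmatch` is used rather than `search` so that a longer string containing a valid sequence is not accepted by accident.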

hashtagjet