3

I'm trying to get Tesseract (using the Tess4J wrapper) to match only a specific pattern: four digits in a row, which I think would be \d\d\d\d. Here is a VERY small subset of the image I'm feeding Tesseract (the floorplans are restricted, so I'm hesitant to post much more of it): http://mike724.com/view/a06771

I'm using the following Java code:

    File imageFile = new File("/<redacted>/file.pdf");

    Tesseract instance = Tesseract.getInstance();
    instance.setTessVariable("load_system_dawg", "F");
    instance.setTessVariable("load_freq_dawg", "F");
    instance.setTessVariable("user_words_suffix", "");
    instance.setTessVariable("user_patterns_suffix", "\\d\\d\\d\\d");

    try {
        String result = instance.doOCR(imageFile);
        System.out.println(result);
    } catch (TesseractException e) {
        System.err.println(e.getMessage());
    }

The problem I'm running into is that Tesseract does not seem to honor these configuration options: I still get text/words in the results. I expect to get only the room numbers (e.g. 2950).

user3426373
  • Tesseract is not a parser. It just gives you what it reads. You have to pick out what you need afterwards! – Alto Jan 14 '15 at 10:02
  • Well, yeah, but I figure if I "train" tesseract that I only want numbers and only want numbers in groups of four, it would increase accuracy. Right now the accuracy is terrible, completely unusable. – user3426373 Jan 15 '15 at 14:14
  • Adding a whitelist of characters (0123456789) will help you too! – Alto Jan 15 '15 at 14:23
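
The post-filtering approach suggested in the comments can be sketched in plain Java, independent of Tess4J. This is only an illustration: the class name `RoomNumberFilter`, the method `extractRoomNumbers`, and the sample string are hypothetical, and it assumes that a room number is any run of exactly four digits that is not part of a longer digit sequence.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RoomNumberFilter {
    // Four digits that are neither preceded nor followed by another digit.
    private static final Pattern FOUR_DIGITS =
            Pattern.compile("(?<!\\d)\\d{4}(?!\\d)");

    // Returns every standalone four-digit group found in the OCR output, in order.
    public static List<String> extractRoomNumbers(String ocrText) {
        List<String> matches = new ArrayList<>();
        Matcher m = FOUR_DIGITS.matcher(ocrText);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Made-up text standing in for the string returned by doOCR.
        String sample = "CORRIDOR 2950 STAIR 12 ROOM 2951A 123456";
        System.out.println(extractRoomNumbers(sample));
    }
}
```

Filtering after the fact does not improve recognition accuracy by itself; it only discards everything that is not a four-digit group.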

2 Answers

2

You have not configured this correctly.

user_patterns_suffix is meant to indicate the file extension of a text file that contains your patterns, e.g.

user_patterns_suffix pats

would mean you need to put a file in the tesseract tessdata folder

tessdata/eng.pats

... assuming eng was the language you were using.

See more here:

http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html#_config_files_and_augmenting_with_user_data

I do recall that a user pattern must begin with at least 6 fixed characters before any pattern part, so you may not be able to accomplish this in any case, but try the correct config first.
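
As a rough sketch of the file layout described above, the following Java writes a one-line patterns file named `<language>.<suffix>` into a tessdata directory. The class `UserPatternsFile`, the method `writePatterns`, and the temporary directory are hypothetical stand-ins; in a real setup the directory would be the tessdata folder your Tesseract installation actually reads, and whether the `\d\d\d\d` pattern is accepted may depend on the minimum-fixed-characters limit mentioned above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;

public class UserPatternsFile {
    // Writes a single pattern line to <tessdataDir>/<language>.<suffix>.
    public static Path writePatterns(Path tessdataDir, String language,
                                     String suffix, String pattern) throws IOException {
        Files.createDirectories(tessdataDir);
        Path file = tessdataDir.resolve(language + "." + suffix);
        Files.write(file, Collections.singletonList(pattern), StandardCharsets.UTF_8);
        return file;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: a temp directory stands in for the real tessdata folder.
        Path tessdata = Files.createTempDirectory("tessdata-demo");
        // In the file itself the pattern is written with single backslashes: \d\d\d\d
        Path pats = writePatterns(tessdata, "eng", "pats", "\\d\\d\\d\\d");
        System.out.println("Wrote " + pats);
    }
}
```

With the file in place, `user_patterns_suffix` would be set to `pats` (the extension only), not to the pattern itself.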

PorridgeBear
  • Also you need at least kSaneNumConcreteChars characters at the start of the pattern, but, from what I can tell looking at the code, it is set to 0 (on the master branch). – user3426373 Jan 15 '15 at 14:20
0

They look like init-only parameters; as such, they need to be in a config file (for instance, one named bazaar placed under the configs folder) whose name is passed to the setConfigs method.

instance.setConfigs(Arrays.asList("bazaar"));
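
A minimal sketch of what such a config file might contain, generated from Java so the layout is explicit. It assumes the usual Tesseract config format of one `name value` pair per line and a `configs` subfolder inside tessdata; the class `BazaarConfigFile`, the method `writeConfig`, and the temporary directory are hypothetical, and the parameter values simply mirror the ones from the question with `pats` as the patterns suffix.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BazaarConfigFile {
    // Writes "name value" lines to <tessdataDir>/configs/<configName>.
    public static Path writeConfig(Path tessdataDir, String configName,
                                   Map<String, String> params) throws IOException {
        Path configsDir = tessdataDir.resolve("configs");
        Files.createDirectories(configsDir);
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, String> e : params.entrySet()) {
            lines.add(e.getKey() + " " + e.getValue());
        }
        Path file = configsDir.resolve(configName);
        Files.write(file, lines, StandardCharsets.UTF_8);
        return file;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: a temp directory stands in for the real tessdata folder.
        Path tessdata = Files.createTempDirectory("tessdata-demo");
        Map<String, String> params = new LinkedHashMap<>();
        params.put("load_system_dawg", "F");
        params.put("load_freq_dawg", "F");
        params.put("user_patterns_suffix", "pats");
        System.out.println("Wrote " + writeConfig(tessdata, "bazaar", params));
    }
}
```

The name passed to setConfigs would then be the file name, bazaar, without any path or extension.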

References:
https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc
https://github.com/tesseract-ocr/tesseract/wiki/ControlParams
http://tess4j.sourceforge.net/docs/docs-1.4/

nguyenq
  • Thank you, I didn't know (and couldn't find anything) about the setConfigs method in Tess4J. My only other problem at the moment is the kSaneNumConcreteChars limit, but for that I guess I just have to use a custom build of tesseract. – user3426373 Jan 20 '15 at 04:25