For a project I am trying to take a picture using a webcam and then perform OCR on it via the Tess4J API, which uses the Tesseract DLLs.
I built a Tess4J jar from source and included it in my project. I added the liblept168.dll and libtesseract302.dll files as well as the tessdata configuration provided with Tess4J. During my first tests I found that Tess4J is not very accurate at recognizing the text. More specifically, for the picture at https://i.stack.imgur.com/25ACm.png it finds the text:
The fqUíCk) brown ifoxš
Jumps! over the
$3,456.78 <lazy>?90 dog
& duck/goose, as 12.5%
of E-mail from aspammerß
website.com is spam? ;
Next I tried to train Tesseract to improve accuracy. For this I installed Tesseract 3.02 on my Windows 8.1 machine and created a box file from the command line. When I open that box file in CowBoxer, the recognized text is:
The (quiCk) brown {fox}
Jumps! over the
$3,456.78 <1azy>fi90 dog
& duck/goose, as 12.5%
of E-mai1 from aspammer@
website.com is spam?
This is a very significant difference. The second result confuses 1 and l, but that is to be expected since the two look the same in the font I used. How can there be such a big difference and, more specifically, how can I make Tess4J use the same configuration as my local Tesseract? For portability reasons I cannot install Tesseract on the machine the program will run on.
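To put a number on the difference, here is a minimal sketch that computes the character-level edit distance (Levenshtein) between the first line of each result. It uses only the Java standard library; the class name `OcrDiff` and its method are my own illustration and are not part of Tess4J or Tesseract. The non-ASCII characters from the Tess4J output are written as Unicode escapes so the file compiles regardless of source encoding.

```java
final class OcrDiff {

    /** Number of single-character insertions, deletions or substitutions needed to turn a into b. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting i characters of a gives the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                int insert = curr[j - 1] + 1;
                int delete = prev[j] + 1;
                int substitute = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insert, delete), substitute);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // First line as returned through Tess4J (\u00ED = i-acute, \u0161 = s-caron).
        String viaTess4j = "The fqU\u00EDCk) brown ifox\u0161";
        // First line as shown in the box file created by the command-line Tesseract 3.02.
        String viaCommandLine = "The (quiCk) brown {fox}";

        System.out.println("Edit distance, line 1: " + levenshtein(viaTess4j, viaCommandLine));
    }
}
```

Running the same comparison over all six lines would give a rough per-line measure of how far the two results are apart.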