Tess4j and Tesseract training give different results

Question

For a project, I am trying to take a picture with a webcam and then perform OCR on it via the Tess4j API, which uses the Tesseract DLLs.

I built a Tess4j jar from source and included it in my project. I added the liblept168.dll and libtesseract302.dll files, as well as the tessdata configurations provided with Tess4j. During my first tests I found that Tess4j is not very accurate at recognizing the provided text. More specifically, for the picture at https://i.stack.imgur.com/25ACm.png it finds the text:

The fqUíCk) brown ifoxš
Jumps! over the

$3,456.78 <lazy>?90 dog
& duck/goose, as 12.5%
of E-mail from aspammerß

website.com is spam? ;

I then tried to train Tesseract to improve accuracy. For this I installed Tesseract 3.02 on my Windows 8.1 machine and created a box file from the command line. Opening that box file in CowBoxer shows the text was recognized as:

The (quiCk) brown {fox}
Jumps! over the
$3,456.78 <1azy>fi90 dog
& duck/goose, as 12.5%
of E-mai1 from aspammer@
website.com is spam?

This is a very significant difference. The second result swaps 1 and l, but that is to be expected because these characters look the same in the font I used. How can there be such a big difference and, more specifically, how can I make Tess4j use the same config as my local Tesseract? For portability reasons I cannot install Tesseract on the machine the program will be running on.

What exact command did you use to make the box file? Set the same variable in Tess4J and try it again. Btw, the image resolution is too low; try to rescale it to 300 DPI for optimal results. — nguyenq, Dec 19 '13 at 01:06
Just running "tesseract test.png out" outputs the text to out.txt. In Tess4j I can't for the life of me find out exactly what command it runs, because it works via com.sun.jna.Library, which seems to just delegate its calls to the DLL file, and I don't have any experience with that. — Stijnvdk, Dec 19 '13 at 09:19
Your Tesseract version may be different from the one bundled with Tess4J, which invokes the Tesseract API through its C API. — nguyenq, Dec 20 '13 at 15:07
They have different revision numbers. Also, Tess4J bundles an older, smaller English language data file. — nguyenq, Dec 21 '13 at 16:37
I tried copying the tessdata folder from Tesseract to Tess4j, but it still doesn't recognize the text very well. I guess I will write my own wrapper around the EXE then. I find Tess4j quite useless. — Stijnvdk, Jan 06 '14 at 09:54
As I said, I wrote my own wrapper on the exe. See http://stackoverflow.com/questions/5604698/java-programming-call-an-exe-from-java-and-passing-parameters. Also, this question is a year old. — Stijnvdk, Dec 29 '14 at 09:40
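
The wrapper itself is not shown in the thread. A minimal sketch of what such a wrapper might look like, assuming a tesseract executable is available at the path passed in and that it accepts the same "tesseract test.png out" style of invocation quoted above. The class and method names (TesseractCli, buildCommand, recognize) are hypothetical and not taken from the author's code, and the "-l eng" language option is an assumption about the desired configuration:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical wrapper around the tesseract executable (not the author's code).
class TesseractCli {

    // Builds: <exe> <image> <outputBase> -l <lang>
    // tesseract appends ".txt" to outputBase when writing the result.
    static List<String> buildCommand(String exe, Path image, Path outputBase, String lang) {
        List<String> cmd = new ArrayList<>();
        cmd.add(exe);
        cmd.add(image.toString());
        cmd.add(outputBase.toString());
        cmd.add("-l");
        cmd.add(lang);
        return cmd;
    }

    // Runs the executable and returns the recognized text.
    static String recognize(String exe, Path image, String lang)
            throws IOException, InterruptedException {
        Path outputBase = Files.createTempFile("ocr", "");
        Path textFile = Paths.get(outputBase.toString() + ".txt");
        try {
            Process process = new ProcessBuilder(buildCommand(exe, image, outputBase, lang))
                    .inheritIO() // forward tesseract's console output; avoids a blocked pipe
                    .start();
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException("tesseract exited with code " + exitCode);
            }
            return new String(Files.readAllBytes(textFile), StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(textFile);
            Files.deleteIfExists(outputBase);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: TesseractCli <path-to-tesseract-exe> <image>");
            return;
        }
        System.out.println(recognize(args[0], Paths.get(args[1]), "eng"));
    }
}
```

Because the executable path is a parameter, the tesseract binary and its tessdata folder could in principle be shipped next to the program rather than installed system-wide, which is the portability constraint raised in the question. Whether that works on a given machine is not confirmed in the thread.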

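nguyenq's first comment suggests rescaling the image to 300 DPI before running OCR. A rough sketch of one way to do that in Java with bicubic interpolation follows. The class name DpiRescaler is hypothetical, and the source DPI is an assumption the caller has to supply (for example 96 for a typical screen capture), since the thread does not state the resolution of the webcam image:

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Hypothetical helper: upscales an image so its pixel dimensions match a target DPI.
class DpiRescaler {

    static BufferedImage rescale(BufferedImage src, double sourceDpi, double targetDpi) {
        if (sourceDpi <= 0 || targetDpi <= 0) {
            throw new IllegalArgumentException("DPI values must be positive");
        }
        double factor = targetDpi / sourceDpi;
        int width = Math.max(1, (int) Math.round(src.getWidth() * factor));
        int height = Math.max(1, (int) Math.round(src.getHeight() * factor));

        BufferedImage dst = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = dst.createGraphics();
        try {
            // Bicubic interpolation tends to keep glyph edges smoother than nearest-neighbour.
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BICUBIC);
            g.drawImage(src, 0, 0, width, height, null);
        } finally {
            g.dispose();
        }
        return dst;
    }
}
```

The result can be written out with javax.imageio.ImageIO.write and passed to either Tess4j or the tesseract executable. Upscaling does not add detail, so how much it improves recognition on this particular image is untested here.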
0 Answers