
For a project I am trying to take a picture with a webcam and then perform OCR on it via the Tess4j API, which uses the Tesseract DLLs.

I built a Tess4j jar from source and included it in my project. I added the liblept168.dll and libtesseract302.dll files as well as the tessdata configuration provided with Tess4j. During my first tests I found that Tess4j is not very accurate at recognizing the text. More specifically, for the picture at https://i.stack.imgur.com/25ACm.png it finds the text

The fqUíCk) brown ifoxš
Jumps! over the

$3,456.78 <lazy>?90 dog
& duck/goose, as 12.5%
of E-mail from aspammerß

website.com is spam? ;

Next I tried to train Tesseract to improve accuracy. For this I installed Tesseract 3.02 on my Windows 8.1 machine and created a box file from the command line. When I open it in CowBoxer, the text is recognized as

The (quiCk) brown {fox}
Jumps! over the
$3,456.78 <1azy>fi90 dog
& duck/goose, as 12.5%
of E-mai1 from aspammer@
website.com is spam?

This is a very significant difference. It confuses 1 and l, but that is to be expected because they look identical in the font I used. How can there be such a big difference, and more specifically, how can I make Tess4j use the same configuration as my local Tesseract? For portability reasons I cannot install Tesseract on the machine the program will run on.

Stijnvdk
  • What exact command did you use to make the box file? Set the same variable in Tess4J and try it again. Btw, the image resolution is too low; try to rescale it to 300 DPI for optimal results. – nguyenq Dec 19 '13 at 01:06
  • Just running "tesseract test.png out" writes the text to out.txt. In Tess4j I can't for the life of me find out exactly what command it runs, because it works via com.sun.jna.Library, which seems to just delegate its calls to the DLL file, and I don't have any experience with that. – Stijnvdk Dec 19 '13 at 09:19
  • Your Tesseract version may be different from the one bundled with Tess4J, which invokes Tesseract API exposed by its C-API interface. – nguyenq Dec 20 '13 at 15:07
  • @nguyenq It is both version 3.02 – Stijnvdk Dec 20 '13 at 15:23
  • They have different revision numbers. Also, Tess4J bundles an older, smaller-size English language data. – nguyenq Dec 21 '13 at 16:37
  • I tried copying the tessdata folder from Tesseract to Tess4j, but it still doesn't recognize the text very well. I guess I will write my own wrapper around the EXE then; I find Tess4j quite useless. – Stijnvdk Jan 06 '14 at 09:54
  • Did you manage to get it right? – John x Dec 28 '14 at 21:54
  • As I said, I wrote my own wrapper on the exe. See http://stackoverflow.com/questions/5604698/java-programming-call-an-exe-from-java-and-passing-parameters. Also, this question is a year old. – Stijnvdk Dec 29 '14 at 09:40
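
The wrapper mentioned in the last comment was not posted, so the following is only a rough sketch of what such a wrapper around the executable might look like. It assumes a Tesseract executable is reachable at the path you pass in, and it relies on the behaviour described above, where `tesseract test.png out` writes the text to out.txt. The class name `TesseractCli` and its methods are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: runs the Tesseract executable as an external process.
final class TesseractCli {

    // Builds the same command line as typing: tesseract <image> <outputBase>
    static List<String> buildCommand(String exePath, String image, String outputBase) {
        return Arrays.asList(exePath, image, outputBase);
    }

    // Runs the executable and returns the contents of <outputBase>.txt.
    static String runOcr(String exePath, Path image, Path outputBase)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(
                buildCommand(exePath, image.toString(), outputBase.toString()));
        // Forward the child's console output to ours so it cannot block on a full pipe.
        pb.inheritIO();
        int exitCode = pb.start().waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        // The executable appends ".txt" to the output base name.
        Path textFile = Paths.get(outputBase.toString() + ".txt");
        return new String(Files.readAllBytes(textFile), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: TesseractCli <path-to-tesseract> <image>");
            return;
        }
        // "out" is an arbitrary output base name used for this example.
        System.out.println(runOcr(args[0], Paths.get(args[1]), Paths.get("out")));
    }
}
```

Because this calls the same executable as the command line, it should produce the same text as the local install. For the portability requirement, the executable and its tessdata folder would presumably have to be shipped alongside the program, which is an assumption about deployment rather than something tested here.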
