5

I'm using Tess4J (JNA wrapper around tesseract), and trying to call tess.doOCR(myFile) to OCR text from a single-page PDF.

I have GhostScript installed (via yum install ghostscript), and gs -h works correctly.

My app server runs a 64-bit JVM, and I have gsdll64.dll and the 64-bit Tesseract DLLs liblept168.dll and libtesseract302.dll on the class path.

When tess.doOCR(myFile) is called, this is logged:

GPL Ghostscript 8.70 (2014-09-22)
Copyright (C) 2014 Artifex Software, Inc.  All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Processing pages 1 through 1.
Page 1

But then it just stops there. The program doesn't go any further.

UPDATE --

It looks like the real issue is from this error:

java.lang.UnsatisfiedLinkError: Unable to load library 'tesseract': Native library (linux-x86-64/libtesseract.so) not found in resource path

After looking around a lot, I don't see a convenient place to find this libtesseract.so file, and I'm not sure what it takes to get this onto my Linux app server. I read that maybe I need to download some C++ runtime, but I don't see a Linux download for that. Any advice would be much appreciated.

Or is this something to do with a symbolic link?

Don Cheadle

4 Answers

5

The fix was simple for me: just run sudo apt-get install tesseract-ocr from the command line. On Linux you don't need to worry about the DLL libraries or the JVM version. Installing Tesseract with apt-get will do the trick.
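One way to confirm that the package install actually put a Tesseract shared library on the machine is to look for it from Java before calling into Tess4J. The following is only a sketch using the standard library: the class name `FindNativeLib`, the helper `findSharedLibrary`, and the list of directories are assumptions for illustration, and the real install location depends on the distribution.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.stream.Stream;

public class FindNativeLib {

    // Returns the first file in the given directories whose name starts with
    // "lib<baseName>.so". This also matches versioned names such as
    // libtesseract.so.3. Hypothetical helper, not part of Tess4J or JNA.
    static Optional<Path> findSharedLibrary(List<Path> dirs, String baseName) {
        String prefix = "lib" + baseName + ".so";
        for (Path dir : dirs) {
            if (!Files.isDirectory(dir)) {
                continue; // skip directories that do not exist on this system
            }
            try (Stream<Path> entries = Files.list(dir)) {
                Optional<Path> hit = entries
                        .filter(p -> p.getFileName().toString().startsWith(prefix))
                        .sorted()
                        .findFirst();
                if (hit.isPresent()) {
                    return hit;
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Assumed candidate directories; adjust them for your distribution.
        List<Path> dirs = Arrays.asList(
                Paths.get("/usr/lib"),
                Paths.get("/usr/lib64"),
                Paths.get("/usr/lib/x86_64-linux-gnu"),
                Paths.get("/usr/local/lib"));
        Optional<Path> lib = findSharedLibrary(dirs, "tesseract");
        System.out.println(lib.map(p -> "Found " + p)
                .orElse("libtesseract.so not found in " + dirs));
    }
}
```

If nothing is found, the UnsatisfiedLinkError from the question is the expected outcome, since the loader has no libtesseract.so to pick up.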

Jimmy
  • Yeah looking back, the issue (I think) was I was using `yum` package manager (on some kind of RedHat or something), and tesseract-ocr was not a convenient download. Recalling, it was a nightmare to get it to work without having it available through package management. I definitely think switching to Ubuntu or something debian (with `apt-get`) makes life a lot easier to get tesseract working... – Don Cheadle May 14 '15 at 20:55
2

Tess4J should include required libraries. However, you need to extract them first.

This should do the trick:

File tmpFolder = LoadLibs.extractTessResources("win32-x86-64"); // replace platform
System.setProperty("java.library.path", tmpFolder.getPath());

You should replace the argument of extractTessResources(..) with your platform. You can find possible options by looking into the Tess4J jar file.
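The platform folder names follow an "<os>-<arch>" pattern, as seen in the question's error message (linux-x86-64) and in the snippet above (win32-x86-64). A rough sketch of deriving such a name from the JVM's system properties might look like this. The class `PlatformPrefix` and the method `resourcePrefix` are hypothetical, and the mapping only covers the two cases that appear in this thread. JNA itself exposes `Platform.RESOURCE_PREFIX`, which is the safer thing to rely on if JNA is already on the class path.

```java
import java.util.Locale;

public class PlatformPrefix {

    // Approximates the "<os>-<arch>" folder names seen in this thread.
    // Hypothetical helper: only Windows and Linux on x86/x86-64 are mapped,
    // anything else is passed through in lower case.
    static String resourcePrefix(String osName, String osArch) {
        String os = osName.toLowerCase(Locale.ROOT);
        String arch = osArch.toLowerCase(Locale.ROOT);

        String osPart;
        if (os.startsWith("windows")) {
            osPart = "win32";
        } else if (os.startsWith("linux")) {
            osPart = "linux";
        } else {
            osPart = os.replace(' ', '-');
        }

        String archPart;
        if (arch.equals("amd64") || arch.equals("x86_64")) {
            archPart = "x86-64";
        } else if (arch.equals("i386") || arch.equals("i686") || arch.equals("x86")) {
            archPart = "x86";
        } else {
            archPart = arch;
        }
        return osPart + "-" + archPart;
    }

    public static void main(String[] args) {
        // Prints the prefix for the JVM this program is running on.
        System.out.println(resourcePrefix(
                System.getProperty("os.name"), System.getProperty("os.arch")));
    }
}
```

Whether the Tess4J jar actually contains a folder with the resulting name still has to be checked by opening the jar, as described above.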

This way you do not need to install Tesseract on your system.

Recently I wrote a blog post about Tess4J in which I used this technique. Maybe it can help if you need further information or a running example project.

micha
1

Those DLLs are for Windows. For Linux, you'll need to install Tesseract from a package or build it from source.

That GS version, 8.70, is quite old. The latest Ghost4J library that Tess4J uses is not compatible with that.

nguyenq
  • is it possible to specify a different version when executing `yum install ghostscript`? otherwise, what is the simplest way to install GhostScript on Linux without `yum install`? p.s. thank you for so actively helping those trying to work with Tess4J here on SO and other places – Don Cheadle Oct 27 '14 at 00:16
  • Looks like you have to [build](http://ghostscript.com/doc/current/Make.htm#Unix_build) it from the [source](http://downloads.ghostscript.com/public/), if the latest is not available from the repository. – nguyenq Oct 28 '14 at 03:36
  • I switched from a Red Hat distro to Ubuntu and it made the process **so** much easier to install tesseract and ghostscript. `apt-get install tesseract` got tesseract 3.03 setup and working, and `apt-get install ghostscript` got ghostscript 9.10 working fine. Dumb question: if tesseract is installed and working on its own, and ghostscript, do I only need the JAR's from Tess4J? (and not the traineddata, tessdata folder, DLL's, other stuff) – Don Cheadle Oct 28 '14 at 15:18
  • Yes, you do. Make sure to use a compatible version with your Tesseract version. – nguyenq Oct 28 '14 at 22:56
  • From my experience on Ubuntu 14.04 LTS, all I needed to do was `apt-get install tesseract-ocr` and `ghostscript`. Then I pointed the TESSDATA_PREFIX env variable to the directory where `apt-get` installed tesseract (but I still needed to call setDataPath on my Tess4J instance, even though the env var existed...). Then I included the JARs that came with Tess4J's download (tess4j, ghostscript, log4j, imageio) on the class path... and that's all it took to get it working. So it seems `apt-get install tesseract-ocr` got me the proper DLL's, and eng.traineddata... – Don Cheadle Oct 29 '14 at 14:33
  • Proper `.so`, not `.dll`. – nguyenq Oct 29 '14 at 23:36
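The comments above mention pointing TESSDATA_PREFIX at the installed data and still having to call setDataPath. A sketch of resolving that directory in plain Java, before handing it to the Tess4J instance, could look like the following. The class `TessdataLocator`, the method `findDataPath`, and the fallback directory are assumptions for illustration. Whether setDataPath expects the tessdata folder itself or its parent may depend on the Tesseract version, so verify that against your setup.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class TessdataLocator {

    // Returns the first candidate directory that contains <language>.traineddata.
    // Candidates are taken from TESSDATA_PREFIX first, then from the fallbacks.
    // Hypothetical helper, not part of Tess4J.
    static Optional<Path> findDataPath(Map<String, String> env,
                                       List<Path> fallbacks,
                                       String language) {
        List<Path> candidates = new ArrayList<>();
        String prefix = env.get("TESSDATA_PREFIX");
        if (prefix != null && !prefix.isEmpty()) {
            // The variable may point at the tessdata folder or at its parent.
            candidates.add(Paths.get(prefix));
            candidates.add(Paths.get(prefix, "tessdata"));
        }
        candidates.addAll(fallbacks);

        for (Path dir : candidates) {
            if (Files.isRegularFile(dir.resolve(language + ".traineddata"))) {
                return Optional.of(dir);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<Path> fallbacks = new ArrayList<>();
        // Assumed install location; check where your package manager put the data.
        fallbacks.add(Paths.get("/usr/share/tesseract-ocr/tessdata"));

        Optional<Path> dataPath = findDataPath(System.getenv(), fallbacks, "eng");
        // With Tess4J on the class path, the resolved directory (or its parent)
        // would be passed to setDataPath on the Tess4J instance.
        System.out.println(dataPath.map(Path::toString)
                .orElse("eng.traineddata not found"));
    }
}
```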
0
sudo apt-get update
sudo apt-get install tesseract-ocr 

Then download the tessdata (trained data) files with git:

https://github.com/tesseract-ocr/tessdata
benderalex5
  • It is not clear how your answer addresses the question. Why will downloading test data correct resource not found in path? – Simon.S.A. Apr 12 '20 at 22:15
  • It installs tesseract that contains the library in question and adds it to the library path afaik. – keiki Mar 02 '21 at 18:58