
I'm running a simple program that uses Tesseract and its Java wrapper library Tess4J on Mac OS X. I tried both JDK 7 and JDK 8.

The code runs OCR on an image and creates a PDF from it. It works and does what it's supposed to do (the PDF gets created just fine), but at the end I get a crash report on my Mac.

private static void testTesseract() throws Exception {
    File imageFile = new File("/Users/mln/Desktop/urkunde.jpg");
    ITesseract instance = new Tesseract();  // JNA Interface Mapping

    // http://tess4j.sourceforge.net/tutorial/

    instance.setDatapath("/Users/mln/Desktop/tessdata");
    instance.setLanguage("deu");

    try {
        String result = instance.doOCR(imageFile);
        System.out.println(result);
    } catch (TesseractException e) {
        System.err.println(e.getMessage());
    }

    // Render the same image to PDF; the output base name gets the .pdf extension appended
    List<ITesseract.RenderedFormat> list = new ArrayList<ITesseract.RenderedFormat>();
    list.add(ITesseract.RenderedFormat.PDF);
    File pdfFile = new File("/Users/mln/Desktop/urkunde.jpg");  // input image, despite the variable name
    instance.createDocuments(pdfFile.getAbsolutePath(), "/Users/mln/Desktop/urkunde", list);

}

The line causing the crash is this last one:

instance.createDocuments(pdfFile.getAbsolutePath(), "/Users/mln/Desktop/urkunde", list);

Console output:

Warning in pixReadMemJpeg: work-around: writing to a temp file
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00000001295c9f72, pid=6336, tid=5891
#
# JRE version: Java(TM) SE Runtime Environment (8.0_31-b13) (build 1.8.0_31-b13)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.31-b07 mixed mode bsd-amd64 compressed oops)
# Problematic frame:
# C  [libtesseract.dylib+0xcf72]  tesseract::TessResultRenderer::~TessResultRenderer()+0x10
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /Users/mln/Projects/jackrabbit-client/hs_err_pid6336.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

Process finished with exit code 134 (interrupted by signal 6: SIGABRT)

and the crash report:

Process:               java [6336]
Path:                  /Library/Java/JavaVirtualMachines/jdk1.8.0_31.jdk/Contents/Home/bin/java
Identifier:            net.java.openjdk.cmd
Version:               1.0 (1.0)
Code Type:             X86-64 (Native)
Parent Process:        idea [81650]
Responsible:           java [6336]
User ID:               501

Date/Time:             2016-10-28 11:09:35.377 +0200
OS Version:            Mac OS X 10.11.6 (15G1004)
Report Version:        11
Anonymous UUID:        6CF2EEC0-C9B5-315F-EB2E-5AEBDF0094FD

Sleep/Wake UUID:       F9F2D823-9374-4EC4-B8FD-9342826E1A37

Time Awake Since Boot: 600000 seconds
Time Since Wake:       10000 seconds

System Integrity Protection: enabled

Crashed Thread:        4

Exception Type:        EXC_BAD_ACCESS (SIGABRT)
Exception Codes:       EXC_I386_GPFLT
Exception Note:        EXC_CORPSE_NOTIFY

Application Specific Information:
abort() called

Complete output on pastebin: http://pastebin.com/v9gPd4hk

Mathias Conradt
  • The error appears to have originated from Leptonica. Could be something with your `libjpeg` library. – nguyenq Oct 29 '16 at 01:20
  • @nguyenq There seem to be some more general issues with Mac OS X. I tried to build the project from source, but that already fails in the unit tests with the error `[junit] 10:04:57.140 [main] ERROR net.sourceforge.tess4j.Tesseract1 - Unable to load library 'gs': Native library (darwin/libgs.dylib) not found in resource path`. Log: http://pastebin.com/Ba4wUYYu , which seems to be a known issue as well: http://stackoverflow.com/questions/21394537/tess4j-unsatisfied-link-error-on-mac-os-x – Mathias Conradt Oct 29 '16 at 08:21
  • For `gs` issue, you will need to install GhostScript; Tess4J depends on it for reading PDF files. – nguyenq Oct 29 '16 at 13:07
  • Ok. The readme.html in tess4j though reads: "Tesseract 3.04, Leptonica 1.71 (via Lept4J), and Ghostscript 9.16 32- and 64-bit DLLs, language data for English, and sample images are bundled with the library." so I thought they're already included. – Mathias Conradt Oct 29 '16 at 15:37
  • Installed GS, but the build still fails with other errors. Complete log: http://pastebin.com/ZznbkW8v - not sure if it's this line that's causing it `java(16778,0x70000021a000) malloc: *** error for object 0x7fa6cf8e3e78: pointer being freed was not allocated`. Another error I see in the log is this: `Error looking up function 'l_bootnum_gen1': dlsym(0x7fe4f0d318d0, l_bootnum_gen1): symbol not found` – Mathias Conradt Oct 29 '16 at 15:43
  • Only the Windows DLLs are included; on other platforms, users have to install the libraries themselves. 'gs' no longer seems to be an issue. Make sure you install the appropriate versions of Leptonica, libtiff, etc. If the `Tesseract1` API causes issues, stay with the `Tesseract` API. – nguyenq Oct 29 '16 at 16:35
  • Thanks for the info. In the meantime I have moved from Mac OS X to Ubuntu (a Docker container on Codenvy.io) and am trying to get it running there. It is almost working; there is just an issue with "Invalid calling convention 63" (http://pastebin.com/C8c5qkCt), but that seems to have been discussed in the Tess4J forum already and to be caused by version incompatibilities. I will figure it out... – Mathias Conradt Nov 01 '16 at 11:27
  • Saw your comment here: https://sourceforge.net/p/tess4j/discussion/1202294/thread/2a25344c/#1bf8 These are my versions: tesseract 3.04.01, leptonica-1.73, GPL Ghostscript 9.18 (2015-10-05), Tess4J 3.2.1 (also tried Tess4J 3.0.0), with which I am getting the `Invalid calling convention 63` error. – Mathias Conradt Nov 01 '16 at 11:35
  • Created a separate question for this: http://stackoverflow.com/questions/40361873/tess4j-invalid-calling-convention-63-despite-correct-versions – Mathias Conradt Nov 01 '16 at 14:15

1 Answer


I haven't tested it myself, but it looks like `createDocuments()` calls `init()` and `dispose()`, and so does `doOCR()`. You may want to try overriding these methods so that `init()` and `dispose()` are each called only once. It's kind of a shot in the dark, but it seems reasonable. For reference, this is the `createDocuments()` method in question:

@Override
public void createDocuments(String[] filenames, String[] outputbases, List<RenderedFormat> formats) throws TesseractException {
    if (filenames.length != outputbases.length) {
        throw new RuntimeException("The two arrays must match in length.");
    }

    init();
    setTessVariables();

    try {
        for (int i = 0; i < filenames.length; i++) {
            File workingTiffFile = null;
            try {
                String filename = filenames[i];

                // if PDF, convert to multi-page TIFF
                if (filename.toLowerCase().endsWith(".pdf")) {
                    workingTiffFile = PdfUtilities.convertPdf2Tiff(new File(filename));
                    filename = workingTiffFile.getPath();
                }

                TessResultRenderer renderer = createRenderers(outputbases[i], formats);
                createDocuments(filename, renderer);
                TessDeleteResultRenderer(renderer);
            } catch (Exception e) {
                // skip the problematic image file
                logger.error(e.getMessage(), e);
            } finally {
                if (workingTiffFile != null && workingTiffFile.exists()) {
                    workingTiffFile.delete();
                }
            }
        }
    } finally {
        dispose();
    }
}
Araymer
  • Thanks, I will give it a try tomorrow. Unfortunately I cannot just extend the Tesseract class, because the createRenderers and createDocuments methods are private instead of protected. So I need to build the whole thing. – Mathias Conradt Oct 28 '16 at 21:40
  • Luckily, at first glance, it doesn't look too bad. A bit of copy & paste. Good luck. – Araymer Oct 28 '16 at 21:55
  • I tried to build the project from source, but that already fails in the unit tests with the error `[junit] 10:04:57.140 [main] ERROR net.sourceforge.tess4j.Tesseract1 - Unable to load library 'gs': Native library (darwin/libgs.dylib) not found in resource path`. Log: http://pastebin.com/Ba4wUYYu , which seems to be a known issue as well: http://stackoverflow.com/questions/21394537/tess4j-unsatisfied-link-error-on-mac-os-x – Mathias Conradt Oct 29 '16 at 08:20
  • I will test it on a system other than Mac OS X and see if it works there. If so, I might just ignore it for the moment, because in the end the application will run on Linux and not on Mac anyway. This way I could also stick with the original libs from the Maven repos. – Mathias Conradt Oct 29 '16 at 08:20
  • The `init()` and `dispose()` methods get called at various points, not only in `doOCR()` and `createDocuments()`, but also in `getSegmentedRegions()` and `getWords()`. When using the library in an application, I never know which is called first. I tried commenting out the calls in `createDocuments()` and building the Tess4J project, which runs some unit tests. It fails with a NullPointerException. Commenting them out in `doOCR()` also seems to cause NullPointerExceptions. I think, if the problem lies in `init()` or `dispose()`, these methods would need to check whether the API has already been initialized. – Mathias Conradt Oct 29 '16 at 15:52
  • I think, if you're using tess4j, just extending the Tesseract class and overriding the necessary methods is probably the easier way to handle this rather than altering source code for the project. – Araymer Nov 01 '16 at 19:03
  • Also, you can extend the TessBaseApi and other classes if needed. Just dig around in the source code and write your own method using their native calls, if you need to. Don't go rebuilding stuff. You'll almost certainly break things. – Araymer Nov 01 '16 at 19:15
  • Extending does not work that easily; I would need to copy over too much code, because the methods and properties in Tesseract.java that are being called are private, e.g. https://snag.gy/uDVOg4.jpg and others (the "api" and "handler" properties are private). Since I will host the app on Linux, I switched to a Docker container with Ubuntu (also for development, doing this on Codenvy.io now), where this particular issue does not exist. Thanks for your input and ideas though. – Mathias Conradt Nov 01 '16 at 19:29
  • Yeah, so, I extended the class, copied the createDocuments method and the createRenderers method and that fixed the problem after replacing the inner createDocuments call with the TessApi call that it does, anyway. Took all of about 2 minutes. I don't think there's a way to do it that wouldn't be ridiculous aside from just copying the code and omitting the init and dispose calls where appropriate. – Araymer Nov 01 '16 at 21:09
  • Ok, I see what you mean. I could also replace `api` occurrences with `getAPI()` and `handle` with `getHandle()`. So I guess I should now have basically the same as what you did. I kept the `init()` call in `doOCR()` and removed the `dispose()` call there. In `createDocuments()` I disabled the `init()` call but kept the `dispose()` call, since `createDocuments()` is called after `doOCR()`. But I still get the same error. This is my extended class now: http://pastebin.com/SLbSTuvA (Are you also working on Mac OS X, and did you have the same error before?) – Mathias Conradt Nov 01 '16 at 21:32
  • In any case, I won't deploy to Mac, so I would rather stick with the standard library and keep working on Linux. No need to bother with Mac if that's not the final platform. – Mathias Conradt Nov 01 '16 at 21:40
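
The idea raised in the comments, that `init()` and `dispose()` should check whether the engine has already been initialized, can be illustrated with a small self-contained sketch. It does not use Tess4J at all: `FakeNativeEngine`, `GuardedSession` and their methods are hypothetical names invented for this example, and the fake engine only records calls instead of touching native memory. Whether such a guard would actually prevent the crash on Mac OS X is not established by this thread; the sketch only shows the pattern.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical stand-in for a native engine handle. NOT part of Tess4J;
 * it only records calls so the guard logic can be observed.
 */
class FakeNativeEngine {
    final List<String> calls = new ArrayList<>();
    private boolean alive = false;

    void create() {
        alive = true;
        calls.add("create");
    }

    void destroy() {
        if (!alive) {
            // A real native library might crash here instead (double free).
            throw new IllegalStateException("destroy() on a dead handle");
        }
        alive = false;
        calls.add("destroy");
    }

    String run(String task) {
        if (!alive) {
            throw new IllegalStateException("engine not initialized");
        }
        calls.add(task);
        return task + ":done";
    }
}

/** Wraps the engine so that init() and dispose() are safe to call repeatedly. */
class GuardedSession implements AutoCloseable {
    private final FakeNativeEngine engine;
    private boolean initialized = false;

    GuardedSession(FakeNativeEngine engine) {
        this.engine = engine;
    }

    /** Creates the native handle only if it does not exist yet. */
    synchronized void init() {
        if (!initialized) {
            engine.create();
            initialized = true;
        }
    }

    /** Releases the native handle only if it is currently held. */
    synchronized void dispose() {
        if (initialized) {
            engine.destroy();
            initialized = false;
        }
    }

    // Each operation makes sure the engine exists but never disposes it itself.
    String doOcr(String image) {
        init();
        return engine.run("ocr " + image);
    }

    String createDocument(String image) {
        init();
        return engine.run("pdf " + image);
    }

    @Override
    public void close() {
        dispose();
    }
}

class GuardedSessionDemo {
    public static void main(String[] args) {
        FakeNativeEngine engine = new FakeNativeEngine();
        try (GuardedSession session = new GuardedSession(engine)) {
            session.doOcr("urkunde.jpg");
            session.createDocument("urkunde.jpg");
            session.dispose(); // explicit dispose; the implicit close() is then a no-op
        }
        // prints [create, ocr urkunde.jpg, pdf urkunde.jpg, destroy]
        System.out.println(engine.calls);
    }
}
```

With this structure the order in which `doOcr` and `createDocument` are called no longer matters, and the handle is created and released exactly once per session, however many times `init()` or `dispose()` is invoked.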