I am extracting text from an image-based PDF using Tess4J. There are two steps involved. 1) Convert the PDF to an image:

PdfUtilities.convertPdf2Png(inputfilepath);

This works without any issues. 2) Extract text from the image:

try {
    if (imgName.endsWith(".png")) {
        ITesseract instance = new Tesseract();
        instance.setDatapath("tessdataPath");
        extractedData = instance.doOCR(Image);
    }
} catch (Exception e2) {
    System.out.println("exception:" + e2.getMessage());
}

While doing this, I get the exception below for one specific image file.

Exception in thread "main" java.lang.Error: Invalid memory access
    at com.sun.jna.Native.invokePointer(Native Method)
    at com.sun.jna.Function.invokePointer(Function.java:470)
    at com.sun.jna.Function.invoke(Function.java:404)
    at com.sun.jna.Function.invoke(Function.java:315)
    at com.sun.jna.Library$Handler.invoke(Library.java:212)
    at com.sun.proxy.$Proxy1.TessBaseAPIGetUTF8Text(Unknown Source)
    at net.sourceforge.tess4j.Tesseract.getOCRText(Unknown Source)
    at net.sourceforge.tess4j.Tesseract.doOCR(Unknown Source)
    at net.sourceforge.tess4j.Tesseract.doOCR(Unknown Source)
    at net.sourceforge.tess4j.Tesseract.doOCR(Unknown Source)
    at com.tcs.textExtraction.ImgToText.imagetoText(ImgToText.java:109)
    at com.tcs.textExtraction.ImgToText.main(ImgToText.java:31)
split_pt >0 && split_pt < word->chopped_word->NumBlobs():Error:Assert failed:in file ..\..\ccmain\tfacepp.cpp, line 186

I have included the following jars: jna.jar, log4j-1.2.17.jar, pdfbox-1.8.13.jar, tess4j.jar, commons-logging-1.1.3.jar, fontbox-1.8.13.jar, ghost4j-0.5.1.jar, itext-2.1.7.jar, jai_imageio.jar.

My tessdata directory contains the following files: pdf.ttf, pdf.ttx, eng.traineddata, osd.traineddata.

  • Not related to your problem but important anyway: your PDFBox version is outdated. 1.8.13 is current in the 1.8 branch. Not 1.8.1 and not 1.8.4. And using two different commons-logging versions is also weird. – Tilman Hausherr Mar 03 '17 at 09:10
  • I get 4 Stack Overflow hits by entering in Google: tess4j Invalid memory access. Did none help you? – Tilman Hausherr Mar 05 '17 at 20:20
  • Thanks for your attention. I now have PDFBox 1.8.13 and have removed one version of commons-logging (1.1.2). I also went through the other links for this question: 1) http://stackoverflow.com/questions/19894890/tess4j-invalid-memory-access 2) http://stackoverflow.com/questions/35295582/tess4j-memory-access-error-in-tess4j-java 3) http://stackoverflow.com/questions/32421492/java-tess4j-doocr-not-workin-exception-invalid-memory-access But none resolved the issue. – Shankramma Patil Mar 06 '17 at 06:18
  • I have modified the question to reflect the latest code and error and jars used. – Shankramma Patil Mar 06 '17 at 07:02
  • Possible duplicate of [Tess4J: Invalid memory access](http://stackoverflow.com/questions/19894890/tess4j-invalid-memory-access) – Raedwald Mar 06 '17 at 07:59
  • I checked that solution, however it did not resolve my problem. – Shankramma Patil Mar 06 '17 at 13:54
  • Make sure you `setDatapath` to the parent directory of `tessdata` directory. – nguyenq Mar 07 '17 at 23:34
  • I have tried this, but no luck. – Shankramma Patil Mar 08 '17 at 12:20
  • I had the same issue. I solved it by converting the input image to a different RGB format (I don't remember which one). – Radim Burget Jun 08 '17 at 13:02
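
The last comment suggests re-encoding the image before OCR but does not say which format worked. A minimal sketch of that idea follows, assuming plain 24-bit `BufferedImage.TYPE_INT_RGB` as the target (a guess, not the commenter's confirmed choice). The class and method names `RgbConverter` and `toRgb` are hypothetical, and whether this avoids the assertion failure for a given file is untested.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

// Hypothetical helper: redraws any BufferedImage into a plain RGB image.
final class RgbConverter {

    private RgbConverter() {
    }

    // Returns a TYPE_INT_RGB copy of src with the same dimensions.
    // TYPE_INT_RGB is an assumption; the comment above does not name the format.
    static BufferedImage toRgb(BufferedImage src) {
        BufferedImage rgb = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_INT_RGB);
        Graphics2D g = rgb.createGraphics();
        try {
            // Paint a white background first so transparent pixels do not turn black.
            g.setColor(Color.WHITE);
            g.fillRect(0, 0, rgb.getWidth(), rgb.getHeight());
            g.drawImage(src, 0, 0, null);
        } finally {
            g.dispose();
        }
        return rgb;
    }
}
```

The converted image could then be read with `ImageIO.read(...)`, passed through `RgbConverter.toRgb(...)`, and handed to `instance.doOCR(BufferedImage)` in place of the original file.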
