I am extracting text from an image-based PDF using Tess4J. There are two steps involved:
1) Convert the PDF to an image:
PdfUtilities.convertPdf2Png(inputfilepath);
This works without any issues.
2) Extract text from the image:
try {
    if (imgName.endsWith(".png")) {
        ITesseract instance = new Tesseract();
        // Folder that contains the *.traineddata files
        instance.setDatapath("tessdataPath");
        extractedData = instance.doOCR(Image);
    }
} catch (Exception e2) {
    System.out.println("exception:" + e2.getMessage());
}
While doing this, I get the exception below for a specific image file.
Exception in thread "main" java.lang.Error: Invalid memory access
at com.sun.jna.Native.invokePointer(Native Method)
at com.sun.jna.Function.invokePointer(Function.java:470)
at com.sun.jna.Function.invoke(Function.java:404)
at com.sun.jna.Function.invoke(Function.java:315)
at com.sun.jna.Library$Handler.invoke(Library.java:212)
at com.sun.proxy.$Proxy1.TessBaseAPIGetUTF8Text(Unknown Source)
at net.sourceforge.tess4j.Tesseract.getOCRText(Unknown Source)
at net.sourceforge.tess4j.Tesseract.doOCR(Unknown Source)
at net.sourceforge.tess4j.Tesseract.doOCR(Unknown Source)
at net.sourceforge.tess4j.Tesseract.doOCR(Unknown Source)
at com.tcs.textExtraction.ImgToText.imagetoText(ImgToText.java:109)
at com.tcs.textExtraction.ImgToText.main(ImgToText.java:31)
split_pt >0 && split_pt < word->chopped_word->NumBlobs():Error:Assert failed:in file ..\..\ccmain\tfacepp.cpp, line 186
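For context on why the catch block in step 2 prints nothing: the trace reports java.lang.Error, which is not a subclass of Exception, so a catch (Exception ...) clause does not handle it and it reaches the top of the thread. The sketch below illustrates only that language rule using the standard library. The class and method names are hypothetical, and the thrown Error is a stand-in for the failure raised through JNA, not a reproduction of it.

```java
// Hypothetical demo class; the names are illustrative and not part of Tess4J or JNA.
class ErrorVsException {

    // Runs the task and reports which catch clause handled the failure.
    static String handledBy(Runnable task) {
        try {
            try {
                task.run();
                return "nothing thrown";
            } catch (Exception e) {
                // Same shape as the catch block used around doOCR in step 2.
                return "catch (Exception)";
            }
        } catch (Throwable t) {
            // java.lang.Error ends up here because it does not extend Exception.
            return "catch (Throwable)";
        }
    }

    public static void main(String[] args) {
        // Stand-in for the native failure; the real one originates in native code.
        System.out.println(handledBy(() -> { throw new Error("Invalid memory access"); }));
        // An ordinary runtime exception, for comparison.
        System.out.println(handledBy(() -> { throw new IllegalStateException("ordinary failure"); }));
    }
}
```

Catching Throwable here only makes the failure visible to the calling code; it does not address whatever triggers the assertion inside Tesseract for that image.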
I have included the following jars: jna.jar, log4j-1.2.17.jar, pdfbox-1.8.13.jar, tess4j.jar, commons-logging-1.1.3.jar, fontbox-1.8.13.jar, ghost4j-0.5.1.jar, itext-2.1.7.jar, jai_imageio.jar
My tessdata folder has the following files: pdf.ttf, pdf.ttx, eng.traineddata, osd.traineddata