Issue using Apache Tika parser when parsing a PDF that contains text inside images

Question

I am using these two dependencies: tika-core 2.6.0 and tika-parsers-standard-package 2.6.0. Parsing works fine for these cases: PDF files with text, PDF files with images, and text files and other formats.

Parsing fails with a runtime exception from PDFParser in the following case: a PDF file with text inside images.

Can someone please suggest how to resolve the failing case? Thanks.

Full error stack trace:

org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@2d539b25
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:312) ~[org.apache.tika.tika-core-2.6.0.jar:2.6.0]
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[org.apache.tika.tika-core-2.6.0.jar:2.6.0]
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:175) ~[org.apache.tika.tika-core-2.6.0.jar:2.6.0]
Caused by: java.lang.NullPointerException
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.doOCROnCurrentPage(AbstractPDF2XHTML.java:520) ~[org.apache.tika.tika-parsers-standard-package-2.6.0.jar:2.6.0]
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:786) ~[org.apache.tika.tika-parsers-standard-package-2.6.0.jar:2.6.0]
    at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:154) ~[org.apache.tika.tika-parsers-standard-package-2.6.0.jar:2.6.0]
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:365) ~[org.apache.pdfbox.pdfbox-2.0.27.jar:2.0.27]
    at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:137) ~[org.apache.tika.tika-parsers-standard-package-2.6.0.jar:2.6.0]
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:1277) ~[org.apache.tika.tika-parsers-standard-package-2.6.0.jar:2.6.0]
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:238) ~[org.apache.pdfbox.pdfbox-2.0.27.jar:2.0.27]
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:108) ~[org.apache.tika.tika-parsers-standard-package-2.6.0.jar:2.6.0]
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:198) ~[org.apache.tika.tika-parsers-standard-package-2.6.0.jar:2.6.0]
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[org.apache.tika.tika-core-2.6.0.jar:2.6.0]
    ... 37 more

500 internal server error:- org.apache.tika.exception.TikaException: Unexpected Runtime exception from org.apache.parser.pdf.PDFParser@12345 — DeadPool, Nov 14 '22 at 05:00
This issue does not occur with tika-parsers 1.28.5 and tika-core 1.28.5. As part of the upgrade, I need to move these versions to 2.6.0 and follow the modular approach Tika currently uses. – DeadPool, Nov 14 '22 at 13:08
I am not able to edit the comment above. This is the exception, @Gagravarr: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@350ab3f1 – DeadPool, Nov 14 '22 at 14:02
Can you provide more of the stacktrace? Is that your server error or ours in tika-server? — Tim Allison, Nov 15 '22 at 22:20
Hi @TimAllison, this is a Tika error that I am getting from the PDF parser. I will see if I can get the whole stack trace. – DeadPool, Nov 16 '22 at 14:58
Hi Tim, this is the whole stack trace :- org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@2d539b25 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:312) ~[org.apache.tika.tika-core-2.6.0.jar:2.6.0] at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[org.apache.tika.tika-core-2.6.0.jar:2.6.0] at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:175) ~[org.apache.tika.tika-core-2.6.0.jar:2.6.0] — DeadPool, Nov 22 '22 at 15:47
Caused by: java.lang.NullPointerException at org.apache.tika.parser.pdf.AbstractPDF2XHTML.doOCROnCurrentPage(AbstractPDF2XHTML.java:520) ~[org.apache.tika.tika-parsers-standard-package-2.6.0.jar:2.6.0] at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:786) ~[org.apache.tika.tika-parsers-standard-package-2.6.0.jar:2.6.0] at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:154) ~[org.apache.tika.tika-parsers-standard-package-2.6.0.jar:2.6.0] — DeadPool, Nov 22 '22 at 15:49
at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:365) ~[org.apache.pdfbox.pdfbox-2.0.27.jar:2.0.27] at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:137) ~[org.apache.tika.tika-parsers-standard-package-2.6.0.jar:2.6.0] at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:1277) ~[org.apache.tika.tika-parsers-standard-package-2.6.0.jar:2.6.0] at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:238) ~[org.apache.pdfbox.pdfbox-2.0.27.jar:2.0.27] — DeadPool, Nov 22 '22 at 15:49
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:108) ~[org.apache.tika.tika-parsers-standard-package-2.6.0.jar:2.6.0] at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:198) ~[org.apache.tika.tika-parsers-standard-package-2.6.0.jar:2.6.0] at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[org.apache.tika.tika-core-2.6.0.jar:2.6.0] ... 37 more — DeadPool, Nov 22 '22 at 15:49
Please note that the same code works fine if I change the dependencies from tika-parsers-standard-package 2.6.0 to tika-parsers 1.28.5 and from tika-core 2.6.0 to tika-core 1.28.5. – DeadPool, Nov 22 '22 at 15:51
The issue is fixed now. It expects a Parser.class instance for scanned images; I have passed that in the context. Thanks. – DeadPool, Nov 23 '22 at 11:55
@DeadPool Do we need to install the Tesseract OCR engine separately, or is adding tika-core and the Tika parsers enough? – Seriously, Apr 26 '23 at 02:58
@Seriously Sorry for the delayed reply. Yes, we need to install Tesseract in addition to the Tika parsers and core, and also set the path configuration if required. – DeadPool, May 09 '23 at 10:14
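
The fix described in the comments above (passing a Parser instance in the ParseContext) might look roughly like the sketch below. It assumes tika-core and tika-parsers-standard-package 2.6.0 are on the classpath and that Tesseract is installed. The class name ScannedPdfExample and the extract helper are made up for illustration, and the inline-image settings are borrowed from the configuration mentioned elsewhere in this thread, not from DeadPool's actual code.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical example class; not taken from the original poster's code.
class ScannedPdfExample {

    static String extract(Path pdf) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();

        // Inline-image settings as quoted in the thread (an assumption here).
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(true);
        pdfConfig.setExtractUniqueInlineImagesOnly(true);

        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, pdfConfig);
        // The step reported as the fix: register a Parser in the context so
        // that embedded images and OCR work have a parser to delegate to.
        context.set(Parser.class, parser);

        // -1 disables the default write limit on extracted characters.
        BodyContentHandler handler = new BodyContentHandler(-1);
        try (InputStream in = Files.newInputStream(pdf)) {
            parser.parse(in, handler, new Metadata(), context);
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extract(Paths.get(args[0])));
    }
}
```

Registering the parser under Parser.class gives the PDF parser something to hand embedded images to, which appears to be what the OCR step in the stack trace was missing.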

score 0 · Answer 1 · answered Nov 15 '22 at 14:19


You should use a different PDFParserConfig depending on the kind of file. There are two types of PDF files:

Native files (also called searchable): Tika can extract text from these without OCR.

PDFParserConfig pdfParserConfig = new PDFParserConfig();
pdfParserConfig.setExtractInlineImages(false);
pdfParserConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.NO_OCR);

Scanned files (or images converted to PDF): Tika has to run OCR (using Tesseract under the hood).

PDFParserConfig pdfParserConfig = new PDFParserConfig();
pdfParserConfig.setExtractInlineImages(true);
pdfParserConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.OCR_ONLY);

answered Nov 15 '22 at 14:19

marek.kapowicki


Currently I am using this configuration: pdfParserConfig.setExtractInlineImages(true) pdfParserConfig.setExtractUniqueInlineImagesOnly(true). It works fine for text inside a PDF and for images inside a PDF, but fails for text embedded in an image in a PDF. If I update the configuration to pdfParserConfig.setExtractInlineImages(true) pdfParserConfig.setExtractUniqueInlineImagesOnly(true) pdfParserConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.OCR_ONLY), it fails for all types of PDF files. – DeadPool Nov 15 '22 at 18:33
I'm using both. The first attempt is to extract text from the PDF without OCR. It works for native/searchable PDFs. If it works, I return the extracted text. If not, I know that the file is a scan and OCR is required. This is the best solution that I've found so far. – marek.kapowicki Nov 17 '22 at 09:51
Hi Marek, may I know which dependencies and versions you are using? I have implemented the second configuration and am trying it with scanned files only. It gives the same error. – DeadPool Nov 22 '22 at 12:47
The issue is fixed now. It expects a Parser.class instance for scanned images; I have passed that in the context. Thanks. – DeadPool Nov 23 '22 at 11:55
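
The two-pass approach marek.kapowicki describes in the comments (try plain text extraction first, fall back to OCR only when that yields nothing useful) could be sketched as below. This is a library-independent illustration: FallbackExtractor, its two Function arguments, and the minChars threshold are all hypothetical names, and in practice each function would wrap a Tika parse call with the NO_OCR or OCR_ONLY configuration from the answer. Treating "too little text" as "this is a scan" is an assumption about what "if it works" means.

```java
import java.util.function.Function;

// Hypothetical helper illustrating the "text first, OCR as fallback" idea.
final class FallbackExtractor {
    private final Function<byte[], String> textOnly; // e.g. a parse with NO_OCR
    private final Function<byte[], String> ocr;      // e.g. a parse with OCR_ONLY
    private final int minChars;                      // assumed threshold for "has text"

    FallbackExtractor(Function<byte[], String> textOnly,
                      Function<byte[], String> ocr,
                      int minChars) {
        this.textOnly = textOnly;
        this.ocr = ocr;
        this.minChars = minChars;
    }

    String extract(byte[] pdfBytes) {
        // First pass: cheap extraction without OCR.
        String text = textOnly.apply(pdfBytes);
        if (text != null && text.trim().length() >= minChars) {
            return text;
        }
        // Little or no text came back, so assume a scanned PDF and run OCR.
        return ocr.apply(pdfBytes);
    }
}
```

Running OCR only as a fallback keeps native PDFs fast, at the cost of parsing scanned files twice.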
