I am using these two dependencies: tika-core 2.6.0 and tika-parsers-standard-package 2.6.0. Parsing works fine for these cases: PDF files with text, PDF files with images, text files, and other extensions.

Parsing fails with a RuntimeException from PDFParser for this case: PDF files with text inside images.

Can someone please suggest how to resolve the failing case? Thanks.

Full error stack trace:

    org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@2d539b25
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:312) ~[org.apache.tika.tika-core-2.6.0.jar:2.6.0]
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[org.apache.tika.tika-core-2.6.0.jar:2.6.0]
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:175) ~[org.apache.tika.tika-core-2.6.0.jar:2.6.0]
    Caused by: java.lang.NullPointerException
        at org.apache.tika.parser.pdf.AbstractPDF2XHTML.doOCROnCurrentPage(AbstractPDF2XHTML.java:520) ~[org.apache.tika.tika-parsers-standard-package-2.6.0.jar:2.6.0]
        at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:786) ~[org.apache.tika.tika-parsers-standard-package-2.6.0.jar:2.6.0]
        at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:154) ~[org.apache.tika.tika-parsers-standard-package-2.6.0.jar:2.6.0]
        at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:365) ~[org.apache.pdfbox.pdfbox-2.0.27.jar:2.0.27]
        at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:137) ~[org.apache.tika.tika-parsers-standard-package-2.6.0.jar:2.6.0]
        at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:1277) ~[org.apache.tika.tika-parsers-standard-package-2.6.0.jar:2.6.0]
        at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:238) ~[org.apache.pdfbox.pdfbox-2.0.27.jar:2.0.27]
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:108) ~[org.apache.tika.tika-parsers-standard-package-2.6.0.jar:2.6.0]
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:198) ~[org.apache.tika.tika-parsers-standard-package-2.6.0.jar:2.6.0]
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[org.apache.tika.tika-core-2.6.0.jar:2.6.0]
        ... 37 more

DeadPool
  • What's the exception you get? – Gagravarr Nov 14 '22 at 03:50
  • 500 Internal Server Error: org.apache.tika.exception.TikaException: Unexpected Runtime exception from org.apache.parser.pdf.PDFParser@12345 – DeadPool Nov 14 '22 at 05:00
  • This issue does not occur with tika-parsers 1.28.5 and tika-core 1.28.5. As part of the upgrade, I need to move these versions to 2.6.0 and follow the modular approach Tika currently uses. – DeadPool Nov 14 '22 at 13:08
  • I am not able to edit the comment above. This is the exception, @Gagravarr: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@350ab3f1 – DeadPool Nov 14 '22 at 14:02
  • Can you provide more of the stacktrace? Is that your server error or ours in tika-server? – Tim Allison Nov 15 '22 at 22:20
  • Hi @TimAllison, this is a Tika error I am getting from the PDF parser. I will see if I can get the whole stack trace. – DeadPool Nov 16 '22 at 14:58
  • Hi Tim, this is the whole stack trace: the same TikaException from org.apache.tika.parser.pdf.PDFParser@2d539b25, caused by a java.lang.NullPointerException at org.apache.tika.parser.pdf.AbstractPDF2XHTML.doOCROnCurrentPage(AbstractPDF2XHTML.java:520), as shown in full in the question above. – DeadPool Nov 22 '22 at 15:47
  • Please note that the same code works fine if I change the dependencies from tika-parsers-standard-package 2.6.0 to tika-parsers 1.28.5 and from tika-core 2.6.0 to tika-core 1.28.5. – DeadPool Nov 22 '22 at 15:51
  • The issue is fixed now. Tika expects a Parser instance (registered under Parser.class) for scanned images. I have passed that in the context. Thanks – DeadPool Nov 23 '22 at 11:55
  • @DeadPool do we need to install the Tesseract OCR engine separately, or is adding tika-core and the Tika parsers enough? – Seriously Apr 26 '23 at 02:58
  • @Seriously sorry for the delayed reply. Yes, Tesseract needs to be installed in addition to the Tika parser and core dependencies, and its path configured if required. – DeadPool May 09 '23 at 10:14
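
The fix reported in the comments above (passing a Parser instance in the context) might look like the configuration sketch below. It assumes that "context" refers to Tika's ParseContext, and it reuses the OCR_ONLY settings quoted elsewhere in this thread. The class name ScannedPdfExample and the command-line argument are illustrative only, and the sketch needs tika-core and tika-parsers-standard-package on the classpath, plus a Tesseract installation for OCR to produce text.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical example class; not taken from the original post.
public class ScannedPdfExample {

    public static void main(String[] args) throws Exception {
        // Path to a scanned PDF, supplied by the caller.
        Path pdf = Paths.get(args[0]);

        Parser parser = new AutoDetectParser();

        // PDF settings as discussed in the thread.
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(true);
        pdfConfig.setExtractUniqueInlineImagesOnly(true);
        pdfConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.OCR_ONLY);

        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, pdfConfig);
        // The step reported as the fix: register a Parser under Parser.class
        // so the PDF parser can hand rendered pages and embedded images to it.
        context.set(Parser.class, parser);

        BodyContentHandler handler = new BodyContentHandler(-1); // -1 disables the write limit
        Metadata metadata = new Metadata();

        try (InputStream in = Files.newInputStream(pdf)) {
            parser.parse(in, handler, metadata, context);
        }

        System.out.println(handler.toString());
    }
}
```

Registering the same AutoDetectParser that drives the top-level parse is a common choice, since it lets embedded content be routed through the full set of parsers on the classpath.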

1 Answer

You should use a different PDFParserConfig for each kind of PDF. There are two types of PDF files:

  1. Native files (also called searchable): Tika is able to extract text from these without OCR.

    // Native PDF: rely on the embedded text layer and skip OCR
    PDFParserConfig pdfParserConfig = new PDFParserConfig();
    pdfParserConfig.setExtractInlineImages(false);
    pdfParserConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.NO_OCR);

  2. Scanned files (or images converted to PDF): Tika has to run OCR (using Tesseract under the hood).

    // Scanned PDF: extract inline images and run OCR only
    PDFParserConfig pdfParserConfig = new PDFParserConfig();
    pdfParserConfig.setExtractInlineImages(true);
    pdfParserConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.OCR_ONLY);

marek.kapowicki
  • Currently I am using this configuration: pdfParserConfig.setExtractInlineImages(true); pdfParserConfig.setExtractUniqueInlineImagesOnly(true). It works for text inside a PDF and for images inside a PDF, but fails for text embedded in an image inside a PDF. If I update the configuration to pdfParserConfig.setExtractInlineImages(true); pdfParserConfig.setExtractUniqueInlineImagesOnly(true); pdfParserConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.OCR_ONLY), it fails for all types of PDF files. – DeadPool Nov 15 '22 at 18:33
  • I'm using both. The first attempt is to extract text from the PDF without OCR, which works for native/searchable PDFs. If it works, I return the extracted text. If not, I know that the file is a scan and OCR is required. This is the best solution that I've found so far. – marek.kapowicki Nov 17 '22 at 09:51
  • Hi Marek, may I know which dependencies and versions you are using? I have implemented the second option and am trying it with scanned files only. It gives the same error. – DeadPool Nov 22 '22 at 12:47
  • The issue is fixed now. Tika expects a Parser instance (registered under Parser.class) for scanned images. I have passed that in the context. Thanks – DeadPool Nov 23 '22 at 11:55
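
The two-pass approach marek.kapowicki describes in the comments (try plain text extraction first, fall back to OCR when nothing comes back) can be sketched roughly as follows. This is a library-independent illustration, not code from the thread: TwoPassExtractor and both extractor functions are hypothetical names, and the "empty result means scanned file" rule is an assumption about how the check is done.

```java
import java.util.function.Function;

/** Two-pass text extraction: try the cheap extractor first, fall back to OCR. */
final class TwoPassExtractor {

    private TwoPassExtractor() { }

    /**
     * Runs textExtractor first. If it yields null or only whitespace,
     * the document is assumed to be a scan and ocrExtractor is used instead.
     */
    static String extract(byte[] pdfBytes,
                          Function<byte[], String> textExtractor,
                          Function<byte[], String> ocrExtractor) {
        String text = textExtractor.apply(pdfBytes);
        if (text != null && !text.trim().isEmpty()) {
            return text; // native/searchable PDF: no OCR needed
        }
        return ocrExtractor.apply(pdfBytes); // nothing extracted: treat as a scan
    }

    public static void main(String[] args) {
        // Stand-in extractors for demonstration only; real ones would call the parser.
        Function<byte[], String> noTextLayer = bytes -> "   ";
        Function<byte[], String> ocr = bytes -> "text recovered by OCR";
        System.out.println(extract(new byte[0], noTextLayer, ocr));
    }
}
```

With Tika, the first function would wrap a parse configured with NO_OCR and the second a parse configured with OCR_ONLY, so the slower OCR pass only runs for files that have no usable text layer.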