
I have a few PDFs that I was able to parse until a few days ago using tika.

I have not changed anything in my code, but I can no longer view the content of those same PDFs when I run the code below:

from tika import parser

raw = parser.from_file('reits.pdf', 'http://localhost:9998/tika')
print(raw['content'])

Until recently this worked fine with the latest version of tika, installed with: conda install -c conda-forge tika

It seems like the problem is that Java is not activating. I get the error below when I look at the metadata:

'X-TIKA:EXCEPTION:runtime': 'java.lang.NullPointerException\n\tat

I am not sure if the below is helpful, but the metadata also returned:

'X-Parsed-By': ['org.apache.tika.parser.DefaultParser','org.apache.tika.parser.pdf.PDFParser']
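For reference, here is a minimal sketch of how the exception entries could be pulled out of the parsed result without scanning the whole metadata by eye. It assumes the result is a dict with a 'metadata' key, as in the code above. The helper name find_tika_exceptions and the sample dict are hypothetical and only illustrate the shape of the data.

```python
def find_tika_exceptions(parsed):
    """Return the metadata entries whose key starts with 'X-TIKA:EXCEPTION'."""
    # 'metadata' may be missing or None when parsing failed entirely.
    metadata = parsed.get("metadata") or {}
    return {
        key: value
        for key, value in metadata.items()
        if key.startswith("X-TIKA:EXCEPTION")
    }


# Hypothetical sample shaped like the metadata quoted above.
sample = {
    "content": None,
    "metadata": {
        "X-TIKA:EXCEPTION:runtime": "java.lang.NullPointerException",
        "X-Parsed-By": [
            "org.apache.tika.parser.DefaultParser",
            "org.apache.tika.parser.pdf.PDFParser",
        ],
    },
}

for key, value in find_tika_exceptions(sample).items():
    print(key, "->", value)
```

With the real result, the call would be find_tika_exceptions(raw) right after parser.from_file(...).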

What can I do to get tika to start working again?

In case this helps:

The full exception stack trace is included below:

org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@3b75f5cb
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:293)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
    at org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:84)
    at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:358)
    at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:309)
    at org.apache.tika.gui.ParsingTransferHandler.importFiles(ParsingTransferHandler.java:94)
    at org.apache.tika.gui.ParsingTransferHandler.importData(ParsingTransferHandler.java:77)
    at javax.swing.TransferHandler.importData(TransferHandler.java:827)
    at javax.swing.TransferHandler$DropHandler.drop(TransferHandler.java:1544)
    at java.awt.dnd.DropTarget.drop(DropTarget.java:455)
    at javax.swing.TransferHandler$SwingDropTarget.drop(TransferHandler.java:1282)
    at sun.awt.dnd.SunDropTargetContextPeer.processDropMessage(SunDropTargetContextPeer.java:538)
    at sun.lwawt.macosx.CDropTargetContextPeer.processDropMessage(CDropTargetContextPeer.java:143)
    at sun.awt.dnd.SunDropTargetContextPeer$EventDispatcher.dispatchDropEvent(SunDropTargetContextPeer.java:852)
    at sun.awt.dnd.SunDropTargetContextPeer$EventDispatcher.dispatchEvent(SunDropTargetContextPeer.java:776)
    at sun.awt.dnd.SunDropTargetEvent.dispatch(SunDropTargetEvent.java:48)
    at java.awt.Component.dispatchEventImpl(Component.java:4744)
    at java.awt.Container.dispatchEventImpl(Container.java:2297)
    at java.awt.Component.dispatchEvent(Component.java:4711)
    at java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4904)
    at java.awt.LightweightDispatcher.processDropTargetEvent(Container.java:4609)
    at java.awt.LightweightDispatcher.dispatchEvent(Container.java:4471)
    at java.awt.Container.dispatchEventImpl(Container.java:2283)
    at java.awt.Window.dispatchEventImpl(Window.java:2746)
    at java.awt.Component.dispatchEvent(Component.java:4711)
    at java.awt.EventQueue.dispatchEventImpl(EventQueue.java:760)
    at java.awt.EventQueue.access$500(EventQueue.java:97)
    at java.awt.EventQueue$3.run(EventQueue.java:709)
    at java.awt.EventQueue$3.run(EventQueue.java:703)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:74)
    at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:84)
    at java.awt.EventQueue$4.run(EventQueue.java:733)
    at java.awt.EventQueue$4.run(EventQueue.java:731)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:74)
    at java.awt.EventQueue.dispatchEvent(EventQueue.java:730)
    at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:205)
    at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:116)
    at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:105)
    at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:101)
    at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:93)
    at java.awt.EventDispatchThread.run(EventDispatchThread.java:82)
Caused by: java.lang.NullPointerException
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractXMPXFA(AbstractPDF2XHTML.java:209)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endDocument(AbstractPDF2XHTML.java:678)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:267)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:96)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:174)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
... 44 more

EDIT: I was able to get Tika to work by following this answer.

Specifically, I changed into the directory where I had downloaded the Tika server jar and ran: java -jar tika-server-x.x.jar -h 0.0.0.0

Once I ran that on the command line, the server started, my code worked, and I could view the content.

How can I make sure that Tika in python automatically opens the server to avoid this manual workaround? Is there an environment variable that I need to set?
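One possible way to script the manual step, as a rough sketch only: check whether anything is listening on port 9998 and, if not, launch the same java -jar command from Python before calling the parser. The helper names (server_command, port_is_open, ensure_server) are hypothetical, the jar path is whatever file was downloaded, and this does not rely on any tika-python feature.

```python
import socket
import subprocess
import time


def server_command(jar_path, host="0.0.0.0"):
    """Build the same command line that was run by hand above."""
    return ["java", "-jar", jar_path, "-h", host]


def port_is_open(host, port, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def ensure_server(jar_path, port=9998, host="localhost", wait_seconds=30):
    """Start the server jar if nothing is listening on the port yet.

    Returns the Popen handle of the started process, or None if a
    server was already reachable.
    """
    if port_is_open(host, port):
        return None  # already running, nothing to start
    proc = subprocess.Popen(server_command(jar_path))
    deadline = time.monotonic() + wait_seconds
    while time.monotonic() < deadline:
        if port_is_open(host, port):
            return proc
        time.sleep(0.5)
    proc.terminate()
    raise RuntimeError("server did not start listening in time")
```

Calling ensure_server('tika-server-x.x.jar') before parser.from_file(...) would then replace the manual command, assuming java is on the PATH and the jar path is correct.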

santorch
  • What version of Tika are you running? If you try the standalone Tika App runnable jar, does that give you the content on those files, or an error? – Gagravarr May 17 '20 at 03:10
  • With the runnable Jar I got the same Java nullpoint error. Tika version 1.24 – santorch May 17 '20 at 03:15
  • Can you post the full error stacktrace? (The python wrapper on the server might hide it, but the Tika App jar ought to spit the whole thing out!). Just edit your question and put the stacktrace in there – Gagravarr May 17 '20 at 03:46
  • Sure thing! Just edited the question to include it – santorch May 17 '20 at 12:41
  • @Gagravarr I was able to get the Python code to work, after manually starting the tika server, by using the steps I just posted to the edits. Do you know how I can have this working as it should in Python? I.e. automatically – santorch May 17 '20 at 17:37
  • That stacktrace looks like you've found an Apache Tika / Apache PDFBox bug. I'd suggest you raise [a bug with Apache Tika](https://issues.apache.org/jira/browse/TIKA) and attach a problem file – Gagravarr May 18 '20 at 00:48
  • @Gagravarr For my own knowledge - what from the stacktrace indicates the bug? – santorch May 18 '20 at 01:06
  • Normally working happy code doesn't throw uncaught Null Pointer Exceptions! – Gagravarr May 18 '20 at 05:11

1 Answer

I was getting the same error. The fix in my case was to kill the process running on port 9998 and then start the server again.

I was running python3 on my Google Compute Engine instance and had to restart it. After that, the tika server somehow became stable.

Below is the fix:

fuser 9998/tcp
fuser -k 9998/tcp
Leviathan
Imran Khan