I have a few PDFs that I was able to parse until a few days ago using tika.
I have not changed anything in my code, but I am no longer able to view the content of those same PDFs by running the code below:
from tika import parser
raw = parser.from_file('reits.pdf', 'http://localhost:9998/tika')
print(raw['content'])
This was working fine until recently with the latest installation of tika (conda install -c conda-forge tika).
It seems like the problem is that Java is not starting. I get the error below when I look at the metadata:
'X-TIKA:EXCEPTION:runtime': 'java.lang.NullPointerException\n\tat
I am not sure if the below is helpful, but the metadata also contained:
'X-Parsed-By': ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.pdf.PDFParser']
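For reference, entries like the two above can be pulled out of the metadata dictionary with a small filter. This is only a sketch: find_tika_exceptions is a helper name made up for this illustration (not part of tika), and sample_metadata is a hand-written stand-in for raw['metadata'], with the exception value shortened.

```python
def find_tika_exceptions(metadata):
    """Return the metadata entries whose key starts with 'X-TIKA:EXCEPTION'."""
    return {
        key: value
        for key, value in (metadata or {}).items()
        if key.startswith("X-TIKA:EXCEPTION")
    }


# Hand-written sample shaped like the metadata quoted above. With tika this
# would be raw['metadata'] from parser.from_file(...).
sample_metadata = {
    "X-Parsed-By": [
        "org.apache.tika.parser.DefaultParser",
        "org.apache.tika.parser.pdf.PDFParser",
    ],
    "X-TIKA:EXCEPTION:runtime": "java.lang.NullPointerException",
}

for key, value in find_tika_exceptions(sample_metadata).items():
    print(key, "->", value)
```

An empty result would suggest the parser itself did not report an exception, in which case an empty raw['content'] would need another explanation.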
What can I do to get tika to start working again?
In case it helps, the full exception stack trace is included below:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@3b75f5cb
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:293)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
at org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:84)
at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:358)
at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:309)
at org.apache.tika.gui.ParsingTransferHandler.importFiles(ParsingTransferHandler.java:94)
at org.apache.tika.gui.ParsingTransferHandler.importData(ParsingTransferHandler.java:77)
at javax.swing.TransferHandler.importData(TransferHandler.java:827)
at javax.swing.TransferHandler$DropHandler.drop(TransferHandler.java:1544)
at java.awt.dnd.DropTarget.drop(DropTarget.java:455)
at javax.swing.TransferHandler$SwingDropTarget.drop(TransferHandler.java:1282)
at sun.awt.dnd.SunDropTargetContextPeer.processDropMessage(SunDropTargetContextPeer.java:538)
at sun.lwawt.macosx.CDropTargetContextPeer.processDropMessage(CDropTargetContextPeer.java:143)
at sun.awt.dnd.SunDropTargetContextPeer$EventDispatcher.dispatchDropEvent(SunDropTargetContextPeer.java:852)
at sun.awt.dnd.SunDropTargetContextPeer$EventDispatcher.dispatchEvent(SunDropTargetContextPeer.java:776)
at sun.awt.dnd.SunDropTargetEvent.dispatch(SunDropTargetEvent.java:48)
at java.awt.Component.dispatchEventImpl(Component.java:4744)
at java.awt.Container.dispatchEventImpl(Container.java:2297)
at java.awt.Component.dispatchEvent(Component.java:4711)
at java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4904)
at java.awt.LightweightDispatcher.processDropTargetEvent(Container.java:4609)
at java.awt.LightweightDispatcher.dispatchEvent(Container.java:4471)
at java.awt.Container.dispatchEventImpl(Container.java:2283)
at java.awt.Window.dispatchEventImpl(Window.java:2746)
at java.awt.Component.dispatchEvent(Component.java:4711)
at java.awt.EventQueue.dispatchEventImpl(EventQueue.java:760)
at java.awt.EventQueue.access$500(EventQueue.java:97)
at java.awt.EventQueue$3.run(EventQueue.java:709)
at java.awt.EventQueue$3.run(EventQueue.java:703)
at java.security.AccessController.doPrivileged(Native Method)
at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:74)
at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:84)
at java.awt.EventQueue$4.run(EventQueue.java:733)
at java.awt.EventQueue$4.run(EventQueue.java:731)
at java.security.AccessController.doPrivileged(Native Method)
at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:74)
at java.awt.EventQueue.dispatchEvent(EventQueue.java:730)
at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:205)
at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:116)
at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:105)
at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:101)
at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:93)
at java.awt.EventDispatchThread.run(EventDispatchThread.java:82)
Caused by: java.lang.NullPointerException
at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractXMPXFA(AbstractPDF2XHTML.java:209)
at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endDocument(AbstractPDF2XHTML.java:678)
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:267)
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:96)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:174)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
... 44 more
EDIT: I was able to get Tika to work by following this answer.
Specifically, I changed my directory to be where I downloaded the Tika server file, and then ran:
java -jar tika-server-x.x.jar -h 0.0.0.0
Once I ran the above on the command line, the server started, my code worked, and I could view the content.
How can I make sure that Tika in Python starts the server automatically, so that I can avoid this manual workaround? Is there an environment variable that I need to set?
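For what it's worth, the manual step above could probably be scripted from Python. Below is a rough sketch, assuming java is on the PATH and the server listens on port 9998 as in the URL used earlier. The names build_server_command, wait_for_port and start_tika_server are made up for this illustration and are not part of tika, and the jar file name is the same placeholder as in the command above.

```python
import socket
import subprocess
import time


def build_server_command(jar_path, host="0.0.0.0"):
    """Build the same command line as the manual workaround."""
    return ["java", "-jar", jar_path, "-h", host]


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Nothing is listening yet; wait a little and try again.
            time.sleep(interval)
    return False


def start_tika_server(jar_path, port=9998):
    """Launch the server jar in the background and wait until the port is open."""
    process = subprocess.Popen(build_server_command(jar_path))
    if not wait_for_port("localhost", port):
        process.terminate()
        raise RuntimeError("Tika server did not start listening in time")
    return process


# Hypothetical usage (the jar name is a placeholder, as in the command above):
# server = start_tika_server("tika-server-x.x.jar")
# raw = parser.from_file('reits.pdf', 'http://localhost:9998/tika')
# server.terminate()
```

This only automates the workaround. It does not explain why the server is no longer started automatically.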