I am using PDFBox and Tika for content indexing of PDF files. Everything works fine with PDFBox 1.8, but after I updated PDFBox to 2.0.2 I get the error below:

(Thread-62 (HornetQ-client-global-threads-2071379348)) Exception while creating solr doucment for content::Failed to close temporary resources: org.apache.tika.exception.TikaException: Failed to close temporary resources
at org.apache.tika.io.TemporaryResources.dispose(TemporaryResources.java:152)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:149)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)

at org.hornetq.jms.client.JMSMessageListenerWrapper.onMessage(JMSMessageListenerWrapper.java:91)
at org.hornetq.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:983)
at org.hornetq.core.client.impl.ClientConsumerImpl.access$400(ClientConsumerImpl.java:48)
at org.hornetq.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1113)
at org.hornetq.utils.OrderedExecutorFactory$OrderedExecutor$1.run(OrderedExecutorFactory.java:100)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Could not delete temporary file C:\Users\FILESE~1\AppData\Local\Temp\apache-tika-7918716906396425097.tmp
at org.apache.tika.io.TemporaryResources$1.close(TemporaryResources.java:70)
at org.apache.tika.io.TemporaryResources.close(TemporaryResources.java:121)
at org.apache.tika.io.TemporaryResources.dispose(TemporaryResources.java:150)
... 18 more

Can you please help me to resolve this issue?

I updated PDFBox to 2.0.2 because of this.

My Gradle dependencies are:

compile "org.apache.poi:poi:3.8"
compile "org.apache.poi:poi-ooxml:3.8"
compile "org.apache.poi:poi-scratchpad:3.8"
compile "org.apache.pdfbox:pdfbox:2.0.2"

compile 'org.apache.tika:tika-parsers:1.5'
compile 'org.apache.tika:tika-core:1.5'

Here I am using Tika 1.5, and this version supports PDFBox 2.0.3; you can see here.

Nitin
  • How did you do the upgrade? My best guess is you've not done it right – Gagravarr Sep 21 '16 at 08:05
  • I am using Gradle, so I changed the version like this: compile "org.apache.pdfbox:pdfbox:2.0.2", and ran the command gradle clean build. My POI dependencies look like this: compile "org.apache.poi:poi:3.8" compile "org.apache.poi:poi-ooxml:3.8" compile "org.apache.poi:poi-scratchpad:3.8". Note: the builds are generated successfully. – Nitin Sep 21 '16 at 08:30
  • As far as I can see there is no mentioning of PDFBox in the stacktrace, the exception seems to involve only Tika classes. Thus, have you made sure your Tika version gets along with PDFBox 2.0.2? If I correctly read the Tika documentation, support for PDFBox 2.x did not exist in versions before 1.13. – mkl Sep 21 '16 at 08:58
  • I think not, because I generated the build with the clean option. – Nitin Sep 21 '16 at 09:01
  • Yes, Tika 1.5 supports PDFBox 2.0.3. I used PDFBox 2.0.2; should I use 2.0.3? – Nitin Sep 21 '16 at 09:08
  • Really I think you need to upgrade to a newer Tika version that pulls in the new dependencies. Just grabbing a new jar won't work if the missing Tika-side changes aren't there – Gagravarr Sep 21 '16 at 09:44

1 Answer

You use Tika version 1.5 and claim

Tika 1.5 supports pdfbox 2.0.3

This is extremely implausible considering that Tika 1.5 was released in February 2014, long before there was a PDFBox version 2.x, and that PDFBox 2.0.0 is incompatible in multiple ways with the earlier 1.8.x releases.

You point towards the mvnrepository page for Apache Tika Parsers » 1.5 to support your claim. This page shows:

[Screenshot of the mvnrepository page for Apache Tika Parsers » 1.5]

But all this means is that Tika 1.5 has a dependency on PDFBox 1.8.4 and that there now exists a PDFBox version 2.0.3. It does not mean that Tika 1.5 properly functions with PDFBox 2.0.3.

Looking at the pom file you'll see:

<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>1.8.4</version>
</dependency>

Thus, Tika 1.5 has been developed and compiled with PDFBox 1.8.4. If the PDFBox version numbering is sensible, you can hope for Tika 1.5 properly working with any PDFBox 1.8.x from x == 4 onwards.

But PDFBox development took the opportunity to overhaul the PDFBox architecture in their 2.0.0 release. Most likely, therefore, no program depending on a 1.x PDFBox version can function with PDFBox 2.x without changes.

According to the TIKA issue TIKA-1959, Tika can run with PDFBox 2.0.1 since version 1.13.

To make a long story short, therefore, you need at least Tika version 1.13 if you want to use Tika with PDFBox 2.0.x.
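
As a rough sketch of what that could look like in the Gradle build from the question: the fragment below raises both Tika artifacts to 1.13 and drops the explicit PDFBox line. It assumes that tika-parsers then brings in the PDFBox version it was built against as a transitive dependency. The remark about POI is likewise an assumption, not something checked against this particular project.

```groovy
dependencies {
    // Tika 1.13 is the minimum taken from TIKA-1959 as cited above;
    // a later Tika release may be preferable.
    compile 'org.apache.tika:tika-core:1.13'
    compile 'org.apache.tika:tika-parsers:1.13'

    // No explicit "org.apache.pdfbox:pdfbox" line here (assumption):
    // tika-parsers should pull in a matching PDFBox 2.0.x transitively.

    // The POI 3.8 artifacts pinned in the question may also have to be
    // removed or aligned with the versions this Tika release expects.
}
```

Running `gradle dependencies --configuration compile` afterwards lists the versions that were actually resolved, which makes it easy to spot a PDFBox or POI version that does not match what Tika declares.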

mkl
  • Please be aware that the newer Tika may require other updates, too. Version updates can be quite a hell. – mkl Sep 21 '16 at 11:47