Tika with Grobid throwing error when parsing pdf document

Question

I am trying to extract both document metadata and journal header metadata from a PDF document. I verified that Tika Server (v1.21 / v1.24) and Grobid (v0.6.0) can each extract metadata from the PDF on their own. However, when I run Grobid within Tika Server (following the instructions at https://cwiki.apache.org/confluence/display/TIKA/GrobidJournalParser), I get the error below (snippet) for the same PDF document:

org.xml.sax.SAXParseException; Premature end of file.
        at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
        at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
        at org.apache.tika.utils.XMLReaderUtils.buildDOM(XMLReaderUtils.java:407)
        at org.apache.tika.parser.journal.TEIDOMParser.parse(TEIDOMParser.java:44)
        at org.apache.tika.parser.journal.GrobidRESTParser.parse(GrobidRESTParser.java:85)
        at org.apache.tika.parser.journal.JournalParser.parse(JournalParser.java:60)
        at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
        at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:224)
        at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:422)
    ....

I ran the below command to start Tika Server with Grobid:

java -classpath /home/avlurs/grobid-0.6.0/grobidparser-resources/:tika-server-1.21.jar org.apache.tika.server.TikaServerCli --config /home/avlurs/grobid-0.6.0/grobidparser-resources/tika-config.xml &

I ran the below command to test the metadata extraction:

curl -T /home/avlurs/temp/in/JournalTest.pdf -H "Content-Disposition: attachment;filename=JournalTest.pdf" http://localhost:9998/rmeta

Besides throwing the error above, the output does contain the document metadata from Tika, but no Grobid metadata is extracted.

I would appreciate any input or suggestions on how to address this issue. Thanks.

score 0 · Answer 1 · answered Nov 10 '20 at 00:19

The Grobid service moved its API endpoints under /api in July 2017, but Tika's GrobidParser was never updated to use the new location.
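If you want to confirm this against your own Grobid instance, a rough sketch like the one below posts the PDF to both path layouts and prints the HTTP status of each. It assumes Grobid is reachable at http://localhost:8070 (override with GROBID_URL) and that the pre-2017 path was /processHeaderDocument directly under the root; `grobid_header_url` and `probe` are hypothetical helper names, not part of Tika or Grobid.

```shell
#!/bin/sh
# Assumption: Grobid listens on localhost:8070; set GROBID_URL to match your setup.
GROBID_URL="${GROBID_URL:-http://localhost:8070}"

# Build the header-extraction URL for the legacy (pre-/api) or current layout.
# $1 = base URL, $2 = "legacy" or anything else for the current layout.
grobid_header_url() {
  base="${1%/}"   # drop a single trailing slash, if present
  case "$2" in
    legacy) printf '%s/processHeaderDocument\n' "$base" ;;
    *)      printf '%s/api/processHeaderDocument\n' "$base" ;;
  esac
}

# POST a PDF to both layouts and print the HTTP status code of each.
# $1 = path to the PDF. curl reports 000 if the server cannot be reached.
probe() {
  for layout in legacy current; do
    url=$(grobid_header_url "$GROBID_URL" "$layout")
    code=$(curl -s -o /dev/null -w '%{http_code}' -F "input=@$1" "$url")
    printf '%s -> HTTP %s\n' "$url" "$code"
  done
}
```

Running `probe /home/avlurs/temp/in/JournalTest.pdf` after sourcing the snippet should show a success status only for the /api path on a current Grobid. A non-XML reply from the old path would presumably explain the "Premature end of file" error, since Tika then tries to parse that reply as TEI XML.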

I've just committed a fix for this as part of TIKA-3191, which will be released in Tika 1.25. We're hoping to get that out in the next few weeks, but until then you can use a source build or a snapshot build.

I also plan to update the Tika GrobidParser wiki page with more up-to-date instructions that cover the Gradle build and Docker image options Grobid currently offers.

Apache Tika 1.25 has been released now with the fix included. I've also created a docker-compose based example [here](https://github.com/apache/tika-docker/blob/master/docker-compose-tika-grobid.yml) for anyone looking to try it out. — Dave Meikle, Dec 02 '20 at 22:12
