I am trying to extract both document metadata and journal header metadata from a PDF document. I verified that Tika Server (v1.21 / v1.24) and Grobid (v0.6.0) can each extract metadata from the PDF when run independently. However, when I run Grobid within Tika Server (following the instructions at https://cwiki.apache.org/confluence/display/TIKA/GrobidJournalParser), I get the error below (snippet) for the same PDF:
org.xml.sax.SAXParseException; Premature end of file.
at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
at org.apache.tika.utils.XMLReaderUtils.buildDOM(XMLReaderUtils.java:407)
at org.apache.tika.parser.journal.TEIDOMParser.parse(TEIDOMParser.java:44)
at org.apache.tika.parser.journal.GrobidRESTParser.parse(GrobidRESTParser.java:85)
at org.apache.tika.parser.journal.JournalParser.parse(JournalParser.java:60)
at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:224)
at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:422)
....
I ran the below command to start Tika Server with Grobid:
java -classpath /home/avlurs/grobid-0.6.0/grobidparser-resources/:tika-server-1.21.jar org.apache.tika.server.TikaServerCli --config /home/avlurs/grobid-0.6.0/grobidparser-resources/tika-config.xml &
I ran the below command to test the metadata extraction:
curl -T /home/avlurs/temp/in/JournalTest.pdf -H "Content-Disposition: attachment;filename=JournalTest.pdf" http://localhost:9998/rmeta
Besides the error above, the output does contain the document metadata from Tika, but no Grobid metadata is extracted.
I would appreciate any input or suggestions on how to address this issue. Thanks.