I run an unmodified JAX-RS instance of Apache tika-server 1.22 as an HTTP endpoint. Our application sends files to it (mostly Office, PDF and RTF) and gets plain-text renditions back by setting the "Accept: text/plain" header on its HTTP requests.

Since Tika 1.15, the default behaviour has been to "extract all embedded documents" (TIKA-2096).

I want to be able to turn this behaviour off on our tika-server so that embedded documents are NOT extracted and I only get the text rendition of the main document contents.

Is it possible to do this via a tika-config.xml file, or do I need to do a custom build and subclass EmbeddedDocumentExtractor so that it doesn't do anything?

An answer to tika-parser-exclude-pdf-attachments indicates that you can turn this behaviour off by subclassing EmbeddedDocumentExtractor, but I'd like to check if it's possible to do this via tika-config.xml without having to do a custom build of the tika-server.

I have looked at Configuring Tika, but it makes no mention of embedded documents.

1 Answer

The answers in tika-parser-exclude-pdf-attachments are excellent if you are calling Tika from code.

Until now there has been no way to do this for embedded files in Tika Server, other than disabling the whole file type using EmptyParser, with something like the below:

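<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <parsers>
        <!-- Default parsers handle everything except the excluded types -->
        <parser class="org.apache.tika.parser.DefaultParser">
            <mime-exclude>image/jpeg</mime-exclude>
            <mime-exclude>application/zip</mime-exclude>
        </parser>
        <!-- EmptyParser returns no content for those types -->
        <parser class="org.apache.tika.parser.EmptyParser">
            <mime>image/jpeg</mime>
            <mime>application/zip</mime>
        </parser>
    </parsers>
</properties>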

This has become a common request, so I've added a feature, coming in Tika 1.25 (yet to be released), that allows embedded files to be skipped using a header setting:

curl -T test_recursive_embedded.docx http://localhost:9998/tika --header "Accept: text/html" --header "X-Tika-Skip-Embedded: true"

Any parser using the EmbeddedDocumentExtractor will honour this.
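For callers using Java rather than curl, the same request might look roughly like the sketch below. It assumes Java 11+ and the JDK's built-in java.net.http client. The class and method names (TikaSkipEmbeddedPut, buildRequest, extractText) and the base URL are illustrative, not part of Tika. The main method only builds and prints the request, so it runs without a server.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names here are illustrative, not part of Tika.
class TikaSkipEmbeddedPut {

    // Builds a PUT to the /tika endpoint that asks for plain text and sets
    // the X-Tika-Skip-Embedded header described above (Tika Server 1.25+).
    static HttpRequest buildRequest(String baseUrl, byte[] document) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/tika"))
                .header("Accept", "text/plain")
                .header("X-Tika-Skip-Embedded", "true")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    // Sends the request and returns the response body as a UTF-8 string.
    static String extractText(HttpClient client, HttpRequest request) throws Exception {
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString(StandardCharsets.UTF_8));
        return response.body();
    }

    public static void main(String[] args) {
        // Placeholder bytes; a real caller would read the file instead,
        // for example with Files.readAllBytes(path).
        byte[] document = "placeholder".getBytes(StandardCharsets.UTF_8);
        HttpRequest request = buildRequest("http://localhost:9998", document);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(request.headers().map());
        // To call a running server (assumed to listen on localhost:9998):
        // String text = extractText(HttpClient.newHttpClient(), request);
    }
}
```

Keeping request construction separate from sending makes it possible to check the headers without a live tika-server.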

Dave Meikle
  • Tika 1.25 was released on 30th November. We've tested the new "X-Tika-Skip-Embedded" header when calling the Tika endpoint to extract text from documents with embedded documents, and it works a treat. Thank you Dave. – henrythewasp Dec 11 '20 at 11:06
  • One question I do have - when doing a multipart POST to /tika/form, this new header appears to be ignored. Is that expected? – henrythewasp Dec 11 '20 at 12:57
  • Glad to hear the new feature helps. Not sure how you are calling this but from looking at the code, it reads the attachment headers so a call like this passes it on: `curl -F "upload=@testMif.mif;headers='X-Tika-Skip-Embedded: true'" http://localhost:9998/tika/form` IIRC you need a modern version of curl to try this (above 7.58 or 7.59). – Dave Meikle Dec 12 '20 at 13:01
  • Yes, I figured that after reading the code too. We are calling the Tika service endpoint from a Java application with HttpClient - I have changed the existing multi-part POST request to a simpler PUT request and it all works as expected. – henrythewasp Dec 13 '20 at 14:28
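For callers that need to stay on the multipart /tika/form endpoint, the comments above suggest the header has to travel on the file part itself rather than on the outer request. The sketch below hand-builds such a body in Java, roughly mirroring what curl's headers= option is described as doing. The class and method names (TikaMultipartSketch, buildBody, contentType) are hypothetical, the "upload" field name is taken from the curl example, and whether a given Tika Server version honours the part header is an assumption based on that comment.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names here are illustrative, not part of Tika.
class TikaMultipartSketch {

    // Value for the outer request's Content-Type header.
    static String contentType(String boundary) {
        return "multipart/form-data; boundary=" + boundary;
    }

    // Builds a multipart/form-data body with a single file part that carries
    // X-Tika-Skip-Embedded as a per-part header.
    static byte[] buildBody(String boundary, String fileName, byte[] content) {
        String head = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"upload\"; filename=\"" + fileName + "\"\r\n"
                + "Content-Type: application/octet-stream\r\n"
                + "X-Tika-Skip-Embedded: true\r\n"
                + "\r\n";
        String tail = "\r\n--" + boundary + "--\r\n";
        byte[] h = head.getBytes(StandardCharsets.UTF_8);
        byte[] t = tail.getBytes(StandardCharsets.UTF_8);
        // Concatenate part headers, raw file bytes, and the closing boundary.
        byte[] body = new byte[h.length + content.length + t.length];
        System.arraycopy(h, 0, body, 0, h.length);
        System.arraycopy(content, 0, body, h.length, content.length);
        System.arraycopy(t, 0, body, h.length + content.length, t.length);
        return body;
    }

    public static void main(String[] args) {
        // Placeholder content; a real caller would pass the file's bytes.
        byte[] body = buildBody("sketch-boundary", "test.docx",
                "placeholder".getBytes(StandardCharsets.UTF_8));
        System.out.println(contentType("sketch-boundary"));
        System.out.println(new String(body, StandardCharsets.UTF_8));
    }
}
```

The resulting bytes would be sent as the body of a POST to /tika/form, with the value from contentType set as the request's Content-Type header. As noted in the last comment, the plain PUT to /tika is simpler when multipart is not required.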