I run an unmodified JAX-RS instance of Apache tika-server 1.22 and use it as an HTTP endpoint: our application posts files to it (mostly Office, PDF and RTF) and gets plain-text renditions back by sending the Accept: text/plain header with each request.
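For context, the requests look roughly like the following minimal Java sketch. It is an illustration under assumptions, not our actual application code: the base URL http://localhost:9998 and the class and method names are hypothetical, and it uses PUT against the /tika resource as described in the tika-server documentation.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

public class TikaTextRequest {

    // Builds a request asking tika-server for a plain-text rendition.
    // baseUrl is an assumption, e.g. "http://localhost:9998".
    static HttpRequest buildRequest(String baseUrl, byte[] document) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/tika"))
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: TikaTextRequest <baseUrl> <file>");
            return;
        }
        byte[] document = Files.readAllBytes(Path.of(args[1]));
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(buildRequest(args[0], document),
                      HttpResponse.BodyHandlers.ofString());
        // The body is the extracted text, which currently also includes
        // the text of any embedded documents.
        System.out.println(response.body());
    }
}
```

With the default configuration, the text returned by a request like this contains the contents of embedded documents as well as the main document.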
Since Tika 1.15, the default behaviour is to "extract all embedded documents" (TIKA-2096).
I want to be able to turn this behaviour off on our tika-server so that embedded documents are NOT extracted and I only get the text rendition of the main document contents.
Is it possible to do this via a tika-config.xml file, or do I need to do a custom build and subclass EmbeddedDocumentExtractor so that it doesn't do anything?
An answer to the question tika-parser-exclude-pdf-attachments indicates that you can turn this behaviour off by subclassing EmbeddedDocumentExtractor, but I'd like to check whether it's possible to do this via tika-config.xml without having to do a custom build of the tika-server.
I have looked at the Configuring Tika documentation, but it makes no mention of embedded documents.