Tika Parser: Exclude PDF Attachments

Question

There is a PDF document with attachments (here: joboptions files) that should not be extracted by Tika, and whose contents should not be sent to Solr. Is there any way to exclude certain (or all) PDF attachments in the Tika config?

How are you combining Tika with SOLR? If using Java code (the recommended way), just don't supply a recursing parser! — Gagravarr, Jun 12 '18 at 14:36
Yes, basically in Java code, but the default Tika config is being used. I cannot determine what the default parser for PDFs is (I assume PDFParser), and I cannot tell whether this parser recurses or how to configure that... — Daniel S., Jun 12 '18 at 15:38
Without your java code, there's not much we can do to help you! — Gagravarr, Jun 12 '18 at 15:53
The auto-detect parser with the default Tika config is being used. I am not sure what parser is being picked for PDFs under the hood, probably the PDFParser? AutoDetectParser autoDetectParser = new AutoDetectParser(tikaConfig); — Daniel S., Jun 13 '18 at 15:22

Tim Allison · Answer 1 · 2018-06-13T16:13:39.313

@gagravarr, we changed that behavior via TIKA-2096, Tika 1.15. The default is now "extract all embedded documents". To avoid parsing embedded documents call:

parseContext.set(Parser.class, new EmptyParser());

Or implement an EmbeddedDocumentExtractor that does nothing and pass it in via the ParseContext.

If you were using Solr DIH's TikaEntityProcessor, I'd set extractEmbedded to false, but you aren't; and please don't. :)

So, I don't think there's an easy way to turn off parsing of embedded documents only for PDFs, and I'm not sure you'd want to. What if there were an MSWord file attached to a PDF, for example?

If you want to ignore .joboptions, you could use a custom EmbeddedDocumentExtractor.

score 1 · Accepted Answer · answered Jun 19 '18 at 08:17

Implement a custom org.apache.tika.extractor.DocumentSelector and set it on the ParseContext. The DocumentSelector is called with the metadata of each embedded document and decides whether that embedded document should be parsed.

Example DocumentSelector:

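import org.apache.tika.extractor.DocumentSelector;
import org.apache.tika.metadata.Metadata;

public class CustomDocumentSelector implements DocumentSelector {

  // Return true to parse the embedded document, false to skip it.
  @Override
  public boolean select(Metadata metadata) {
    String resourceName = metadata.get(Metadata.RESOURCE_NAME_KEY);
    // Keep unnamed attachments; skip anything ending in ".joboptions".
    return resourceName == null || !resourceName.endsWith(".joboptions");
  }
}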

Register it on the ParseContext:

parseContext.set(DocumentSelector.class, new CustomDocumentSelector());
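
The name check inside select can also be pulled out into a small helper with no Tika dependency, which makes it easy to unit test and to extend to several suffixes. The following is only a sketch: AttachmentNameFilter and isAllowed are hypothetical names, not part of Tika, and the case-insensitive matching is an added assumption (the selector above compares case-sensitively).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

/**
 * Hypothetical helper (not part of Tika): decides from the file name alone
 * whether an embedded document should be parsed.
 */
final class AttachmentNameFilter {

    private final List<String> excludedSuffixes;

    AttachmentNameFilter(String... excludedSuffixes) {
        // Lower-case the suffixes once so that matching is case-insensitive.
        this.excludedSuffixes = Arrays.stream(excludedSuffixes)
                .map(s -> s.toLowerCase(Locale.ROOT))
                .collect(Collectors.toList());
    }

    /** Returns true if an attachment with this name should be parsed. */
    boolean isAllowed(String resourceName) {
        if (resourceName == null) {
            // No name available: keep the attachment, as the selector above does.
            return true;
        }
        String lower = resourceName.toLowerCase(Locale.ROOT);
        return excludedSuffixes.stream().noneMatch(lower::endsWith);
    }
}
```

With this in place, select could delegate to a shared instance, for example new AttachmentNameFilter(".joboptions"), by returning filter.isAllowed(metadata.get(Metadata.RESOURCE_NAME_KEY)).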
