How to Skip Headers and Footers Extraction using Apache Tika

Question

How can I extract text from documents (pdf, docx, doc, odt) without headers and footers using Apache Tika?

Please read [How to create a Minimal, Complete, and Verifiable example](http://stackoverflow.com/help/mcve). — P.J.Meisch, May 16 '17 at 06:42
See http://stackoverflow.com/questions/16862346/ignoring-header-footer-text-when-using-tika and https://coderanch.com/t/679868/Apache-Tika-Skipping-Header-footer . — Nicomedes E., May 16 '17 at 10:33
Grab as XHTML, strip out the Header and Footer divs, then downmix to plain text if required? — Gagravarr, May 16 '17 at 20:50
See http://stackoverflow.com/questions/42589076/apache-tika-how-to-extract-html-body-with-out-header-and-footer-content — User4567, May 18 '17 at 10:23
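
To illustrate the "strip out the Header and Footer divs" step from Gagravarr's comment, here is a minimal sketch that uses only the JDK's XML parser. It assumes the XHTML has already been produced by Tika as a string and that headers and footers arrive as `<div class="header">` and `<div class="footer">`; check the actual output for your format before relying on that. The class and method names (`HeaderFooterStripper`, `bodyTextWithoutHeaderFooter`) are made up for this example and are not part of Tika.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not a Tika class.
final class HeaderFooterStripper {

    /**
     * Parses an XHTML string, drops every div whose class attribute is
     * exactly "header" or "footer" (an assumption about the markup), and
     * returns the plain text left inside the body element.
     */
    static String bodyTextWithoutHeaderFooter(String xhtml) {
        Document doc;
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            doc = builder.parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }

        // Collect first, remove afterwards: the NodeList is live and would
        // shift if nodes were removed while iterating over it.
        NodeList divs = doc.getElementsByTagName("div");
        List<Element> toRemove = new ArrayList<>();
        for (int i = 0; i < divs.getLength(); i++) {
            Element div = (Element) divs.item(i);
            String cls = div.getAttribute("class");
            if (cls.equals("header") || cls.equals("footer")) {
                toRemove.add(div);
            }
        }
        for (Element div : toRemove) {
            div.getParentNode().removeChild(div);
        }

        // "Downmix" to plain text: prefer the body, fall back to the root.
        NodeList bodies = doc.getElementsByTagName("body");
        Node root = bodies.getLength() > 0 ? bodies.item(0) : doc.getDocumentElement();
        return root == null ? "" : root.getTextContent().trim();
    }
}
```

Calling `HeaderFooterStripper.bodyTextWithoutHeaderFooter(xhtml)` on a body that holds a header div, one paragraph and a footer div should return just the paragraph text. Note that `getTextContent()` simply concatenates text nodes, so paragraph breaks are lost unless you walk the tree yourself.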

score 1 · Answer 1 · answered May 18 '17 at 10:28

I tested this code with all of these file formats: some parse well (pdf and html), but it does not work for doc, docx, xlsx, or xls.

import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.html.BoilerpipeContentHandler;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

public class NewtikaXpath {
    public static void main(String[] args) throws IOException, SAXException, TikaException {
        AutoDetectParser parser = new AutoDetectParser();
        // Collects the plain text of the document body
        ContentHandler textHandler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        // Replace the placeholder with the URL of the document to parse
        try (InputStream stream = TikaInputStream.get(new URL("your favourite url"))) {
            // Boilerpipe filters boilerplate before the text reaches textHandler
            parser.parse(stream, new BoilerpipeContentHandler(textHandler), metadata);
            System.out.println("text:\n" + textHandler.toString());
        }
    }
}

Thanks for the help. This code works well for HTML files, but I need it for doc, docx, odt and pdf. @Lakshman — Ateeb Khan, May 19 '17 at 11:55

Asad · Answer 2 · 2019-07-04T19:57:54.500

1

You can do it programmatically. Here is how; it works for all Tika-supported documents, including docx, pptx, odt and pdf:

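import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.microsoft.OfficeParserConfig;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

ParseContext parseContext = new ParseContext();
AutoDetectParser parser = new AutoDetectParser();
ContentHandler contentHandler = new BodyContentHandler();
Metadata metadata = new Metadata();
// Ask the parser to leave headers and footers out of the extracted text
OfficeParserConfig officeParserConfig = new OfficeParserConfig();
officeParserConfig.setIncludeHeadersAndFooters(false);
parseContext.set(OfficeParserConfig.class, officeParserConfig);
// inputFileName is the path of the document to parse
try (InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName))) {
    parser.parse(inputStream, contentHandler, metadata, parseContext);
}
System.out.println(contentHandler.toString());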

edited Jul 04 '19 at 19:57

answered Mar 01 '18 at 14:14

Asad


OfficeParserConfig.class is not available with tika-parser V1.4. Could you please help me with this? – AndroidHacker Apr 30 '18 at 05:26
I would recommend updating your Tika to 1.8. Thanks – Asad Apr 30 '18 at 10:43
tika-parser or tika-core ? – AndroidHacker Apr 30 '18 at 10:51
You should use tika-parser – Asad Apr 30 '18 at 20:52
Yeah.. I am using Tika Parser only.. but with no success – AndroidHacker Apr 30 '18 at 20:52
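
If it is unclear whether the Tika jars on your classpath actually provide `OfficeParserConfig` and its `setIncludeHeadersAndFooters` method, a reflection check can tell you at runtime before you go hunting through versions. This is only a rough sketch using plain JDK reflection; `TikaFeatureCheck` and `hasMethod` are names invented for this example, not anything shipped with Tika.

```java
// Hypothetical helper, not a Tika class.
final class TikaFeatureCheck {

    /**
     * Returns true if the named class can be loaded and exposes a public
     * method with the given name and parameter types, false otherwise.
     */
    static boolean hasMethod(String className, String methodName, Class<?>... paramTypes) {
        try {
            Class<?> cls = Class.forName(className);
            cls.getMethod(methodName, paramTypes);
            return true;
        } catch (ClassNotFoundException | NoSuchMethodException | LinkageError e) {
            // Class missing, method missing, or the jar is broken/incompatible
            return false;
        }
    }
}
```

For the case discussed here the call would be `TikaFeatureCheck.hasMethod("org.apache.tika.parser.microsoft.OfficeParserConfig", "setIncludeHeadersAndFooters", boolean.class)`. A `false` result suggests the tika-parsers jar on the classpath is too old or is missing altogether.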
