Tika - retrieve main content from docs

Question

The GUI utility of Apache Tika provides an option for getting the main content (in addition to formatted text and structured text) of a given document or URL. I want to know which method is responsible for extracting the main content of the docs/URL, so that I can incorporate it in my program. I would also like to know whether any heuristic algorithm is used when extracting data from HTML pages, because sometimes I can't see the advertisements in the extracted content.

UPDATE: I found out that BoilerpipeContentHandler is responsible for it.

I have provided a solution using boilerpipe in the question below. http://stackoverflow.com/questions/42589076/apache-tika-how-to-extract-html-body-with-out-header-and-footer-content – Trinadh Gupta, Mar 08 '17 at 04:33

score 8 · Accepted Answer · answered Feb 08 '12 at 15:11

The "main content" feature in the Tika GUI is implemented using the BoilerpipeContentHandler class that relies on the boilerpipe library for the heavy lifting.

Jukka Zitting


Will it work only for HTML pages, or for all document types? From the Boilerpipe docs, I can see it mainly supports HTML pages only. – CrazyCoder Feb 09 '12 at 04:22
Also, can you tell me how to control whitespace and newlines in Tika output? The output of Tika contains a lot of extra whitespace and newlines. – CrazyCoder Feb 09 '12 at 04:32
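On the whitespace question in the comment above: one option that does not depend on Tika at all is to post-process the extracted string with a few regular expressions. The following is only a minimal sketch under that assumption. The class name `WhitespaceCleaner` and the method `normalize` are hypothetical and not part of Tika, and the policy chosen here (single spaces, at most one blank line between paragraphs) is just one reasonable choice.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: tidies whitespace in already-extracted text.
public class WhitespaceCleaner {
    // Horizontal whitespace (spaces, tabs, non-breaking spaces) around any line break.
    private static final Pattern AROUND_NEWLINE = Pattern.compile("\\h*\\R\\h*");
    // Remaining runs of horizontal whitespace inside a line.
    private static final Pattern HORIZONTAL_RUN = Pattern.compile("\\h+");
    // Three or more consecutive newlines, i.e. more than one blank line.
    private static final Pattern BLANK_LINES = Pattern.compile("\n{3,}");

    public static String normalize(String text) {
        if (text == null) {
            return "";
        }
        // 1. Unify line endings to \n and strip whitespace at both ends of each line.
        String s = AROUND_NEWLINE.matcher(text).replaceAll("\n");
        // 2. Collapse runs of spaces and tabs into a single space.
        s = HORIZONTAL_RUN.matcher(s).replaceAll(" ");
        // 3. Keep at most one blank line between paragraphs.
        s = BLANK_LINES.matcher(s).replaceAll("\n\n");
        // 4. Drop leading and trailing whitespace of the whole text.
        return s.trim();
    }

    public static void main(String[] args) {
        String raw = "\n\n  Title \t \n\n\n\n  First   paragraph.  \r\n Second line.\n\n";
        System.out.println(normalize(raw));
    }
}
```

In a Tika program this would be applied to the handler output, for example `WhitespaceCleaner.normalize(textHandler.toString())`. The `\h` and `\R` classes need Java 8 or later.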

score 0 · Answer 2 · answered Aug 13 '14 at 05:31

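// Requires Apache Tika on the classpath.
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Returns { title, body text } for the file at the given path.
public String[] tika_autoParser(String path) {
    String[] result = new String[2];
    // try-with-resources closes the stream even if parsing fails
    try (InputStream input = new FileInputStream(path)) {
        ContentHandler textHandler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        AutoDetectParser parser = new AutoDetectParser();
        ParseContext context = new ParseContext();
        // The parser detects the document type and streams the body text into the handler
        parser.parse(input, textHandler, metadata, context);
        result[0] = "Title: " + metadata.get(TikaCoreProperties.TITLE);
        result[1] = "Body: " + textHandler.toString();
    } catch (IOException | SAXException | TikaException e) {
        // FileNotFoundException is a subclass of IOException
        e.printStackTrace();
    }
    return result;
}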

score 0 · Answer 3 · answered Feb 07 '12 at 14:37

I believe this is powered by the BodyContentHandler, which fetches just the HTML contents of the document body. This can additionally be combined with other handlers to return just the plain text of the body, if required.

Gagravarr


It works in all the unit tests... I'd suggest you take a look at how it's used in those, and compare that to your use – Gagravarr Feb 08 '12 at 12:37
@Gagravarr "main content" is plain text with the boilerplate stripped (try experimenting with the Tika GUI to see what it is). – Alexey Tigarev Apr 27 '14 at 21:07
