
Update

Boilerpipe appears to work really well, but I realized that I don't need only the main content, because many pages don't have an article, only links with short descriptions pointing to the full texts (this is common in news portals), and I don't want to discard these short texts.

So, if an API can get the different textual parts/blocks, splitting them in some manner that distinguishes them from a single blob of text (everything in only one text is not useful), please report it.


The Question

I downloaded some pages from random sites, and now I want to analyze their textual content.

The problem is that a web page has a lot of content besides the main text: menus, advertising, banners, etc.

I want to try to exclude everything that is not related to the main content of the page.

Taking this page as an example, I don't want the menus at the top nor the links in the footer.

Important: all pages are HTML and come from many different sites. I need suggestions on how to exclude this kind of content.

At the moment, I'm thinking of excluding content inside "menu" and "banner" classes in the HTML, plus runs of consecutive words that look like proper names (first letter capitalized).

The solutions can be based on the text content (without HTML tags) or on the HTML content (with the HTML tags).

Edit: I want to do this inside my Java code, not with an external application (if possible).

I tried parsing the HTML content as described in this question: https://stackoverflow.com/questions/7035150/how-to-traverse-the-dom-tree-using-jsoup-doing-some-content-filtering
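To illustrate, the class-based exclusion idea could look roughly like this with Jsoup (just a sketch; `myHtml` stands for a page I already downloaded, and the class names are guesses that would vary per site):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// sketch: strip elements whose class suggests navigation or ads,
// then read whatever text remains
Document doc = Jsoup.parse(myHtml);
doc.select(".menu, .banner").remove();
String remainingText = doc.body().text();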

Renato Dinhani

9 Answers


Take a look at Boilerpipe. It is designed to do exactly what you're looking for: remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.

There are a few ways to feed HTML into Boilerpipe and extract the text.

You can use a URL:

ArticleExtractor.INSTANCE.getText(url);

You can use a String:

ArticleExtractor.INSTANCE.getText(myHtml);

There are also options to use a Reader, which opens up a large number of possibilities.
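For instance, a minimal end-to-end sketch (the URL is a placeholder, and this assumes the boilerpipe jar and its dependencies are on the classpath):

import java.net.URL;

import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class BoilerpipeDemo {
    public static void main(String[] args) throws Exception {
        // fetch the page and extract the main article text in one call
        URL url = new URL("http://example.com/some-article.html");
        String text = ArticleExtractor.INSTANCE.getText(url);
        System.out.println(text);
    }
}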

pentaphobe
  • Do you know how to use BoilerPipe with a previously downloaded HTML page? – Renato Dinhani Aug 14 '11 at 05:04
  • Thanks, when I asked, I didn't see all the documentation and was getting an exception because I was using the `process()` method. I tried this lib and it appears good. At the moment, the default extractor presented the best results. – Renato Dinhani Aug 14 '11 at 15:11
  • But how do you filter content by some tag like div#id? When it extracts the content, it uses its own labels. – surajz Oct 04 '11 at 16:29

You can also use boilerpipe to segment the text into blocks of full-text/non-full-text, instead of just returning one of them (essentially, boilerpipe segments first, then returns a String).

Assuming you have your HTML accessible from a java.io.Reader, just let boilerpipe segment the HTML and classify the segments for you:

import java.io.Reader;

import org.xml.sax.InputSource;

import de.l3s.boilerpipe.document.TextBlock;
import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;

Reader reader = ...
InputSource is = new InputSource(reader);

// parse the document into boilerpipe's internal data structure
TextDocument doc = new BoilerpipeSAXInput(is).getTextDocument();

// perform the extraction/classification process on "doc"
ArticleExtractor.INSTANCE.process(doc);

// iterate over all blocks (= segments as "ArticleExtractor" sees them)
for (TextBlock block : doc.getTextBlocks()) {
    // block.isContent() tells you if it's likely to be content or not
    // block.getText() gives you the block's text
}

TextBlock has some more exciting methods, feel free to play around!

Christian Kohlschütter

There appears to be a possible problem with Boilerpipe. Why? Well, it appears that it is suited to certain kinds of web pages, such as web pages that have a single body of content.

So one can crudely classify web pages into three kinds in respect to Boilerpipe:

  1. a web page with a single article in it (Boilerpipe worthy!)
  2. a web page with multiple articles in it, such as the front page of the New York Times
  3. a web page that really doesn't have any article in it, but has some content in the form of links, and may also have some degree of clutter.

Boilerpipe works on case #1. But if one is doing a lot of automated text processing, then how does one's software "know" what kind of web page it is dealing with? If the web page itself could be classified into one of these three buckets, then Boilerpipe could be applied for case #1. Case #2 is a problem, and case #3 is a problem as well; it might require an aggregate of related web pages to determine what is clutter and what isn't.
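One crude way to attempt that classification would be to look at the blocks boilerpipe itself produces (a purely hypothetical heuristic building on the block API shown above; the thresholds are made up and would need tuning against real pages):

import de.l3s.boilerpipe.document.TextBlock;
import de.l3s.boilerpipe.document.TextDocument;

// assumes "doc" has already been run through
// ArticleExtractor.INSTANCE.process(doc) as in the earlier answer
static String guessPageKind(TextDocument doc) {
    int contentBlocks = 0;
    int totalWords = 0;
    for (TextBlock block : doc.getTextBlocks()) {
        if (block.isContent()) {
            contentBlocks++;
            totalWords += block.getNumWords();
        }
    }
    if (contentBlocks == 0) {
        return "no real article (case #3)";
    } else if (contentBlocks <= 3 && totalWords > 200) {
        return "single article (case #1)";
    } else {
        return "multiple articles / link page (case #2)";
    }
}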

Stefan

You can use libs like goose. It works best on articles/news. You can also check JavaScript code that does similar extraction to goose: the readability bookmarklet.
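For reference, calling goose from Java looks roughly like this (a sketch based on the Gravity Labs project's documented API; verify the names against the version you actually depend on, and the URL is a placeholder):

import com.gravity.goose.Article;
import com.gravity.goose.Configuration;
import com.gravity.goose.Goose;

// sketch of goose usage: extract the cleaned article text from a URL
Configuration config = new Configuration();
Goose goose = new Goose(config);
Article article = goose.extractContent("http://example.com/some-article.html");
System.out.println(article.cleanedArticleText());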

Felipe Hummel

My first instinct was to go with your initial method of using Jsoup. At least with that, you can use selectors and retrieve only the elements that you want (i.e. `Elements posts = doc.select("p");`) and not have to worry about the other elements with random content.

On the matter of your other post, was the issue of false positives your only reason for moving away from Jsoup? If so, couldn't you just tweak the MIN_WORDS_SEQUENCE value or be more selective with your selectors (i.e. not retrieve div elements)?
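For instance, a small sketch of that selector-driven approach (the URL is a placeholder):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

Document doc = Jsoup.connect("http://example.com").get();

// retrieve only the elements you care about; they come back in document order
for (Element p : doc.select("p")) {
    System.out.println(p.text());
}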

Aaron Foltz
  • The problem with the other question is the way I was running over the elements and printing the text. I was printing a parent element, and afterwards a child element. This way, the text comes out of order. It will never work that way. `Hello beautiful world!` will print `Hello world! beautiful` – Renato Dinhani Aug 14 '11 at 15:17
  • I see. I don't see a way of doing that in order unless you just select the parent element and get the text of that element only. – Aaron Foltz Aug 14 '11 at 15:37

http://kapowsoftware.com/products/kapow-katalyst-platform/robo-server.php

Proprietary software, but it makes it very easy to extract content from web pages and integrates well with Java.

You use a provided application to design XML files that the RoboServer API reads to parse web pages. You build the XML files by analyzing the pages you wish to parse inside the provided application (fairly easy) and applying rules for gathering the data (websites generally follow the same patterns). You can set up scheduling, running, and DB integration using the provided Java API.

If you're against using third-party software and want to do it yourself, I'd suggest not trying to apply one rule to all sites. Find a way to separate tags and then build per-site rules.

getn_outchea

You could use the textracto api: it extracts the main 'article' text, and there is also the option to extract all other textual content. By 'subtracting' these texts you could separate the navigation texts, preview texts, etc. from the main textual content.

David L-R

You're looking for what are known as "HTML scrapers" or "screen scrapers". Here are a couple of links to some options for you:

Tag Soup

HTML Unit

Jared Ng
  • I believe you both misunderstand the sole purpose of HTML Unit. The OP needs to extract data, not to act like a GUI-less web browser. Related: http://stackoverflow.com/questions/3152138/what-are-the-pros-and-cons-of-the-leading-java-html-parsers – BalusC Aug 19 '11 at 20:50
  • HTML Unit is a powerful tool that can be used for tasks equivalent to screen scraping. I don't see the issue with using it for this task. – Jared Ng Aug 19 '11 at 20:52
  • The question is asking for *more* than HTML parsing or screen scraping. For example: [Boilerpipe](http://code.google.com/p/boilerpipe/), [Readability](http://www.readability.com/developers), [HTML::ContentExtractor](http://search.cpan.org/~jzhang/HTML-ContentExtractor-0.02/lib/HTML/ContentExtractor.pm). – David J. Aug 05 '12 at 16:58

You can filter the HTML junk and then parse the required details, or use the APIs of the existing site. Refer to the link below to filter the HTML; I hope it helps. http://thewiredguy.com/wordpress/index.php/2011/07/dont-have-an-apirip-dat-off-the-page/

Tushar