Background
I'm trying to read and analyze content from web pages, focusing on the main content of the page, without menus, sidebars, scripts, and other HTML clutter.

What have I tried?

  • I've tried NReadability, but it throws exceptions and fails in too many cases. Other than that, it is a good solution.
  • HTML Agility Pack is not what I need here, because I do want to get rid of non-content code.

EDIT: I'm looking for a library that actually sifts through the content and gives me only the "relevant" text from the page (i.e. for this page, the words "review", "chat", "meta", "about", and "faq" from the top bar would not show up, nor would "user contributions licensed under").

So, do you know any other stable .NET library for extracting content from websites?

seldary

2 Answers

I don't know if this is still relevant, but this is an interesting question I run into a lot, and I haven't seen much material on the web that covers it.

I've spent several months implementing a tool that does this myself. Due to contract obligations I cannot share the tool freely, but I am free to share some advice about what you can do.

The Sad Truth :(

I can assure you that we tried every available option before undertaking the task of creating a readability tool ourselves. None of the existing tools was satisfactory for what we needed.

So, you want to extract content?

Great! You will need a few things:

  1. A tool for handling the page's HTML. I use CsQuery, which is what Jamie suggested in the other answer here. It works great for selecting elements.
  2. A programming language (That's C# in this example, any .NET language will do!)
  3. A tool that lets you download the pages themselves. CsQuery can do this on its own with CQ.CreateFromUrl, but you can create your own helper class for downloading the page if you want to pre-process it and get finer-grained control over the headers. (Try playing with the user agent, looking for mobile versions, etc.) A minimal download sketch follows this list.
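
For illustration, here is a minimal download helper along the lines of point 3. This is only a sketch: the PageDownloader name and the user-agent string are made up, and WebClient is just one of several ways to fetch a page in .NET 4.

using System.Net;
using CsQuery;

// Hypothetical helper: downloads a page with a custom user agent,
// then hands the raw HTML to CsQuery for parsing.
public static class PageDownloader
{
    public static CQ Download(string url, string userAgent)
    {
        using (var client = new WebClient())
        {
            // Some sites serve a leaner, easier-to-parse page to mobile user agents.
            client.Headers[HttpRequestHeader.UserAgent] = userAgent;
            string html = client.DownloadString(url);
            return CQ.Create(html);
        }
    }
}

// Usage (hypothetical URL):
// var dom = PageDownloader.Download("http://example.com/article",
//     "Mozilla/5.0 (iPhone; CPU iPhone OS 5_0 like Mac OS X)");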

Ok, I'm all set up, what's next?

There is surprisingly little research in the field of content extraction. A piece that stands out is Boilerplate Detection using Shallow Text Features. You can also read an answer on Stack Overflow from the paper's author to see how Readability works and what some of the approaches are.

Here are some more papers I enjoyed:

I'm done reading, what's done in practice?

From my experience the following are good strategies for extracting content:

  • Simple heuristics: Filtering <header> and <nav> tags, removing lists that contain only links, removing the entire <head> section, and giving elements a negative or positive score based on their names, then removing the ones with the lowest score (for example, a div whose class contains the word navigation might get a lower score). This is how Readability works. A rough scoring sketch appears after this list.

  • Meta-content: Analyzing the density of links relative to text is a powerful tool on its own. Compare the amount of link text to the amount of overall text and work with that; the densest text is usually where the content is. CsQuery makes it easy to compare the amount of text in an element to the amount of text in its nested link tags (see the link-density sketch after this list).

  • Templating: Crawl several pages on the same website and analyze the differences between them; the constant parts are usually the page layout, navigation, and ads, so you can filter based on those similarities. This template-based approach is very effective. The trick is to come up with an efficient algorithm for detecting the template and keeping track of it.

  • Natural language processing: This is probably the most advanced approach here. With NLP tools it is relatively simple to detect paragraphs and text structure, and thus where the actual content starts and ends.

  • Learning: Machine learning is a very powerful concept for this sort of task. In its most basic form, this involves creating a program that 'guesses' which HTML elements to remove, checks its guesses against a set of pre-defined results from a website, and learns which patterns are OK to remove. In my experience this approach works best when trained separately for each site.

  • Fixed list of selectors: Surprisingly, this is extremely potent and people tend to forget about it. If you are scraping a specific few sites, using selectors and manually extracting the content is probably the fastest thing to do. Keep it simple if you can :)
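
To make the first bullet concrete, here is a rough sketch of the simple-heuristics idea using CsQuery. This is not Readability's actual algorithm; the hint lists, the 25-point weights, and the removal threshold are invented for illustration.

using System;
using System.Linq;
using CsQuery;

public static class HeuristicFilter
{
    // Invented hint lists - tune them against the pages you actually scrape.
    static readonly string[] NegativeHints = { "nav", "menu", "sidebar", "footer", "comment", "ad" };
    static readonly string[] PositiveHints = { "content", "article", "post", "main", "text" };

    public static CQ Clean(CQ dom)
    {
        // Obvious non-content sections go first.
        dom["head, nav, header, footer, script, style"].Remove();

        foreach (var el in dom["div, section, ul"].ToList())
        {
            var cq = el.Cq();
            string names = ((cq.Attr("id") ?? "") + " " + (cq.Attr("class") ?? "")).ToLowerInvariant();

            int score = 0;
            score -= 25 * NegativeHints.Count(h => names.Contains(h));
            score += 25 * PositiveHints.Count(h => names.Contains(h));

            // Lists that consist almost entirely of links are usually navigation.
            if (el.NodeName.Equals("ul", StringComparison.OrdinalIgnoreCase)
                && cq.Find("a").Text().Trim().Length >= cq.Text().Trim().Length)
            {
                score -= 25;
            }

            if (score < 0)
                cq.Remove();
        }

        return dom;
    }
}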
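
And a sketch of the link-density idea from the second bullet, again with CsQuery. The 0.3 density threshold and the 200-character minimum are arbitrary numbers, not recommendations.

using System.Collections.Generic;
using System.Linq;
using CsQuery;

public static class LinkDensity
{
    // Candidate content blocks: long enough text, with relatively few links inside.
    public static IEnumerable<IDomObject> FindContentBlocks(
        CQ dom, double maxDensity = 0.3, int minTextLength = 200)
    {
        return dom["div, section, article, td"].Where(el =>
        {
            var cq = el.Cq();
            int textLength = cq.Text().Trim().Length;
            if (textLength < minTextLength)
                return false;

            // Share of the text that lives inside <a> tags; high values usually mean navigation.
            int linkTextLength = cq.Find("a").Text().Trim().Length;
            return (double)linkTextLength / textLength <= maxDensity;
        });
    }
}

In practice you would then rank the surviving candidates, for example by text length, and treat the best one as the main content.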

In Practice

Mix and match: a good solution usually combines more than one strategy. We ended up with something quite complex because we use it for a complex task. In practice, content extraction is a really complicated task. Don't try to create something very general; stick to the content you need to scrape. Test a lot: unit tests and regression tests are very important for this sort of program. Always compare against Readability and read its code; it's pretty simple and it'll probably get you started.

Best of luck, let me know how this goes.

Benjamin Gruenbaum

CsQuery: https://github.com/jamietre/csquery

It's a .NET 4 jQuery port. Getting rid of non-content nodes can be done in a number of ways: use the .Text method to just grab everything as a string, or filter for text nodes, e.g.

var dom = CQ.CreateFromUrl(someUrl); 
// or var dom = CQ.Create(htmlText);

IEnumerable<string> allTextStrings = dom.Select("*")
            .Contents()
            .Where(el => el.NodeType == NodeType.TEXT_NODE)
            .Select(el => el.NodeValue);

It works the same as jQuery, except, of course, you also have the .NET framework and LINQ to make your life easier. The Select selects all nodes in the DOM, then Contents selects all children of each (including text nodes). That's it for CsQuery; then with LINQ the Where filters for only text nodes, and the Select gets the actual text out of each node.

This will include a lot of whitespace, since it returns everything. If you simply want a blob of text for the whole page, just

string text = dom.Select("body").Text();

will do it. The Text method coalesces whitespace so there will be a single space between each piece of actual text.

Jamie Treworgy
  • This seems like another form of HtmlAgilityPack. Looks nice, but not what I need - see my edit. – seldary Jun 08 '12 at 09:27
  • Oh - I thought your problem with HAP was difficulty in extracting the text vs. structural nodes. I am not sure how you would determine what qualifies as "relevant"; this seems like an AI problem, but I would think you could do fairly well just by ignoring any text node that's below some arbitrary # of characters. Trying to decide what's "main content" and what's "sidebar", in any way other than just the simple size of the text, is going to be darn near impossible without actually knowing what you are looking for in the content. – Jamie Treworgy Jun 08 '12 at 10:05
  • NReadability, Instapaper, and readability.com are a few examples of products that do just that (more or less). It is possible, and I'm not looking for something perfect, which is impossible. – seldary Jun 08 '12 at 11:25
  • Well, if you don't need the benefits of a lot of heuristic analysis that those tools probably do (e.g. don't need perfection) I think you could do all right with a basic algorithm that would be easy to implement with csquery. e.g. strip out inline tags (e.g. span, i, b); keep all headers; and throw away remaining block elements that contain, say, less than 80 characters. I bet that would eliminate all the layout content on most web pages. Anyway I understand your problem now - too bad NReadability doesn't work better - but you might take a crack at implementing something basic. – Jamie Treworgy Jun 08 '12 at 12:48
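
For what it's worth, here is a rough sketch along the lines of the basic algorithm Jamie describes in the last comment, assuming CsQuery. The 80-character cutoff comes from the comment; the selectors and everything else are assumptions, not a tested recipe.

using System.Linq;
using CsQuery;

public static class BasicExtractor
{
    public static string Extract(string html)
    {
        var dom = CQ.Create(html);

        // Throw away the obvious non-content machinery first.
        dom["script, style, head, nav, footer"].Remove();

        // Keep headers; drop remaining block elements with very little text.
        // (Inline tags like span/i/b need no special handling here, because
        // .Text() already folds their text into the surrounding block.)
        foreach (var el in dom["div, p, li, td, section"].ToList())
        {
            var cq = el.Cq();
            bool hasHeader = cq.Find("h1, h2, h3, h4, h5, h6").Length > 0;
            if (!hasHeader && cq.Text().Trim().Length < 80)
                cq.Remove();
        }

        return dom["body"].Text();
    }
}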