
http://lab.arc90.com/experiments/readability/ is a very handy tool for viewing cluttered newspaper, journal, and blog pages in a readable manner. It uses a set of heuristics to find the relevant main text of a web page. Its source code is also available at http://lab.arc90.com/experiments/readability/js/readability.js

A colleague of mine drew my attention to this while I was struggling with jQuery to grab the "main text" of arbitrary newspaper, journal, or blog websites. My current heuristic (and its jQuery implementation) looks like this (it runs inside a Firefox Jetpack package):

$(doc).find("div > p").each(function (index) {
    var textStr = $(this).text();
    // We need the pieces of text that are long and in natural language,
    // not JS code snippets. Note: indexOf() returns -1 when the substring
    // is absent, so test for === -1 rather than <= 0 (which would also
    // accept text starting with "<script").
    if (textStr.length > MIN_TEXT_LENGTH && textStr.indexOf("<script") === -1) {
        console.log(index);
        console.log(textStr.length);
        console.log(textStr);
        $(this).attr("id", "clozefox_paragraph_" + index);
        results.push(index);

        wholeText = wholeText + " " + textStr;
    }
});

So it is something like "grab the paragraphs inside DIVs and reject irrelevant strings such as 'script'". I have tried this, and most of the time it can grab the main text of web articles, but I'd like a better heuristic, or maybe a better (and even shorter?) jQuery selection mechanism.

Do you have better suggestions?

PS: Maybe "Find the innermost DIVs (that is, those without any child DIVs) and grab their paragraphs only" would be a better heuristic for my current purpose, but I couldn't find out how to express this in jQuery.
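The PS heuristic can actually be expressed directly in jQuery with the `:has()` selector: `$(doc).find("div:not(:has(div)) > p")` selects paragraphs whose parent DIV contains no nested DIV. Below is a minimal sketch of the same idea on a plain object tree, so it can run outside a browser; the `{ tag, children, text }` node shape is an assumption made for illustration, not part of the original code.

```javascript
// Collect the text of <p> children of "innermost" divs, i.e. divs
// that have no div among their own children. Nodes are plain objects
// of the form { tag, children, text } standing in for DOM elements.

function hasDivChild(node) {
  return (node.children || []).some(function (c) { return c.tag === "div"; });
}

function innermostDivParagraphs(node, out) {
  out = out || [];
  if (node.tag === "div" && !hasDivChild(node)) {
    (node.children || []).forEach(function (child) {
      if (child.tag === "p" && child.text) {
        out.push(child.text);
      }
    });
  }
  (node.children || []).forEach(function (child) {
    innermostDivParagraphs(child, out);
  });
  return out;
}

// The outer div contains a nested div, so its own <p> is skipped;
// only the inner div's paragraphs are collected.
var page = {
  tag: "div",
  children: [
    { tag: "p", text: "sidebar blurb" },
    { tag: "div",
      children: [
        { tag: "p", text: "First paragraph of the article." },
        { tag: "p", text: "Second paragraph." }
      ] }
  ]
};
// innermostDivParagraphs(page)
// -> ["First paragraph of the article.", "Second paragraph."]
```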

Charles Stewart
Emre Sevinç
  • Either I could not express myself clearly or most of the viewers think it is not easy to go beyond the functionality of the READABILITY js code... – Emre Sevinç Dec 22 '09 at 22:01
  • 1
    Cf. my question, http://stackoverflow.com/questions/1962389/what-is-the-state-of-the-art-in-html-content-extraction & maybe the tag html-content-extraction is relevant? – Charles Stewart Dec 26 '09 at 01:24
  • Charles, thank you very much for directing me to your question and resources! :) – Emre Sevinç Jan 06 '10 at 14:32
  • not sure how feasible it'll be with just jQuery... seems like a server-side language could make things like text processing a lot easier... – Quang Van Mar 21 '11 at 08:35

2 Answers


This is generally done by analyzing the "link density" of elements on a page. The higher the link density, the more likely it is not content. Here is a great place to get started with thinking about content extraction techniques and algorithms: http://www.quora.com/Whats-the-best-method-to-extract-article-text-from-HTML-documents
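The link-density idea from this answer can be sketched as a small function. With jQuery the two lengths could come from `$(el).text().length` and `$(el).find("a").text().length` (both standard jQuery calls); the 0.33 threshold below is a guessed value for illustration, not one given in the answer.

```javascript
// Link density = (characters of text inside <a> tags) / (all text
// characters in the element). Menus and tag clouds score near 1.0,
// article body text scores near 0.0.

function linkDensity(totalTextLength, linkTextLength) {
  if (totalTextLength === 0) return 1; // no text at all: treat as boilerplate
  return linkTextLength / totalTextLength;
}

// Keep elements whose density stays below a threshold (0.33 here is
// only a guess for illustration).
function looksLikeContent(totalTextLength, linkTextLength, threshold) {
  return linkDensity(totalTextLength, linkTextLength) < (threshold || 0.33);
}

// A navigation bar: 40 of its 50 characters are link text -> 0.8
// An article paragraph: 10 of its 400 characters are link text -> 0.025
```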

lightyrs

Most articles have a rectangular column of text. Try taking some combination of the dimensions of the element and the number of words it contains (including its children). You probably want to favor narrow, tall divs.

Something like probability of main text = (number of words) * (height / width) would be a good start.
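The formula in this answer can be sketched as a plain function. In a real page the dimensions would come from the browser, e.g. jQuery's `$(el).width()` and `$(el).height()`; here they are passed in as numbers.

```javascript
// score = (number of words) * (height / width): tall, narrow,
// word-dense elements (article columns) score high; short, wide
// elements (banners, navigation strips) score low.

function mainTextScore(text, widthPx, heightPx) {
  if (widthPx === 0) return 0; // guard against hidden/zero-width elements
  var wordCount = text
    .split(/\s+/)
    .filter(function (w) { return w.length > 0; })
    .length;
  return wordCount * (heightPx / widthPx);
}

// A 400px-wide, 800px-tall column with 4 words scores 4 * 2 = 8;
// an 800px-wide, 100px-tall banner with 2 words scores 2 * 0.125 = 0.25.
```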

Nathan Rivera