http://lab.arc90.com/experiments/readability/ is a very handy tool for viewing cluttered newspaper, journal and blog pages in a much more readable form. It does this by applying some heuristics to find the relevant main text of a web page. Its source code is also available at http://lab.arc90.com/experiments/readability/js/readability.js
A colleague of mine drew my attention to this while I was struggling with jQuery to grab the "main text" of any newspaper | journal | blog | etc. website. My current heuristic (and its jQuery implementation) looks something like this (it runs inside a Firefox Jetpack package):
$(doc).find("div > p").each(function (index) {
var textStr = $(this).text();
/*
We need the pieces of text that are long and in natural language,
and not some JS code snippets
*/
if(textStr.length > MIN_TEXT_LENGTH && textStr.indexOf("<script") <= 0) {
console.log(index);
console.log(textStr.length);
console.log(textStr);
$(this).attr("id", "clozefox_paragraph_" + index);
results.push(index);
wholeText = wholeText + " " + textStr;
}
});
So the idea is roughly "go grab the paragraphs inside DIVs and filter out irrelevant strings like 'script'". I have tried this, and most of the time it can grab the main text of web articles, but I'd like a better heuristic or maybe a better (and perhaps shorter?) jQuery selection mechanism.
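For example, I wonder whether expressing the same check with jQuery's .filter() would count as shorter or cleaner. An untested sketch, reusing the same MIN_TEXT_LENGTH, results and wholeText variables as above:

// Same heuristic, but with the length/script check pulled into a filter():
$(doc).find("div > p")
    .filter(function () {
        var textStr = $(this).text();
        return textStr.length > MIN_TEXT_LENGTH && textStr.indexOf("<script") === -1;
    })
    .each(function (index) {
        // Note: index here counts only the paragraphs that survived the filter.
        $(this).attr("id", "clozefox_paragraph_" + index);
        results.push(index);
        wholeText = wholeText + " " + $(this).text();
    });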
Do you have better suggestions?
PS: Maybe "Find the innermost DIVs (that is without any child elements of DIV type) and go grab their
s only" would be a better heuristic for my current purpose but I couldn't find out how to express this in jQuery.