
Wondering if anybody could point me in the direction of academic papers or related implementations of heuristic approaches to finding the primary content (the real "meat") of a particular webpage.

Obviously this is not a trivial task, since the problem description is so vague, but I think that we all have a general understanding about what is meant by the primary content of a page.

For example, it may include the story text for a news article, but might not include any navigational elements, legal disclaimers, related story teasers, comments, etc. Article titles, dates, author names, and other metadata fall in the grey category.

I imagine that the application value of such an approach is large, and would expect Google to be using it in some way in their search algorithm, so it would appear to me that this subject has been treated by academics in the past.

Any references?

Joel Coehoorn
Kevin Dolan

1 Answer


One way to look at this would be as an information extraction problem.

As such, one high-level algorithm would be to collect multiple examples of the same page type and deduce parsing (or extraction) rules for the parts of the page that differ between examples, since those differing parts are likely to be the main content. The intuition is that common boilerplate (header, footer, navigation, etc.) and ads will appear across multiple examples of those web pages, so by training on a few of them you can quickly learn to identify this repeated markup and subsequently ignore it. It's not foolproof, but this is also the basis of web scraping technologies, both commercial and academic, like RoadRunner:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.21.8672&rep=rep1&type=pdf

The citation is:

Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo: RoadRunner: Towards Automatic Data Extraction from Large Web Sites. VLDB 2001: 109-118
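To make the cross-page intuition concrete, here's a minimal sketch (my own illustrative helper, not the RoadRunner algorithm itself): collect the text blocks from several pages of the same type, treat any block that recurs on most of them as boilerplate, and keep the rest as content.

```python
# Sketch: identify per-page content by discarding blocks shared across pages.
# `pages` is a hypothetical pre-processed input: one list of text blocks
# (e.g. paragraphs) per page; real pages would need HTML parsing first.
from collections import Counter

def extract_content(pages, boilerplate_ratio=0.8):
    """Return, for each page, the blocks that are NOT shared boilerplate.

    A block counts as boilerplate if it appears on at least
    `boilerplate_ratio` of the example pages.
    """
    # Count each distinct block once per page it appears on.
    counts = Counter(block for page in pages for block in set(page))
    threshold = boilerplate_ratio * len(pages)
    return [[b for b in page if counts[b] < threshold] for page in pages]

pages = [
    ["Site Nav", "Story A text", "Copyright 2009"],
    ["Site Nav", "Story B text", "Copyright 2009"],
    ["Site Nav", "Story C text", "Copyright 2009"],
]
print(extract_content(pages))
# -> [['Story A text'], ['Story B text'], ['Story C text']]
```

The nav and copyright blocks appear on every page, so they fall above the threshold and are dropped; the story text is unique to each page and survives. RoadRunner does something far more structured (it infers wrapper grammars from the HTML), but the underlying signal is the same: repetition across examples marks boilerplate.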

There's also a well-cited survey of extraction technologies:

Alberto H. F. Laender, Berthier A. Ribeiro-Neto, Altigran S. da Silva, Juliana S. Teixeira: A brief survey of web data extraction tools. ACM SIGMOD Record 31(2), June 2002. doi:10.1145/565117.565137

kvista