I'd like to have a script where I can input a URL and it will intelligently grab the first paragraph of the article... I'm not sure where to begin other than just pulling text from within <p> tags. Do you know of any tips/tutorials on how to do this kind of thing?

Update

For further clarification: I'm building a section of my site where users can submit links, as on Facebook. The script will grab an image from the linked site as well as some text to go with the link. I'm using PHP and trying to determine the best method of doing this.

I say "intelligently" because I'd like to get the content on that page that's important: not just the first paragraph on the page, but the first paragraph of the most important content.

Ben
  • Don't do it with a regex; you're just in for a world of pain, then. Besides, `<p>` doesn't have to be closed in normal HTML.

    – Joey Jan 11 '11 at 15:14
  • On most sites the most important paragraph will be after the h1 tag. – pritaeas Jan 11 '11 at 15:14
  • Are they only local URL? i.e. do you know the structure of the page you're grabbing in advance? – Nabab Jan 11 '11 at 15:45
  • *(related)* [Best Methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html/3577662#3577662) – Gordon Jan 11 '11 at 15:49
  • This cannot be answered unless we know what constitutes "the first paragraph of the article". Also, you might want to elaborate on what you consider "intelligently". Without these clarifications, your best bet is familiarizing yourself with one of the parsers given in the *related* link. – Gordon Jan 11 '11 at 15:50
  • Clarification added - thx for the link – Ben Jan 11 '11 at 16:01
  • or grab the head tags for title/description – dqhendricks Jan 11 '11 at 17:15
  • dqh - Yes, but if those don't exist... – Ben Jan 12 '11 at 18:27

3 Answers

If the page you want to grab is on another site, or is local but you don't know its structure in advance, I'd say the best way to do this is with PHP's DOM functions.

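    function get_first_paragraph($url)
    {
      /* Fetches the page; file_get_contents() returns false on failure */
      $page = @file_get_contents($url);
      if ($page === false || trim($page) === '') {
        return null;
      }
      /* Real-world HTML is rarely valid, so keep libxml's warnings quiet */
      libxml_use_internal_errors(true);
      $doc = new DOMDocument();
      $doc->loadHTML($page);
      libxml_clear_errors();
      /* Gets all the paragraphs */
      $p = $doc->getElementsByTagName('p');
      /* Extracts the first one; item() returns null if there is none */
      $p = $p->item(0);
      /* Returns the paragraph's content, or null if no paragraph was found */
      return $p === null ? null : $p->textContent;
    }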
Nabab

Short answer: you can't.

In order to have a PHP script "intelligently" fetch the "most important" content from a page, the script would have to understand the content on the page. PHP is no natural language processor, nor is this a trivial area of study. There might be some NLP toolkits for PHP, but I doubt it would be easy even then.

A solution that can be achieved with reasonable effort would be to fetch the entire page with an HTML parser and then look for elements with certain class names or ids commonly found in blog engines. You could also parse for hAtom Microformats. Or you could look for meta tags within the document and other, more clearly defined information.
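As a rough sketch of that approach, the function below tries a list of XPath queries in order and returns the first non-empty text it finds. The function name `extract_summary` and the particular ids (`content`, `main`) are assumptions made for illustration; `entry-content` is the hAtom class for an entry's body. A real list would need tuning against the sites you actually expect.

```php
<?php
// Hypothetical helper: returns a short description for a page, or null.
function extract_summary(string $html): ?string
{
    if (trim($html) === '') {
        return null;
    }

    // Real-world HTML is rarely valid, so collect libxml warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return null;
    }

    $xpath = new DOMXPath($doc);

    // Tried in order, most explicit first. The ids are guesses, not a standard.
    $queries = [
        '//meta[@property="og:description"]/@content',
        '//meta[@name="description"]/@content',
        '//*[contains(concat(" ", normalize-space(@class), " "), " entry-content ")]//p',
        '//*[@id="content" or @id="main"]//p',
        '//p',
    ];

    foreach ($queries as $query) {
        foreach ($xpath->query($query) as $node) {
            // Collapse runs of whitespace so the result is a single line.
            $text = trim(preg_replace('/\s+/', ' ', $node->textContent));
            if ($text !== '') {
                return $text;
            }
        }
    }

    return null;
}
```

Putting the meta tags first means a page that describes itself wins over any guess about its markup; the bare `//p` query at the end is the same last resort as grabbing the first paragraph.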

Gordon

I wrote a Python script a while ago to extract a web page's main article content. It uses a heuristic to scan all text nodes in a document and group together nodes at similar depths, and then assume the largest grouping is the main article.
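For readers working in PHP, here is a much-simplified sketch of the same idea; it is not the script mentioned above. It assumes that "similar depth" can be approximated by exact depth in the DOM tree, and the function name `guess_main_text` is made up for the example. It sums the text found at each depth, picks the depth holding the most characters, and returns the text of the first element at that depth.

```php
<?php
// Hypothetical helper: a simplified depth-grouping heuristic.
function guess_main_text(string $html): ?string
{
    if (trim($html) === '') {
        return null;
    }

    // Real-world HTML is rarely valid, so collect libxml warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return null;
    }

    $xpath = new DOMXPath($doc);
    $textNodes = $xpath->query(
        '//body//text()[not(ancestor::script) and not(ancestor::style)]'
    );

    $groups = []; // depth => ['length' => total characters, 'first' => first text node]
    foreach ($textNodes as $node) {
        $text = trim($node->nodeValue);
        if ($text === '') {
            continue; // whitespace between tags
        }

        // Depth is the number of ancestors, counting the document node.
        $depth = 0;
        for ($n = $node->parentNode; $n !== null; $n = $n->parentNode) {
            $depth++;
        }

        if (!isset($groups[$depth])) {
            $groups[$depth] = ['length' => 0, 'first' => $node];
        }
        $groups[$depth]['length'] += strlen($text);
    }

    if ($groups === []) {
        return null;
    }

    // Assume the depth holding the most text is the article body.
    $best = null;
    foreach ($groups as $group) {
        if ($best === null || $group['length'] > $best['length']) {
            $best = $group;
        }
    }

    // Return the whole enclosing element so inline links stay in the text.
    $paragraph = $best['first']->parentNode->textContent;
    return trim(preg_replace('/\s+/', ' ', $paragraph));
}
```

Grouping by exact depth is cruder than grouping by similar depth: a long footer that happens to sit at the same depth as the article gets counted with it, so treat the result as a first guess.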

Of course, this method has its limitations, and no method will work on 100% of web pages. This is just one approach, and there are many other ways you might accomplish it. You may also want to look at similar past questions on this subject.

Cerin
  • This method has the best chance of working across disparate sites. Unless you know you're always parsing WordPress sites, for example, you can't make assumptions based on markup or structure. Heuristic analysis is the least error-prone method, but it can be defeated relatively easily by sites that make heavy use of Ajax or JavaScript to render or manipulate the site's contents. Facebook uses a dictionary of site types (WordPress, YouTube, etc.) and only falls back to analyzing the page when it doesn't have a predefined parser. Also, be aware of bit.ly links and other redirects/shorteners. – Karelzarath May 03 '11 at 16:25