Intelligently grab first paragraph/starting text

Question

I'd like to have a script where I can input a URL and it will intelligently grab the first paragraph of the article... I'm not sure where to begin other than just pulling text from within <p> tags. Do you know of any tips/tutorials on how to do this kind of thing?

update

For further clarification, I'm building a section of my site where users can submit links, as on Facebook: it'll grab an image from the linked site as well as text to go with the link. I'm using PHP and trying to determine the best method of doing this.

I say "intelligently" because I'd like to try to get content on that page that's important, not just the first paragraph, but the first paragraph of the most important content.

Don't do it with a regex; you're just in for a world of pain then. Besides, `<p>` doesn't have to be closed in normal HTML. – Joey, Jan 11 '11 at 15:14
On most sites the most important paragraph will be after the h1 tag. – pritaeas, Jan 11 '11 at 15:14
Are they only local URLs? I.e., do you know the structure of the page you're grabbing in advance? – Nabab, Jan 11 '11 at 15:45
*(related)* [Best Methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html/3577662#3577662) – Gordon, Jan 11 '11 at 15:49
This cannot be answered unless we know what constitutes "the first paragraph of the article". Also, you might want to elaborate on what you consider "intelligently". Without these clarifications, your best bet is familiarizing yourself with one of the parsers given in the *related* link. – Gordon, Jan 11 '11 at 15:50

Nabab · Answer 1 · 2011-01-11T17:22:32.243

2

If the page you want to grab is remote, or local but with a structure you don't know in advance, I'd say the best way to achieve this is to use PHP's DOM functions.

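function get_first_paragraph($url)
{
  /* Fetches the raw HTML; file_get_contents() returns false on failure */
  $page = file_get_contents($url);
  if ($page === false || $page === '') {
    return null;
  }
  $doc = new DOMDocument();
  /* Real-world HTML is rarely valid, so collect parser warnings instead of printing them */
  libxml_use_internal_errors(true);
  $doc->loadHTML($page);
  libxml_clear_errors();
  /* Gets all the paragraphs */
  $p = $doc->getElementsByTagName('p');
  /* Extracts the first one; item() returns null if the page has no <p> */
  $p = $p->item(0);
  /* Returns the paragraph's text content, or null if none was found */
  return $p ? $p->textContent : null;
}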

edited Jan 11 '11 at 17:22

answered Jan 11 '11 at 16:01

Nabab

what is `get_file_contents`? and why are you using it instead of `DOMDocument::loadHTMLFile()`? – Gordon Jan 11 '11 at 17:15
My bad: it's file_get_contents(), and I wasn't sure if DOMDocument::loadHTMLFile() works with remote URLs – Nabab Jan 11 '11 at 17:21
Turns out that's a bad way to do it because it requires valid end tags and other things – Ben Jan 11 '11 at 22:00
@Webnet no, it doesn't. DOM handles broken HTML fine. – Gordon Jan 12 '11 at 14:49

Gordon · Answer 2 · 2011-01-11T17:40:10.087

Short answer: you can't.

In order to have a PHP script "intelligently" fetch the "most important" content from a page, the script would have to understand the content on the page. PHP is no natural language processor, nor is this a trivial area of study. There might be some NLP toolkits for PHP, but I doubt it would be easy even then.

A solution that can be achieved with reasonable effort would be to fetch the entire page with an HTML parser and then look for elements with class names or ids commonly found in blog engines. You could also parse for hAtom microformats, or look for meta tags within the document and other more clearly defined information.
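As a rough illustration of that approach, here is a minimal sketch. The function name `extract_summary` and the list of container names are assumptions made for this example (only `entry-content` comes from hAtom; the others are guesses at common blog markup). It tries explicit metadata first, then a known container, then falls back to the first non-empty paragraph.

```php
<?php
// Hypothetical helper: returns a short summary for an HTML string, or null.
function extract_summary(string $html): ?string
{
    if (trim($html) === '') {
        return null;
    }
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate malformed markup quietly
    $doc->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);

    // 1. Explicit metadata, when the page author supplied it.
    $metaQueries = [
        '//meta[@property="og:description"]/@content',
        '//meta[@name="description"]/@content',
    ];
    foreach ($metaQueries as $query) {
        $nodes = $xpath->query($query);
        if ($nodes !== false && $nodes->length > 0) {
            $text = trim($nodes->item(0)->nodeValue);
            if ($text !== '') {
                return $text;
            }
        }
    }

    // 2. First paragraph inside a container with a commonly used class or id.
    //    Assumed names: "entry-content" is the hAtom class, the rest are guesses.
    $containers = ['entry-content', 'post-content', 'article-body', 'content'];
    foreach ($containers as $name) {
        $query = sprintf(
            '//*[contains(concat(" ", normalize-space(@class), " "), " %1$s ") or @id="%1$s"]//p',
            $name
        );
        foreach ($xpath->query($query) as $p) {
            $text = trim($p->textContent);
            if ($text !== '') {
                return $text;
            }
        }
    }

    // 3. Last resort: the first non-empty paragraph anywhere in the document.
    foreach ($doc->getElementsByTagName('p') as $p) {
        $text = trim($p->textContent);
        if ($text !== '') {
            return $text;
        }
    }
    return null;
}
```

The order of the three steps is a design choice: metadata is the most clearly defined source, so it wins when present, and the generic paragraph lookup is only a fallback.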

score 1 · Answer 3 · edited May 23 '17 at 11:50

I wrote a Python script a while ago to extract a web page's main article content. It uses a heuristic that scans all text nodes in a document, groups together nodes at similar depths, and then assumes the largest grouping is the main article.

Of course, this method has its limitations, and no method will work on 100% of web pages. This is just one approach, and there are many other ways you might accomplish it. You may also want to look at similar past questions on this subject.
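For readers working in PHP, the depth-grouping idea could be sketched roughly as follows. This is not the Python script mentioned above, only a simplified interpretation of it: the function name `first_paragraph_by_depth` is hypothetical, "similar depth" is reduced to "exactly the same depth", and the size of a group is measured in characters of text.

```php
<?php
// Hypothetical helper: guesses the main text by grouping text nodes by depth.
function first_paragraph_by_depth(string $html): ?string
{
    if (trim($html) === '') {
        return null;
    }
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate malformed markup quietly
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Text nodes under <body>, skipping script and style content.
    $textNodes = $xpath->query(
        '//body//text()[not(ancestor::script) and not(ancestor::style)]'
    );

    $groups = [];   // depth => ['chars' => int, 'blocks' => [node path => text]]
    foreach ($textNodes as $node) {
        $text = trim($node->nodeValue);
        if ($text === '') {
            continue;
        }
        $parent = $node->parentNode;
        $path   = $parent->getNodePath();       // e.g. /html/body/div[2]/p[1]
        $depth  = substr_count($path, '/');     // depth of the enclosing element
        $groups[$depth]['chars'] = ($groups[$depth]['chars'] ?? 0) + strlen($text);
        // Keep the enclosing element's full text once, in document order.
        $groups[$depth]['blocks'][$path] = trim($parent->textContent);
    }
    if ($groups === []) {
        return null;
    }

    // Assume the depth holding the most text is the article body.
    $best = null;
    foreach ($groups as $group) {
        if ($best === null || $group['chars'] > $best['chars']) {
            $best = $group;
        }
    }
    return reset($best['blocks']);   // first block at that depth
}
```

Grouping by exact depth is crude: a long comment thread or footer at a single depth can outweigh the article, so a real implementation would likely need extra signals.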

edited May 23 '17 at 11:50

Community

answered Jan 12 '11 at 13:54

Cerin

This method has the best chance of working across disparate sites. Unless you know you're always parsing WordPress sites, for example, you can't make assumptions based on markup or structure. Heuristic analysis is the least error-prone method, but it can be defeated relatively easily by sites that make heavy use of Ajax or Javascript to render or manipulate the site's contents. Facebook uses a dictionary of site types (WordPress, YouTube, etc) and only falls back to analyzing the page when it doesn't have a predefined parser. Also, be aware of bit.ly links and other redirects/shorteners. – Karelzarath May 03 '11 at 16:25
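The "dictionary of site types" idea from this comment might look something like the sketch below. How Facebook actually implements it is not known here; the function name `pick_extractor` and the shape of the registry (host suffix mapped to a callable) are assumptions made purely for illustration.

```php
<?php
// Hypothetical dispatcher: choose a site-specific extractor by host name,
// falling back to a generic one when the host is not in the registry.
function pick_extractor(string $url, array $extractors, callable $fallback): callable
{
    $host = strtolower((string) parse_url($url, PHP_URL_HOST));
    foreach ($extractors as $suffix => $extractor) {
        $dotted = '.' . $suffix;
        // Match the registered host itself or any of its subdomains.
        if ($host === $suffix || substr($host, -strlen($dotted)) === $dotted) {
            return $extractor;
        }
    }
    return $fallback;
}
```

Shortened links such as bit.ly would need to be resolved to their final URL before this lookup, for example by letting cURL follow redirects with `CURLOPT_FOLLOWLOCATION` and reading `CURLINFO_EFFECTIVE_URL` afterwards.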
