0

I use some RSS feeds. Some of them don't have a description for their articles.

To avoid showing just the title with no description for those articles, I would like to show, for example, the first two paragraphs of the actual article.

I experimented with `stripos` and `file_get_contents`, but I ran into a problem. On most pages it works fine, but on others it grabs the first `<p>` tag (which can be, for example, a paragraph in the sidebar), which is irrelevant to the article mentioned in the RSS feed.

Any idea about how to grab the main content from a URL strictly in PHP or JavaScript?

Thanks in advance.

netcoder
olaf36
  • If you need to capture it from whole pages you either must have vast resources and time or you're out of luck. Google tries this in their search results, and even they get it wrong a lot of the time... You could write something for a small subset of known pages using `DOM` or something, but don't expect any easy solutions. – Wrikken Jun 07 '11 at 20:53
  • Unfortunately, the web pages are not known. They don't have the same structure. – olaf36 Jun 07 '11 at 20:59

2 Answers

2

The first idea that comes to mind is to remove the tags from within each `<p>` and only use that paragraph if the length of its actual text is greater than a certain threshold. Maybe also check for a certain number of sentence-ending characters (`[.?!]`). If the paragraph doesn't have enough of them, go on to the next one.
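A minimal sketch of that idea in JavaScript, using regular expressions rather than a real DOM parser. The function names (`stripTags`, `pickParagraphs`) and the threshold values are assumptions chosen for illustration, not tested defaults; the regex will not cope with unclosed `<p>` tags, and HTML entities are not decoded.

```javascript
// Replace tags with spaces and collapse whitespace, leaving only the visible text.
function stripTags(html) {
  return html.replace(/<[^>]*>/g, ' ').replace(/\s+/g, ' ').trim();
}

// Return up to `limit` paragraphs whose plain text is long enough and
// contains enough sentence-ending characters. All thresholds are guesses to tune.
function pickParagraphs(html, { minLength = 80, minSentenceEnds = 2, limit = 2 } = {}) {
  const picked = [];
  const paragraphPattern = /<p\b[^>]*>([\s\S]*?)<\/p>/gi;
  for (const match of html.matchAll(paragraphPattern)) {
    const text = stripTags(match[1]);
    const sentenceEnds = (text.match(/[.?!]/g) || []).length;
    if (text.length >= minLength && sentenceEnds >= minSentenceEnds) {
      picked.push(text);
      if (picked.length === limit) break;
    }
    // Otherwise fall through and try the next paragraph.
  }
  return picked;
}
```

Short sidebar or navigation paragraphs tend to fail both checks, so they are skipped before the first long paragraph is accepted. Pages that put long text in the sidebar will still fool it.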

JoshuaRogers
  • Well, one could do it with regular expressions. I'd recommend http://simplehtmldom.sourceforge.net/ though. It makes working with the DOM extremely easy. – JoshuaRogers Jun 07 '11 at 21:24
0

You may also want to try scraping, that is, fetching a page and parsing its contents. http://simplehtmldom.sourceforge.net/ has a jQuery-like syntax and should quickly allow you to get just the content you want.

Scraping comes with its own caveats, as some sites may not look kindly on your harvesting of data and may block your attempts. You may want to look into the pluses and minuses of this method, but it is certainly powerful.

There's also info on scraping RSS feeds here: http://blog.5ubliminal.com/posts/rsscraping-scraping-rss-with-php-dom-xpath/, which I haven't tried.

EDIT: Wrikken's link is better than mine. Some good alternatives there.

Community
g_thom
  • Don't use SimpleHTMLDOM if you can avoid it: it's excruciatingly slower than the alternatives, see [here](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html-with-php/3577662#3577662) for a list on SO. – Wrikken Jun 07 '11 at 21:02
  • Thanks for the resource, Wrikken. Will look at that page now. – g_thom Jun 07 '11 at 21:05
  • I believe the solution might be a heuristic algorithm that checks where the HTML density is highest and grabs content from that part of the page. – olaf36 Jun 07 '11 at 21:24
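
One possible reading of the density heuristic from the last comment, sketched in JavaScript: split the page at block-level tags and keep the chunk where visible text makes up the largest share of the markup, weighted by how much text there is. The function names (`toPlainText`, `densestBlock`), the tag list, and the scoring formula are all assumptions made for illustration; a real page would likely need a proper DOM parser and more tuning.

```javascript
// Replace tags with spaces and collapse whitespace, leaving only the visible text.
function toPlainText(html) {
  return html.replace(/<[^>]*>/g, ' ').replace(/\s+/g, ' ').trim();
}

// Return the plain text of the chunk with the best density score.
function densestBlock(html) {
  // Scripts and styles are markup-free text that would distort the ratio.
  const cleaned = html.replace(/<(script|style)\b[\s\S]*?<\/\1>/gi, '');
  // Cut the page at opening and closing block-level tags (an assumed, incomplete list).
  const blockTag = /<\/?(?:div|section|article|aside|nav|header|footer|ul|ol|table|td)\b[^>]*>/i;
  let best = { text: '', score: 0 };
  for (const chunk of cleaned.split(blockTag)) {
    const text = toPlainText(chunk);
    if (text.length === 0) continue;
    // Share of the chunk that is visible text rather than tags and attributes.
    const density = text.length / chunk.length;
    // Weight by length so a tiny tag-free fragment does not win.
    const score = text.length * density;
    if (score > best.score) best = { text, score };
  }
  return best.text;
}
```

Link-heavy areas such as menus score low because most of their characters sit inside tags, while article bodies are mostly plain text. The result could then be cut down to its first two paragraphs.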