I'd like to get the main image for an article, much like Facebook does when you post a link (but without the step where you choose an image). The only data we have to work with is the whole page's HTML, held in a variable. The page and URL will be different every time this function runs.

Are there any libraries or classes that are particularly good at extracting the main body of content, much like Instapaper does, that would help here?

PaulAdamDavis
  • Please explain what you mean by "the main body of content" and "the main image for an article". How do you decide what's "main"? The first or biggest image in the DIV that takes the most space on screen? – rik Jan 13 '11 at 12:49
  • When I say the main body of content, I mean the article itself: the article, the news story. And by the main image, I generally mean the image that's next to the article title. – PaulAdamDavis Jan 13 '11 at 12:56
  • possible duplicate of [Intelligently grab first paragraph/starting text](http://stackoverflow.com/questions/4659057/intelligently-grab-first-paragraph-starting-text) – Gordon Jan 13 '11 at 13:03
  • Again, what do you mean by "the article itself, the article, the news story"? The dupe is about "first paragraph". That's something one can express in code. Your personal opinion of "main" or "the" cannot. – rik Jan 13 '11 at 16:07
  • I swear this question pops up every other day... – dqhendricks Jan 13 '11 at 16:39

1 Answer

You can use the PHP DOM classes to parse an HTML page. They make it easy to grab the first image and the h1 text.
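As a rough sketch of that idea (the function name `firstHeadingAndImage` is made up for illustration, and it assumes "main image" simply means the first `<img>` in document order, which will not hold for every page):

```php
<?php
// Hypothetical helper: returns the text of the first <h1> and the src of the
// first <img>, or null for whichever one is missing.
function firstHeadingAndImage(string $html): array
{
    $result = ['title' => null, 'image' => null];
    if (trim($html) === '') {
        return $result; // loadHTML() rejects an empty string on PHP 8
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings quietly
    // instead of letting them be emitted as PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // getElementsByTagName() returns matches in document order.
    $h1  = $doc->getElementsByTagName('h1')->item(0);
    $img = $doc->getElementsByTagName('img')->item(0);

    if ($h1 !== null) {
        $result['title'] = trim($h1->textContent);
    }
    if ($img !== null && $img->hasAttribute('src')) {
        $result['image'] = $img->getAttribute('src');
    }
    return $result;
}
```

Note that the `src` comes back exactly as written in the page, so a relative path would still need to be resolved against the page URL.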

You could also get more advanced with it: for example, cycle through the p tags to find the first one with more than X characters and use that as the main text. Or you could read the meta tags and use the description.
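Something along these lines might work for both variations. The helper names (`parseHtml`, `firstLongParagraph`, `metaDescription`) and the default threshold of 100 are assumptions chosen for the example, not anything standard:

```php
<?php
// Hypothetical helper: parse an HTML string while suppressing parser warnings.
function parseHtml(string $html): DOMDocument
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $doc;
}

// Returns the text of the first <p> longer than $minLength, or null if none is.
function firstLongParagraph(DOMDocument $doc, int $minLength = 100): ?string
{
    foreach ($doc->getElementsByTagName('p') as $p) {
        // Collapse whitespace so source indentation does not inflate the length.
        $text = trim(preg_replace('/\s+/', ' ', $p->textContent));
        // strlen() counts bytes; swap in mb_strlen() if mbstring is available
        // and multi-byte text matters.
        if (strlen($text) > $minLength) {
            return $text;
        }
    }
    return null;
}

// Returns the content of <meta name="description">, or null if absent or empty.
function metaDescription(DOMDocument $doc): ?string
{
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if (strtolower($meta->getAttribute('name')) === 'description') {
            $content = trim($meta->getAttribute('content'));
            return $content !== '' ? $content : null;
        }
    }
    return null;
}
```

The right threshold depends on the site, so treat the length check as a heuristic to tune rather than a rule.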

There are many different ways you could go with this, but PHP DOM is probably the place to start.

http://us.php.net/manual/en/book.dom.php

dqhendricks
  • Also, if the page is part of a feed, you may want to grab this info directly from the RSS XML file, although the code would have to be pretty smart to do that correctly. – dqhendricks Jan 13 '11 at 16:45