
I would like to know how other developers reliably extract the first image from the main content of a blog post, given the post URL from an RSS feed. I am considering this approach because the RSS feed does not seem to include an image URL for each post/blog item. I do keep seeing this, though:

<img src="http://feeds.feedburner.com/~r/CookingLight/EatingSmart/~4/sIG3nePOu-c" />

but it is only a 1px image. Does it have any relevance to the feed item, or can I somehow convert it to the actual image? Here's the RSS feed: http://feeds.cookinglight.com/CookingLight/EatingSmart?format=xml

Anyway, here's my attempt to extract the image using the URL from the feed:

function extract_first_image( $url ) {
  $content = file_get_contents($url);
  if ($content === false) {
    return null; // The page could not be fetched.
  }

  // Narrow the html to everything after the opening tag of the main content div.
  // source: http://stackoverflow.com/questions/15643710/php-get-a-div-from-page-x
  $PreMain = explode('<div id="main-content"', $content, 2);
  if (!isset($PreMain[1])) {
    return null; // No main-content div on this page.
  }

  // Regex that finds the first img tag. Nested divs make splitting on "</div>"
  // unreliable, so search the whole remainder instead of a fixed chunk index.
  if (!preg_match('/<img[^>]+src=[\'"]([^\'"]+)[\'"][^>]*>/i', $PreMain[1], $matches)) {
    return null; // No image found.
  }

  // Return the img in html format.
  return $matches[0];
}

$url = 'http://www.cookinglight.com/eating-smart/nutrition-101/foods-that-fight-fat'; //Sample URL from the feed.
echo extract_first_image($url);

Obvious downside of this function: it only works if <div id="main-content" is found in the HTML. A feed whose pages have a different structure would need another explode written for it. It's very static.
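
One way to make this less tied to string splitting is to parse the HTML and pass the container as a parameter. The following is only a minimal sketch: the helper name first_image_src, its $containerXPath parameter, and the sample markup are made up for illustration, and it assumes PHP's DOM extension is available.

```php
<?php
// Hypothetical helper: return the src of the first <img> inside the element
// matched by $containerXPath, or null when nothing matches.
function first_image_src($html, $containerXPath = '//div[@id="main-content"]')
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // "(...)[1]" selects the first matching <img> in document order.
    $nodes = $xpath->query('(' . $containerXPath . '//img[@src])[1]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return $nodes->item(0)->getAttribute('src');
}

// Made-up sample markup so the sketch runs without a network request.
$html = '<html><body><img src="logo.png"><div id="main-content"><div><p>Intro</p></div>'
      . '<img src="http://example.com/first.jpg" alt=""><img src="second.jpg"></div></body></html>';

echo first_image_src($html), "\n";
```

A site with a different layout would then only need a different XPath expression passed as the second argument, not a new chain of explode calls.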

Load time is also worth mentioning. When I loop through all the items in the feed, it takes even longer.

I hope I've made my points clear. Feel free to drop in any ideas that could help optimize the solution.

Dbx
  • Is [this](http://img4.cookinglight.com/i/2012/09/1209-corbis-diet-s.jpg?150:150) the image you are looking for at [this url](http://www.cookinglight.com/eating-smart/nutrition-101/foods-that-fight-fat)? – MattT Aug 17 '14 at 07:40
  • Yes. Were you able to get that from the RSS feeds? @MattT – Dbx Aug 18 '14 at 03:39

1 Answer


The image URLs are in the RSS file, so you can get them just by parsing the XML. Each <item> element contains a <media:group> element that contains a <media:content> element. The URL of the image for that item is in the "url" attribute of the <media:content> element. Here is some basic PHP code for extracting the image URLs into an array:

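$xml = simplexml_load_file("http://feeds.cookinglight.com/CookingLight/EatingSmart?format=xml");
if ($xml === false) {
    die("Could not load or parse the feed.");
}

$imageUrls = array();

foreach ($xml->channel->item as $item)
{
    // The media:* elements are namespaced, so select them by their prefix.
    $media = $item->children('media', true);
    // Skip items that have no <media:group>/<media:content> element.
    if (isset($media->group->content)) {
        $imageUrls[] = (string) $media->group->content->attributes()->url;
    }
}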

Keep in mind, though, that the media doesn't necessarily have to be an image. It can be a video or an audio recording. There might even be more than one <media:group>. You can check the "type" attribute of the <media:content> element to see what it is.
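
As a rough sketch of that check, the loop below walks every <media:group> and keeps only the entries whose "type" starts with "image/". The feed fragment is made up for illustration; the attributes in the real feed may differ.

```php
<?php
// Made-up feed fragment for illustration only.
$rss = <<<XML
<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <item>
      <title>Example item</title>
      <media:group>
        <media:content url="http://example.com/clip.mp4" type="video/mp4" />
        <media:content url="http://example.com/photo.jpg" type="image/jpeg" />
      </media:group>
    </item>
  </channel>
</rss>
XML;

$xml = simplexml_load_string($rss);
$imageUrls = array();

foreach ($xml->channel->item as $item) {
    // An item may carry more than one <media:group>.
    foreach ($item->children('media', true)->group as $group) {
        foreach ($group->children('media', true)->content as $content) {
            $attrs = $content->attributes();
            // Keep only entries whose MIME type marks them as images.
            if (strpos((string) $attrs->type, 'image/') === 0) {
                $imageUrls[] = (string) $attrs->url;
            }
        }
    }
}

print_r($imageUrls);
```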

MattT
  • I didn't realize I could actually download the file. Now I understand what you mean about the media group and content; it was sitting there all along. Other RSS files might have a different structure, though. I have another RSS feed that doesn't have a media element. I guess I'll have to add a condition: use an image from the description if a valid one exists there, or else get the image from the media group (if available in the RSS). – Dbx Aug 19 '14 at 04:14
  • I'm going to go ahead and accept this as the answer. It led me to an even clearer answer to my question :) – Dbx Aug 27 '14 at 02:14
  • RSS 2.0 has the `enclosure` tag as an optional sub-element of the `item` tag for media. `media` does not exist. https://www.rssboard.org/rss-specification#ltenclosuregtSubelementOfLtitemgt – Franz Holzinger Mar 22 '21 at 21:52
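
The fallback idea from the comments could be sketched roughly as follows: try <media:content> first, then an RSS 2.0 <enclosure>, then the first <img> in the description. The helper name item_image_url and the sample feed are hypothetical, and the order of the checks is just one reasonable choice.

```php
<?php
const MEDIA_NS = 'http://search.yahoo.com/mrss/'; // Media RSS namespace URI

// Hypothetical helper: return the first image URL found for a feed item, or null.
function item_image_url(SimpleXMLElement $item)
{
    // 1. Media RSS: <media:content>, directly under <item> or inside <media:group>.
    $media = $item->children(MEDIA_NS);
    $contents = array();
    foreach ($media->content as $content) {
        $contents[] = $content;
    }
    foreach ($media->group as $group) {
        foreach ($group->children(MEDIA_NS)->content as $content) {
            $contents[] = $content;
        }
    }
    foreach ($contents as $content) {
        $attrs = $content->attributes();
        if (strpos((string) $attrs->type, 'image/') === 0) {
            return (string) $attrs->url;
        }
    }

    // 2. Plain RSS 2.0: <enclosure url="..." type="image/...">.
    foreach ($item->enclosure as $enclosure) {
        $attrs = $enclosure->attributes();
        if (strpos((string) $attrs->type, 'image/') === 0) {
            return (string) $attrs->url;
        }
    }

    // 3. Last resort: first <img> embedded in the description HTML.
    if (preg_match('/<img[^>]+src=[\'"]([^\'"]+)[\'"]/i', (string) $item->description, $m)) {
        return $m[1];
    }
    return null;
}

// Made-up feed with one item per case, for illustration only.
$rss = <<<XML
<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <item>
      <media:group><media:content url="http://example.com/a.jpg" type="image/jpeg" /></media:group>
    </item>
    <item>
      <enclosure url="http://example.com/b.png" length="1234" type="image/png" />
    </item>
    <item>
      <description>&lt;p&gt;&lt;img src="http://example.com/c.gif" /&gt; Text&lt;/p&gt;</description>
    </item>
  </channel>
</rss>
XML;

$feed = simplexml_load_string($rss);
foreach ($feed->channel->item as $item) {
    echo item_image_url($item), "\n";
}
```

Only when all three checks come back empty would it be necessary to fetch the linked page itself, which should also help with the load time mentioned in the question.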