Trying to Parse Only the Images from an RSS Feed

Question

First, I am a php newbie. I have looked at the question and solution here. For my needs however, the parsing does not go deep enough into the various articles.

A small sampling of my rss feed reads like this:

 <channel>
 <atom:link href="http://mywebsite.com/rss" rel="self" type="application/rss+xml" />
 <title>My Web Site</title>
 <description>My Feed</description>
 <link>http://mywebsite.com/</link>

 <image>
 <url>http://mywebsite.com/views/images/banner.jpg</url>
 <title>My Title</title>
 <link>http://mywebsite.com/</link>
 <description>Visit My Site</description>
 </image>

 <item>
 <title>Article One</title>
 <guid isPermaLink="true">http://mywebsite.com/details/e8c5106</guid>
 <link>http://mywebsite.com/geturl/e8c5106</link>
 <comments>http://mywebsite.com/details/e8c5106#comments</comments>     
 <pubDate>Wed, 09 Jan 2013 02:59:45 -0500</pubDate> 
 <category>Category 1</category>    
 <description>
      <![CDATA[<div>
      <img src="http://mywebsite.com/myimages/1521197-main.jpg" width="120" border="0"  />  
      <ul><li>Poster: someone's name;</li>
      <li>PostDate: Tue, 08 Jan 2013 21:49:35 -0500</li>
      <li>Rating: 5</li>
      <li>Summary:Lorem ipsum dolor </li></ul></div><div style="clear:both;">]]>
      </description>
 </item> 
 <item>..

The image links that I want to parse out are the ones way inside each Item > Description

The code in my php file reads:

     <?php
 $xml = simplexml_load_file('http://mywebsite.com/rss?t=2040&dl=1&i=1&r=ceddfb43483437b1ed08ab8a72cbc3d5');
 $imgs = $xml->xpath('/item/description/img');
 foreach($imgs as $image) {
      echo $image->src;
 }
 ?>

Can someone please help me figure out how to configure the php code above?

Also a very newbie question... once I get the resulting image urls, how can I display the images in a row on my html?

Many thanks!!!

Hernando

score 3 · Accepted Answer · edited May 23 '17 at 12:04

The <img> tags inside that RSS feed are not actually elements of the XML document, contrary to the syntax highlighting on this site - they are just text inside the <description> element which happens to contain the characters < and >.

The string <![CDATA[ tells the XML parser that everything from there until it encounters ]]> is to be treated as a raw string, regardless of what it contains. This is useful for embedding HTML inside XML, since the HTML tags wouldn't necessarily be valid XML. It is equivalent to escaping the whole HTML (e.g. with htmlspecialchars) so that the <img> tags would look like &lt;img&gt;. (I went into more technical details on another answer.)
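
To see that equivalence in practice, here is a minimal sketch. The two `<description>` strings and the `a.jpg` file name are made up for illustration; only `simplexml_load_string()` and `xpath()` are real API calls.

```php
<?php
// Two hypothetical <description> elements: one wrapped in CDATA, one entity-escaped.
$cdata   = '<description><![CDATA[<img src="a.jpg" />]]></description>';
$escaped = '<description>&lt;img src="a.jpg" /&gt;</description>';

$fromCdata   = simplexml_load_string($cdata);
$fromEscaped = simplexml_load_string($escaped);

// Both forms produce the same plain string once cast.
var_dump((string)$fromCdata === (string)$fromEscaped); // bool(true)

// Neither document contains an <img> element, so XPath finds nothing.
$imgs = $fromCdata->xpath('//img');
var_dump(empty($imgs)); // bool(true)
```

This is why an XPath expression ending in `img` comes back empty on the feed: as far as the XML parser is concerned, there is no such element.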

So to extract the images from the RSS requires two steps: first, get the text of each <description>, and second, find all the <img> tags in that text.

$xml = simplexml_load_file('http://mywebsite.com/rss?t=2040&dl=1&i=1&r=ceddfb43483437b1ed08ab8a72cbc3d5');

$descriptions = $xml->xpath('//item/description');
foreach ( $descriptions as $description_node ) {
    // The description may not be valid XML, so use a more forgiving HTML parser mode
    $description_dom = new DOMDocument();
    $description_dom->loadHTML( (string)$description_node );

    // Switch back to SimpleXML for readability
    $description_sxml = simplexml_import_dom( $description_dom );

    // Find all images, and extract their 'src' param
    $imgs = $description_sxml->xpath('//img');
    foreach($imgs as $image) {
        echo (string)$image['src'];
    }
}
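
For trying the two steps without fetching the real feed, the following is a self-contained sketch of the same idea. It is a variation, not the code above: the inline feed is a trimmed, hypothetical copy of the sample from the question (the second item is invented), it reads the `src` attributes through the DOM directly instead of switching back to SimpleXML, and it collects the URLs in an array rather than echoing them.

```php
<?php
// Hypothetical two-item feed, trimmed down from the sample in the question.
$feed = <<<'XML'
<rss version="2.0">
 <channel>
  <title>My Web Site</title>
  <item>
   <title>Article One</title>
   <description><![CDATA[<div><img src="http://mywebsite.com/myimages/1521197-main.jpg" width="120" /></div>]]></description>
  </item>
  <item>
   <title>Article Two</title>
   <description><![CDATA[<p>No image here</p>]]></description>
  </item>
 </channel>
</rss>
XML;

// Collect HTML parse warnings internally instead of printing them.
libxml_use_internal_errors(true);

$xml  = simplexml_load_string($feed);
$urls = [];
foreach ($xml->xpath('//item/description') as $description_node) {
    // Step 1: the description is plain text as far as the XML parser is concerned.
    $html = trim((string)$description_node);
    if ($html === '') {
        continue; // loadHTML() rejects an empty string
    }

    // Step 2: parse that text as HTML and read each <img> tag's src attribute.
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    foreach ($dom->getElementsByTagName('img') as $img) {
        $urls[] = $img->getAttribute('src');
    }
}
libxml_clear_errors();

print_r($urls);
```

Keeping the URLs in `$urls` makes it easy to build the HTML output in a separate step afterwards.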

Wow... this is way more complex than I imagined... I tried it and I am getting an error that reads `Fatal error: Call to undefined function simplexml_load_dom()` Thank you very much for your help! — Hernandito, Jan 11 '13 at 20:52
Oh, looking at my code, I've spotted another mistake - this was only intended as a quick example of code structure rather than perfectly ready to run code - `$image->src` should be `$image['src']` — IMSoP, Jan 12 '13 at 01:18
In case it's not clear why that's wrong: `$image->src` would be appropriate for getting a child tag, like `<img><src>http://example.com/foo.jpeg</src></img>`; in this case, we're getting an attribute, so `$image['src']`, e.g. `<img src="http://example.com/foo.jpeg" />` — IMSoP, Jan 12 '13 at 01:22
It worked perfectly...!! Thank you for your help. I would never have figured this on my own. — Hernandito, Jan 12 '13 at 02:46

score 0 · Answer 2 · answered Jan 09 '13 at 21:37

I don't have much experience with xPath, but you could try the following:

$imgs = $xml->xpath('//item//img');

This will select all img elements which are inside item elements, regardless of whether there are other elements in between. The leading double slash searches for item anywhere in the document, not just directly under the root; a relative path such as item//img only looks at children of the element xpath() is called on. Otherwise, you'd need something like /rss/channel/item....
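
How the different path forms behave can be checked on a small inline document. This is only a sketch: the feed below is a hypothetical minimal one with the same nesting as in the question (rss > channel > item), and it relies on SimpleXML evaluating a relative path from the element that xpath() is called on.

```php
<?php
// Hypothetical minimal feed with the same nesting as the question's feed.
$xml = simplexml_load_string(
    '<rss><channel><item><title>One</title></item><item><title>Two</title></item></channel></rss>'
);

$counts = [];
foreach (['/item', 'item', '//item', '/rss/channel/item', 'channel/item'] as $path) {
    $matches = $xml->xpath($path);
    // xpath() can return false on error, so guard before counting.
    $counts[$path] = is_array($matches) ? count($matches) : 0;
}

print_r($counts);
```

With this nesting, `/item` and `item` match nothing when called on the root element, while `//item`, `/rss/channel/item` and `channel/item` each match both items.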

As for displaying the images: Just output <img>-tags followed by line-breaks, like so:

foreach($imgs as $image) {
    echo '<img src="' . $image->src . '" /><br />';
}

The preferred way would be to use CSS instead of <br>-tags, but I think they are simpler for a start.
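
As a rough sketch of the CSS route, the images can be wrapped in a single flex container so they sit side by side. The `$urls` array is a hypothetical stand-in for whatever URLs were extracted from the feed (the second URL is invented), and the inline style is just one of several ways to get a row.

```php
<?php
// Hypothetical list of image URLs, e.g. collected from the feed beforehand.
$urls = [
    'http://mywebsite.com/myimages/1521197-main.jpg',
    'http://mywebsite.com/myimages/another.jpg',
];

// A flex container lays its children out in a row by default.
$html = '<div style="display: flex; gap: 8px;">';
foreach ($urls as $url) {
    // Escape the URL so characters such as & or " cannot break the attribute.
    $html .= '<img src="' . htmlspecialchars($url, ENT_QUOTES) . '" alt="" />';
}
$html .= '</div>';

echo $html;
```

Moving the style into a stylesheet class would be cleaner on a real page; the inline attribute just keeps the example in one file.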

lethal-guitar

Thank you... no luck though. To try your theory, I even tried this, but it did not return anything: `$imgs = $xml->xpath('image'); foreach($imgs as $image) { echo $image->url; }` This should have returned the very first image, but it was empty. – Hernandito Jan 09 '13 at 21:51
Take a look at the note about the $filename parameter: http://php.net/manual/en/function.simplexml-load-file.php Maybe this applies to you, since you're passing parameters in the URL? – lethal-guitar Jan 09 '13 at 21:57
You could also load the file into a string and output it again, to verify that PHP fetches it correctly: `$data = file_get_contents(/*Your URL*/); echo $data;` – lethal-guitar Jan 09 '13 at 21:58
By adding two leading slashes to the xpath, like this: `//image` and `->url`, it worked for the top-level image. But it still does not work for the `//img` and `->src` that is inside description. – Hernandito Jan 09 '13 at 22:01
I tried `$data = file_get_contents(/*Your URL*/); echo $data;` and all the images and text show up properly... There is a little scramble, I am guessing because of the <![CDATA[, but looking at the code I do see the <img> inside <description>. – Hernandito Jan 09 '13 at 22:10
