I am parsing an RSS feed to JSON using PHP.

I am using the code below.

My JSON output contains the description data from each item element, but the title and link data are not being extracted.

  • The problem is either an incorrect CDATA section somewhere in the feed, or my code is not parsing it correctly.

The XML is here:

$blog_url = 'http://www.blogdogarotinho.com/rssfeedgenerator.ashx';

$rawFeed = file_get_contents($blog_url);
$xml=simplexml_load_string($rawFeed,'SimpleXMLElement', LIBXML_NOCDATA);

// step 2: extract the channel metadata
$articles = array();    

// step 3: extract the articles

foreach ($xml->channel->item as $item) {
    $article = array();

    $article['title'] = (string)trim($item->title);
    $article['link'] = $item->link;      
    $article['pubDate'] = $item->pubDate;
    $article['timestamp'] = strtotime($item->pubDate);
    $article['description'] = (string)trim($item->description);
    $article['isPermaLink'] = $item->guid['isPermaLink'];        

    $articles[$article['timestamp']] = $article;
}

echo json_encode($articles);
complex857
Rajnish Mishra
  • If I run your example, my output contains a bunch of `<![CDATA[` tags. However, I'm not sure if you are seeing the same thing. Do you want them removed? Or are you not seeing their content at all? – complex857 Jun 01 '14 at 17:33
  • I am not getting anything for title and link. It gives me nothing. – Rajnish Mishra Jun 01 '14 at 17:43
  • I think this could be because of different PHP/libxml versions (I'm running 5.5.12 here). I tried it on PHP 5.4.29 and 5.3.23 too but got the same result. What PHP version are you on? – complex857 Jun 01 '14 at 17:54
  • On my localhost I am using 5.5.6, and the same on the server. After parsing the XML to JSON I get a blank value for both link and title, on localhost and on the server. I also tried downloading the XML to a file and parsing that, with the same result. One thing I tried was putting a `<br>` tag after <![CDATA[ in the XML for the title, and then I was able to parse it successfully, but that still does not solve the issue. – Rajnish Mishra Jun 01 '14 at 18:06
  • Note that [`trim()`](http://php.net/trim) always returns a string, so the `(string)` in `(string)trim($item->title)` is doing nothing; if anything, you would need to cast its *input*, which would be `trim((string)$item->title)`, although it will probably do that implicitly anyway. You should however cast your other values, e.g. `$article['link'] = (string)$item->link;` before passing them off to other functions. – IMSoP Jun 01 '14 at 18:37
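
The casting advice in the comment above can be sketched as follows. This is a minimal illustration, not the asker's final code: the inline `<item>` string and the `example.com` URL are hypothetical stand-ins for one feed entry.

```php
<?php
// Hypothetical one-element sample standing in for a feed item.
$item = simplexml_load_string(
    '<item><title> Hello </title><link>http://example.com/a</link></item>'
);

// Cast to string first, then trim. The array then holds plain strings
// instead of SimpleXMLElement objects, which json_encode() would
// serialize as objects rather than as simple string values.
$article = array(
    'title' => trim((string) $item->title),
    'link'  => (string) $item->link,
);

echo json_encode($article), "\n";
```

Casting at the point of extraction keeps every later call (`strtotime()`, `json_encode()`, array keys) working on ordinary strings.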

1 Answer

I think you are just the victim of the browser hiding the tags. Let me explain: your input feed doesn't really have <![CDATA[ ]]> sections in it. The < and > characters are actually entity-encoded in the raw source of the RSS stream. Hit Ctrl+U on the RSS link in your browser and you will see:

<?xml version="1.0" encoding="utf-16"?>
<rss xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" version="2.0">
  <channel>
    <description>Blog do Garotinho</description>
    <item>
      <description>&lt;![CDATA[&lt;br&gt;
          Fico impressionado com a hipocrisia e a falsidade de certos políticos....]]&gt;
      </description>
      <link>&lt;![CDATA[http://www.blogdogarotinho.com.br/lartigo.aspx?id=16796]]&gt;</link>
...
      <title>&lt;![CDATA[A bancada dos caras de pau]]&gt;</title>
    </item>

As you can see, the <title> for example starts with &lt;, which turns into < when SimpleXML returns it for your JSON data. If you then look at the printed JSON data in a browser, the browser sees the following:

"title":"<![CDATA[A bancada dos caras de pau]]>"

This will not be rendered, because the browser treats it as the inside of a tag. The description seems to show up because it has a <br> tag in it at some point, which ends the first "tag", so you can see the rest of the output.

If you hit Ctrl+U on the output page you should see it printed as expected (I used a command-line PHP script myself and did not notice this at first).
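
One way to make the hidden values visible without opening the page source is sketched below. It is only an illustration of the effect described above: the `$articles` array is a hypothetical stand-in for the parsed feed, and the header line assumes the script is served over HTTP rather than run from the command line.

```php
<?php
// Hypothetical value, as SimpleXML returns it for the <title> element.
$articles = array(
    array('title' => '<![CDATA[A bancada dos caras de pau]]>'),
);

$json = json_encode($articles);

// Option 1: declare the real content type, so the browser does not
// try to interpret "<![CDATA[...]]>" as markup.
if (PHP_SAPI !== 'cli' && !headers_sent()) {
    header('Content-Type: application/json; charset=utf-8');
}

// Option 2 (debugging inside an HTML page): escape the markup
// characters before printing, so "<" is shown instead of parsed.
$visible = htmlspecialchars($json, ENT_QUOTES, 'UTF-8');

echo $visible, "\n";
```

Either option only changes how the output is displayed. The CDATA markers are still part of the data and still need to be removed before the JSON is sent to a client.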

Try this demo:

You could try to get rid of these by stripping them after the parse with a simple preg_replace():

function clean_cdata($str) {
    return preg_replace('#(^\s*<!\[CDATA\[|\]\]>\s*$)#sim', '', (string)$str);
}

This should take care of the CDATA markers if they are at the start or the end of the individual elements. You can call this inside the foreach() loop like this:

// ....
$article['title'] = clean_cdata($item->title);
// ....
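
Put together, the loop might look like the sketch below. It assumes a small inline feed in place of the live URL; the sample keeps only the title and link elements quoted above, and everything else about the real feed is left out.

```php
<?php
// Same regex as above: strips a leading "<![CDATA[" and a trailing "]]>".
function clean_cdata($str) {
    return preg_replace('#(^\s*<!\[CDATA\[|\]\]>\s*$)#sim', '', (string)$str);
}

// Hypothetical one-item feed, entity-encoded the same way as the real one.
$rawFeed = <<<'XML'
<rss version="2.0">
  <channel>
    <item>
      <title>&lt;![CDATA[A bancada dos caras de pau]]&gt;</title>
      <link>&lt;![CDATA[http://www.blogdogarotinho.com.br/lartigo.aspx?id=16796]]&gt;</link>
    </item>
  </channel>
</rss>
XML;

$xml = simplexml_load_string($rawFeed);

$articles = array();
foreach ($xml->channel->item as $item) {
    // clean_cdata() casts to string internally, so the array holds plain strings.
    $articles[] = array(
        'title' => clean_cdata($item->title),
        'link'  => clean_cdata($item->link),
    );
}

echo json_encode($articles), "\n";
```

The cleanup runs after parsing because the markers only become literal `<![CDATA[` text once the XML parser has decoded the entities.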
complex857
  • Yeah, whatever is generating that XML is definitely doing it wrong. Ampersand-encoding (`&gt;` etc.) and CDATA are alternative escape mechanisms, but it's somehow using both at once. – IMSoP Jun 01 '14 at 18:39
  • Yup, I'm not sure either what the idea was behind including the `<![CDATA[` tags and *then* entity-encoding the content with them included. However, the XML file itself is valid; it just has these pointless tags. – complex857 Jun 01 '14 at 18:44
  • So how can I get clean JSON out of it? Do I need some string replacement? I am basically a Java developer and this PHP thing is driving me out of my mind. I have seen your link, but I need to send clean JSON to the client, I mean without the CDATA part. – Rajnish Mishra Jun 01 '14 at 18:59
  • Well, I'm afraid yes, I would probably try that too. You could try to contact the source of your RSS feed and ask for an explanation; maybe we are missing something. – complex857 Jun 01 '14 at 19:01
  • I would like to accept your answer but I am still confused, sorry for that. It still deserves upvotes, thanks. If you could suggest something I would appreciate it; I have spent my whole day on this CDATA issue. Thanks. – Rajnish Mishra Jun 01 '14 at 19:04
  • Well, I've added an example that relies on a simple regexp to clean these. I'm not sure how robust it is, but it seems to work fine with the current input. – complex857 Jun 01 '14 at 19:30
  • I have managed to develop my own function. Thanks for the help. – Rajnish Mishra Jun 02 '14 at 17:06
  • @IMSoP: No, it's just HTML *inside* XML, which is quite common with (RSS) feeds. It's so-called encoded content. So you can not say that this is right. Not that this won't come with its own problems, but parsing is pretty easy: just parse the node value with an HTML parser again. Done. – hakre Jun 19 '14 at 21:08
  • @hakre No, look again. It's *XML* inside XML (with `<![CDATA[` escaped as `&lt;![CDATA[`) and then HTML inside that (decode the `&lt;br&gt;` once and you have `<br>`, but that's *inside* the `<![CDATA[`, which would prevent the entities being decoded). So, the same method would work, but you have to unwrap *twice*. My strong suspicion is that encoded content was the *intention*, but the encoder got it wrong and double-wrapped it. – IMSoP Jun 20 '14 at 08:35
  • Okay, it's three times. Which can happen with feeds. There is a longer write-up about this and similar issues with feeds at http://www.intertwingly.net/wiki/pie/EscapedHtmlDiscussion, and there was also another, older write-up by Sam Ruby, but I couldn't dig it up in time yet. – hakre Jun 20 '14 at 09:18
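
The layered unwrapping discussed in these comments could be sketched like this. It is an illustration under assumptions, not the asker's eventual function: `$value` is a hypothetical description string as SimpleXML would return it (the XML entities already decoded once), and `strip_tags()` stands in for the HTML parser hakre suggests, which only works if plain text is the goal.

```php
<?php
// Hypothetical description value after the XML parser has decoded it once.
$value = "<![CDATA[<br>\n Fico impressionado com a hipocrisia e a falsidade de certos políticos....]]>\n";

// Second layer: remove the literal CDATA wrapper that the first decode exposed.
$html = preg_replace('#^\s*<!\[CDATA\[(.*)\]\]>\s*$#s', '$1', $value);

// Third layer: the payload is HTML. Here it is reduced to plain text;
// use a real HTML parser instead if the markup has to be preserved.
$text = trim(html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8'));

echo $text, "\n";
```

Each step removes exactly one layer, which makes it easier to see where a feed that is wrapped differently would need a different step.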