140

I'm currently using Magpie RSS but it sometimes falls over when the RSS or Atom feed isn't well formed. Are there any other options for parsing RSS and Atom feeds with PHP?

Dan Lowe
carson
  • 1
    There is one problem with this request: most feed readers use PHP's core XML readers, and if the XML is not well-formed as the XML standard requires, they will fall over. You could look at parsers that use a text reader instead of an XML reader, but the load on the server will increase dramatically. I know this is answered; I'm just making people aware of the drawbacks of using XML feed readers. – Barkermn01 Sep 10 '13 at 09:29
  • 1
    Never try to parse invalid XML. Blame the source. – Lothar Nov 27 '14 at 04:03

10 Answers

172

I've always used the SimpleXML functions built in to PHP to parse XML documents. It's one of the few generic parsers out there that has an intuitive structure to it, which makes it extremely easy to build a meaningful class for something specific like an RSS feed. Additionally, it will detect XML warnings and errors, and upon finding any you could simply run the source through something like HTML Tidy (as ceejayoz mentioned) to clean it up and attempt it again.
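
As a minimal sketch of the error detection mentioned above: libxml can collect parse errors instead of emitting PHP warnings, so the caller can decide whether to reject the feed or attempt a cleanup pass first. The function name `loadFeedOrErrors` and the sample XML strings are hypothetical, not part of any library.

```php
<?php
// Hypothetical helper: returns array(SimpleXMLElement|null, list of error messages).
function loadFeedOrErrors(string $xml): array
{
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $feed = simplexml_load_string($xml);

    $messages = array();
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    // Restore the previous error-handling mode for the rest of the script.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return array($feed === false ? null : $feed, $messages);
}

list($feed, $errors) = loadFeedOrErrors('<rss><channel><title>Broken</channel></rss>');
if ($feed === null) {
    // Here you could reject the feed, or clean it up and try again.
    echo "Feed is not well-formed:\n" . implode("\n", $errors) . "\n";
}
```

Comparing against `false` with `===` matters here, because a successfully parsed but empty element can also evaluate as falsy.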

Consider this very rough, simple class using SimpleXML:

class BlogPost
{
    public $date;
    public $ts;
    public $link;

    public $title;
    public $text;
    public $summary; // shortened, tag-free version of $text
}

class BlogFeed
{
    public $posts = array();

    function __construct($file_or_url)
    {
        $file_or_url = $this->resolveFile($file_or_url);
        if (!($x = simplexml_load_file($file_or_url)))
            return;

        foreach ($x->channel->item as $item)
        {
            $post = new BlogPost();
            $post->date  = (string) $item->pubDate;
            $post->ts    = strtotime((string) $item->pubDate);
            $post->link  = (string) $item->link;
            $post->title = (string) $item->title;
            $post->text  = (string) $item->description;

            // Create summary as a shortened body and remove images, 
            // extraneous line breaks, etc.
            $post->summary = $this->summarizeText($post->text);

            $this->posts[] = $post;
        }
    }

    private function resolveFile($file_or_url) {
        if (!preg_match('|^https?:|', $file_or_url))
            $feed_uri = $_SERVER['DOCUMENT_ROOT'] .'/shared/xml/'. $file_or_url;
        else
            $feed_uri = $file_or_url;

        return $feed_uri;
    }

    private function summarizeText($summary) {
        $summary = strip_tags($summary);

        // Truncate summary line to 100 characters
        $max_len = 100;
        if (strlen($summary) > $max_len)
            $summary = substr($summary, 0, $max_len) . '...';

        return $summary;
    }
}
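
One limitation worth noting: the class above reads only RSS 2.0 (`channel`/`item`). A minimal sketch of the Atom equivalent follows, assuming a feed that uses the standard Atom namespace as its default namespace. The function name `parseAtomEntries` and the sample feed are hypothetical.

```php
<?php
// Hypothetical helper, not part of the class above: reads Atom entries into arrays.
function parseAtomEntries(string $xml): array
{
    $feed = simplexml_load_string($xml);
    if ($feed === false) {
        return array();
    }

    $posts = array();
    foreach ($feed->entry as $entry) {
        // Atom allows either <content> or <summary>; prefer <content> when present.
        $body = isset($entry->content) ? $entry->content : $entry->summary;

        // Atom stores the URL in the href attribute of <link>, not as element text.
        // Assumption: the first <link> is the one you want; real feeds may carry
        // several links with different rel values.
        $posts[] = array(
            'title' => (string) $entry->title,
            'link'  => (string) $entry->link['href'],
            'date'  => (string) $entry->updated,
            'text'  => (string) $body,
        );
    }

    return $posts;
}

$atom = <<<XML
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example feed</title>
  <entry>
    <title>First post</title>
    <link href="http://example.com/first"/>
    <updated>2012-07-20T17:36:00Z</updated>
    <summary>Hello from Atom.</summary>
  </entry>
</feed>
XML;

foreach (parseAtomEntries($atom) as $post) {
    echo $post['title'] . ' -> ' . $post['link'] . "\n";
}
```

A feed reader that has to accept both formats could check whether the root element is `rss` or `feed` and dispatch accordingly.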
Brian Cline
  • 2
    you have an end-tag with no start tag. ;) – Talvi Watia Jul 26 '10 at 22:45
  • 132
    Well, I had one, but it was being eaten by SO's code formatter since it had no empty line above it. On a related note, you did not start your sentence with a capital letter. ;) – Brian Cline Jul 27 '10 at 03:51
  • 4
    Please change `$feed_uri = $feed_or_url;` to `$feed_uri = $file_or_url;` ... other than that, thank you for this code! It works great! – Tim Jan 18 '12 at 19:02
  • 5
    Note that while this solution is great, it'll only parse RSS feeds in its current form. Atom feeds will not be parsed due to their different schema. – András Szepesházi Jul 20 '12 at 17:36
  • 9
    Note that `eregi_replace` is now deprecated and has been replaced with `preg_replace`, as has `eregi` with `preg_match`. Documentation can be found [here](http://php.net/manual/en/function.preg-replace.php) and [here](http://php.net/manual/en/function.preg-match.php) respectively. – ITS Alaska Jun 25 '13 at 18:31
  • 1
    I don't understand what `cookHtmlSummarySoup()` is for. Why not use `strip_tags()`? – vladkras Nov 23 '13 at 08:49
  • 1
    @ITSAlaska Thanks for the reminder. I think even back when I posted this in 2008 it was old code. I've updated it with preg_match accordingly. – Brian Cline Dec 02 '13 at 22:43
  • 1
    @vladkras Good question. Not sure where that wacky method name came from, looks like someone here edited it. I much prefer a built-in, so I've updated this to use strip_tags(). Thanks for the tip. – Brian Cline Dec 02 '13 at 22:45
48

With four lines, I import an RSS feed into an array.

$feed = file_get_contents('http://yourdomains.com/feed.rss');
$xml = simplexml_load_string($feed, 'SimpleXMLElement', LIBXML_NOCDATA); // LIBXML_NOCDATA keeps CDATA text in the JSON
$json = json_encode($xml);
$array = json_decode($json, true);

For a more complex solution:

$feed = new DOMDocument();
$feed->load('file.rss');

// Look up the <channel> element once and reuse it.
$channel = $feed->getElementsByTagName('channel')->item(0);

$json = array();
$json['title'] = $channel->getElementsByTagName('title')->item(0)->firstChild->nodeValue;
$json['description'] = $channel->getElementsByTagName('description')->item(0)->firstChild->nodeValue;
$json['link'] = $channel->getElementsByTagName('link')->item(0)->firstChild->nodeValue;
$items = $channel->getElementsByTagName('item');

$json['item'] = array();

// Copy the fields of each <item> into the output array.
foreach ($items as $key => $item) {
    $title = $item->getElementsByTagName('title')->item(0)->firstChild->nodeValue;
    $description = $item->getElementsByTagName('description')->item(0)->firstChild->nodeValue;
    $pubDate = $item->getElementsByTagName('pubDate')->item(0)->firstChild->nodeValue;
    $guid = $item->getElementsByTagName('guid')->item(0)->firstChild->nodeValue;

    $json['item'][$key]['title'] = $title;
    $json['item'][$key]['description'] = $description;
    $json['item'][$key]['pubdate'] = $pubDate;
    $json['item'][$key]['guid'] = $guid;
}

echo json_encode($json);
PJunior
29

Your other options include:

josh3736
Philip Morton
  • 5
    Zend Feed http://framework.zend.com/manual/en/zend.feed.html – artur Feb 26 '10 at 21:57
  • 195
    I don't like such "answers" that give links without any comments. It looks like you googled it and linked to a few top results. Especially since the asker has some RSS experience and needs a *better* parser. – duality_ Jul 30 '11 at 13:49
  • 3
    In case somebody needs a little bit of advice, Last RSS is the easiest among the three listed above. There is only 1 file to "require", and it can fetch the RSS within 5 lines, with a decent array output. – Raptor May 11 '14 at 05:53
  • picoFeed https://github.com/fguillot/picoFeed – gadelat Apr 17 '17 at 00:40
  • I've used two of them: LastRss doesn't seem good enough as a fully functional helper, and SimplePie is a bit too complicated. I would like to try some others, but comments on those libs would help people understand them better than bare links. – noob Jun 21 '17 at 11:27
27

I would like to introduce a simple script to parse RSS:

$url = "http://www.banki.ru/xml/news.rss"; // URL to parse
$rss = simplexml_load_file($url); // XML parser

// Channel title + image
print '<h2><img style="vertical-align: middle;" src="'.$rss->channel->image->url.'" /> '.$rss->channel->title.'</h2>';

// RSS items loop
$i = 0; // counter
foreach ($rss->channel->item as $item) {
    if ($i >= 10) { // print only the first 10 items
        break;
    }
    print '<a href="'.$item->link.'">'.$item->title.'</a><br />';
    $i++;
}
13

If a feed isn't well-formed XML, you're supposed to reject it, no exceptions. You're entitled to call the feed's creator a bozo.

Otherwise you're paving the way to the mess that HTML ended up in.

Kornel
  • 3
    +1, you should not try to work around any XML that is not well-formed. We've had bad experiences with them; trust me, it was a big pain :( – Helen Neely Oct 10 '09 at 23:00
  • 36
    However, programmers do not get to choose business partners and have to parse what they are given. – Edmond Meinfelder Jun 03 '11 at 00:21
  • 2
    What if you're building a universal RSS/Atom feed reader? If any ill-formed XML file can "mess" your HTML, who is the bozo? ;) Be liberal in what you receive. – yPhil Sep 25 '13 at 11:40
6

The HTML Tidy library is able to fix some malformed XML files. Running your feeds through that before passing them on to the parser may help.
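
A minimal sketch of that cleanup step, assuming PHP's tidy extension is installed. The helper name `repairFeedXml` and the sample feed string are hypothetical, and Tidy cannot repair every kind of breakage, so the result should still go through the parser's normal error checks.

```php
<?php
// Hypothetical helper: tries to repair XML with the tidy extension, if available.
function repairFeedXml(string $xml): string
{
    if (!function_exists('tidy_repair_string')) {
        // Assumption: without ext/tidy we hand back the input unchanged.
        return $xml;
    }

    $config = array(
        'input-xml'  => true, // treat the input as XML, not HTML
        'output-xml' => true, // write XML back out
        'wrap'       => 0,    // do not re-wrap long lines
    );

    $repaired = tidy_repair_string($xml, $config, 'utf8');

    // tidy_repair_string() returns false on failure; fall back to the input.
    return is_string($repaired) ? $repaired : $xml;
}

$raw  = '<rss version="2.0"><channel><title>News</title></channel></rss>';
$feed = simplexml_load_string(repairFeedXml($raw));
```

Falling back to the original string keeps the calling code identical whether or not Tidy is present.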

ceejayoz
1

I use SimplePie to parse a Google Reader feed and it works pretty well and has a decent feature set.

Of course, I haven't tested it with non-well-formed RSS / Atom feeds, so I don't know how it copes with those; I'm assuming Google's are fairly standards-compliant! :)

1

The PHP RSS reader - http://www.scriptol.com/rss/rss-reader.php - is a complete but simple parser used by thousands of users...

Thinol
1

Personally, I use BNC Advanced Feed Parser. I like the template system, which is very easy to use.

Adam
-2

Another great free parser - http://bncscripts.com/free-php-rss-parser/ It's very light (only 3 KB) and simple to use!

Lucas