Best way to parse RSS/Atom feeds with PHP

Question

I'm currently using Magpie RSS but it sometimes falls over when the RSS or Atom feed isn't well formed. Are there any other options for parsing RSS and Atom feeds with PHP?

There is one problem with this request: most feed readers use PHP's core XML readers, and if the XML is not well-formed as required by the XML standard, they will fall over. You could look at ones that use a text reader instead of an XML reader; however, the load on the server will increase dramatically. I know this is answered; I'm just making people aware of the drawbacks of using XML feed readers. – Barkermn01, Sep 10 '13 at 09:29

Brian Cline · Answer 1 · 2013-12-02T22:48:03.220

172

I've always used the SimpleXML functions built in to PHP to parse XML documents. It's one of the few generic parsers out there that has an intuitive structure to it, which makes it extremely easy to build a meaningful class for something specific like an RSS feed. Additionally, it will detect XML warnings and errors, and upon finding any you could simply run the source through something like HTML Tidy (as ceejayoz mentioned) to clean it up and attempt it again.
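As a rough illustration of the error detection mentioned above, here is a minimal sketch that collects libxml's errors instead of letting PHP print warnings. The helper name `loadFeedXml()` and the sample feed string are assumptions made for this example; they are not part of the original answer.

```php
<?php
// Hypothetical helper (not from the original answer): parse a feed string
// and return the parsed document together with any libxml error messages.
function loadFeedXml(string $source): array
{
    $previous = libxml_use_internal_errors(true); // buffer errors internally
    libxml_clear_errors();

    $xml = simplexml_load_string($source);        // SimpleXMLElement or false

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);        // restore the old setting

    return [$xml === false ? null : $xml, $messages];
}

// Made-up sample input with a mismatched closing tag.
[$xml, $errors] = loadFeedXml('<rss><channel><title>Broken</channel></rss>');
if ($xml === null) {
    // This is the point where a cleanup pass (for example HTML Tidy) could be tried.
    echo "Feed is not well-formed:\n", implode("\n", $errors), "\n";
}
```

Returning the messages rather than echoing them inside the helper leaves it to the caller to decide whether to log, retry after cleanup, or reject the feed.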

Consider this very rough, simple class using SimpleXML:

class BlogPost
{
    var $date;
    var $ts;
    var $link;

    var $title;
    var $text;
    var $summary; // filled in by BlogFeed via summarizeText()
}

class BlogFeed
{
    var $posts = array();

    function __construct($file_or_url)
    {
        $file_or_url = $this->resolveFile($file_or_url);
        if (!($x = simplexml_load_file($file_or_url)))
            return;

        foreach ($x->channel->item as $item)
        {
            $post = new BlogPost();
            $post->date  = (string) $item->pubDate;
            $post->ts    = strtotime($item->pubDate);
            $post->link  = (string) $item->link;
            $post->title = (string) $item->title;
            $post->text  = (string) $item->description;

            // Create summary as a shortened body and remove images, 
            // extraneous line breaks, etc.
            $post->summary = $this->summarizeText($post->text);

            $this->posts[] = $post;
        }
    }

    private function resolveFile($file_or_url) {
        if (!preg_match('|^https?:|', $file_or_url))
            $feed_uri = $_SERVER['DOCUMENT_ROOT'] .'/shared/xml/'. $file_or_url;
        else
            $feed_uri = $file_or_url;

        return $feed_uri;
    }

    private function summarizeText($summary) {
        $summary = strip_tags($summary);

        // Truncate summary line to 100 characters
        $max_len = 100;
        if (strlen($summary) > $max_len)
            $summary = substr($summary, 0, $max_len) . '...';

        return $summary;
    }
}
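The class above reads `$x->channel->item`, which is the RSS layout. For Atom, something along these lines might work. This is only a sketch: `parseAtomEntries()` and the `ATOM_NS` constant are names invented for this example, and the choice of `updated`, `content` and `summary` as the counterparts of `pubDate` and `description` is an assumption about what you want to show.

```php
<?php
// Namespace URI used by Atom 1.0 feeds.
const ATOM_NS = 'http://www.w3.org/2005/Atom';

// Hypothetical helper: returns one array per <entry>, shaped like BlogPost.
function parseAtomEntries(string $source): array
{
    $feed = simplexml_load_string($source);
    if ($feed === false) {
        return [];
    }

    $posts = [];
    foreach ($feed->children(ATOM_NS)->entry as $entry) {
        $fields = $entry->children(ATOM_NS);

        // An entry may carry several <link> elements; take the first one that
        // is rel="alternate" or has no rel attribute.
        $link = '';
        foreach ($fields->link as $candidate) {
            $attrs = $candidate->attributes();
            $rel = (string) $attrs->rel;
            if ($rel === '' || $rel === 'alternate') {
                $link = (string) $attrs->href;
                break;
            }
        }

        // Atom uses <content> and/or <summary> where RSS uses <description>.
        $text = (string) $fields->content;
        if ($text === '') {
            $text = (string) $fields->summary;
        }

        $posts[] = [
            'date'  => (string) $fields->updated,
            'ts'    => strtotime((string) $fields->updated),
            'link'  => $link,
            'title' => (string) $fields->title,
            'text'  => $text,
        ];
    }

    return $posts;
}

// Made-up sample feed, for illustration only.
$sample = '<feed xmlns="http://www.w3.org/2005/Atom"><title>Demo</title>'
    . '<entry><title>Hello</title><link href="http://example.org/hello"/>'
    . '<updated>2003-12-13T18:30:02Z</updated><summary>Short text</summary></entry></feed>';

foreach (parseAtomEntries($sample) as $post) {
    echo $post['title'], ' -> ', $post['link'], "\n";
}
```

Asking for children by namespace URI rather than by prefix should keep this working whether the feed declares Atom as its default namespace or under a prefix.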

edited Dec 02 '13 at 22:48

answered Oct 30 '08 at 17:47

Brian Cline

2

you have an end-tag with no start tag. ;) – Talvi Watia Jul 26 '10 at 22:45
132

Well, I had one, but it was being eaten by SO's code formatter since it had no empty line above it. On a related note, you did not start your sentence with a capital letter. ;) – Brian Cline Jul 27 '10 at 03:51
4

Please change `$feed_uri = $feed_or_url;` to `$feed_uri = $file_or_url;` ... other than that, thank you for this code! It works great! – Tim Jan 18 '12 at 19:02
5

Note that while this solution is great, it'll only parse RSS feeds in its current form. Atom feeds will not be parsed due to their different schema. – András Szepesházi Jul 20 '12 at 17:36
9

Note that `eregi_replace` is now deprecated and has been replaced with `preg_replace` as well as `eregi` with `preg_match`. Documentations can be found [here](http://php.net/manual/en/function.preg-replace.php) and [here](http://php.net/manual/en/function.preg-match.php) respectively. – ITS Alaska Jun 25 '13 at 18:31
1

I don't understand what `cookHtmlSummarySoup()` is for. Why not use `strip_tags()`? – vladkras Nov 23 '13 at 08:49
1

@ITSAlaska Thanks for the reminder. I think even back when I posted this in 2008 it was old code. I've updated it with preg_match accordingly. – Brian Cline Dec 02 '13 at 22:43
1

@vladkras Good question. Not sure where that wacky method name came from, looks like someone here edited it. I much prefer a built-in, so I've updated this to use strip_tags(). Thanks for the tip. – Brian Cline Dec 02 '13 at 22:45

PJunior · Answer 2 · 2014-02-12T10:22:01.187

48

With four lines, I import an RSS feed into an array.

$feed = implode(file('http://yourdomains.com/feed.rss'));
$xml = simplexml_load_string($feed);
$json = json_encode($xml);
$array = json_decode($json,TRUE);

For a more complex solution

$feed = new DOMDocument();
$feed->load('file.rss');

// The first <channel> element holds the feed metadata and the items
$channel = $feed->getElementsByTagName('channel')->item(0);

$json = array();
$json['title'] = $channel->getElementsByTagName('title')->item(0)->firstChild->nodeValue;
$json['description'] = $channel->getElementsByTagName('description')->item(0)->firstChild->nodeValue;
$json['link'] = $channel->getElementsByTagName('link')->item(0)->firstChild->nodeValue;
$items = $channel->getElementsByTagName('item');

$json['item'] = array();

foreach ($items as $key => $item) {
    $title = $item->getElementsByTagName('title')->item(0)->firstChild->nodeValue;
    $description = $item->getElementsByTagName('description')->item(0)->firstChild->nodeValue;
    $pubDate = $item->getElementsByTagName('pubDate')->item(0)->firstChild->nodeValue;
    $guid = $item->getElementsByTagName('guid')->item(0)->firstChild->nodeValue;

    $json['item'][$key]['title'] = $title;
    $json['item'][$key]['description'] = $description;
    $json['item'][$key]['pubdate'] = $pubDate;
    $json['item'][$key]['guid'] = $guid;
}

echo json_encode($json);
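The four-line version has no error handling, and as far as I know the JSON round trip loses text that a feed wraps in CDATA unless the parser is told to merge it. Here is a minimal sketch that tries to cover both. `feedToArray()` is a name invented for this example, and it takes the feed as a string so that fetching it (for instance with `file_get_contents()`) stays a separate step.

```php
<?php
// Hypothetical helper: convert feed XML to a nested array, or null on failure.
function feedToArray(string $source): ?array
{
    $previous = libxml_use_internal_errors(true); // no PHP warnings on bad XML

    // LIBXML_NOCDATA merges CDATA sections into plain text nodes so that
    // their content survives the JSON round trip.
    $xml = simplexml_load_string($source, 'SimpleXMLElement', LIBXML_NOCDATA);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($xml === false) {
        return null; // not well-formed XML
    }

    $json = json_encode($xml);
    if ($json === false) {
        return null; // e.g. invalid UTF-8 in the feed
    }

    $array = json_decode($json, true);

    return is_array($array) ? $array : null;
}

// Made-up sample feed, for illustration only.
$source = '<rss version="2.0"><channel><title>Demo</title>'
    . '<item><title>First</title><description><![CDATA[Hello <b>world</b>]]></description></item>'
    . '<item><title>Second</title><description>Plain</description></item>'
    . '</channel></rss>';

$feed = feedToArray($source);
if ($feed === null) {
    echo "Could not parse the feed\n";
} else {
    echo $feed['channel']['item'][0]['description'], "\n";
}
```

One thing worth checking in your own data: with this conversion a channel that has a single `item` yields an associative array for that item rather than a list of items, so calling code may need to handle both shapes.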

edited Feb 12 '14 at 10:22

answered Nov 03 '13 at 10:14

PJunior

2

I just tried it. It does not give an array – samayo Jan 16 '14 at 02:20
Can you give me the RSS feed that you are using? – PJunior Jan 18 '14 at 00:09
2

In case you're wondering. It looks like he's using a tumblr rss feed. Anytumblrsite.com/rss would give you the same output. – andrewk Apr 12 '14 at 07:00
5

Used the 4 lines, did a great job :) but then I rewrote the 1st line : `$feed = file_get_contents('http://yourdomains.com/feed.rss');` _might be less intensive than file + implode_ – Guidouil May 30 '14 at 12:47
3

one line, $feed = json_decode(json_encode(simplexml_load_file('http://news.google.com/?output=rss')), true); – Will Bowman Sep 18 '14 at 14:22
i really like the one-liner - was looking for something like that - what about error-handling? – Fluchtpunkt Jun 24 '15 at 19:42
Why on earth are we converting an object into an array??? – musicin3d May 29 '18 at 19:41

score 29 · Accepted Answer · edited Sep 26 '12 at 18:03

29

Your other options include:

edited Sep 26 '12 at 18:03

josh3736

answered Oct 30 '08 at 15:53

Philip Morton

5

Zend Feed http://framework.zend.com/manual/en/zend.feed.html – artur Feb 26 '10 at 21:57
195

I don't like such "answers", giving links without any comments. Looks like you google it and link to a few top results. Especially since the asker has some RSS experience and needs a *better* parser. – duality_ Jul 30 '11 at 13:49
3

In case somebody needs a little bit of advice, Last RSS is the easiest among the three listed above. There is only 1 file to "require", and it can fetch the RSS within 5 lines, with a decent array output. – Raptor May 11 '14 at 05:53
picoFeed https://github.com/fguillot/picoFeed – gadelat Apr 17 '17 at 00:40
I've used two of them: LastRSS does not seem good enough as a fully functional helper, and SimplePie is a bit too complicated. I would like to try some others, but comments describing those libraries would help people more than bare links. – noob Jun 21 '17 at 11:27

score 27 · Answer 4 · answered Sep 12 '14 at 17:08

27

I would like to introduce a simple script to parse RSS:

$url = "http://www.banki.ru/xml/news.rss"; // URL of the feed to parse
$rss = simplexml_load_file($url); // SimpleXML parser; returns false on failure

// Channel title with the channel image
print '<h2><img style="vertical-align: middle;" src="'.$rss->channel->image->url.'" /> '.$rss->channel->title.'</h2>';

// Loop over the RSS items, printing only the first 10
$i = 0; // counter
foreach ($rss->channel->item as $item) {
    if ($i >= 10) {
        break;
    }
    print '<a href="'.$item->link.'">'.$item->title.'</a><br />';
    $i++;
}
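The script above prints feed values straight into the page. Since a feed is third-party content, it may be safer to escape those values first. A minimal sketch, where `renderFeedItems()` is a hypothetical helper name and the inline feed is made up for the example:

```php
<?php
// Hypothetical helper: build the link list with HTML-escaped link and title.
function renderFeedItems(SimpleXMLElement $rss, int $limit = 10): string
{
    $html = '';
    $count = 0;

    foreach ($rss->channel->item as $item) {
        if ($count >= $limit) {
            break;
        }

        // Escape both values: feed content comes from a third party.
        $link  = htmlspecialchars((string) $item->link, ENT_QUOTES, 'UTF-8');
        $title = htmlspecialchars((string) $item->title, ENT_QUOTES, 'UTF-8');

        $html .= '<a href="' . $link . '">' . $title . "</a><br />\n";
        $count++;
    }

    return $html;
}

// Made-up sample feed, for illustration only.
$rss = simplexml_load_string(
    '<rss version="2.0"><channel><title>Demo</title>'
    . '<item><title>Tom &amp; Jerry &lt;b&gt;</title>'
    . '<link>http://example.com/?a=1&amp;b=2</link></item>'
    . '</channel></rss>'
);
echo renderFeedItems($rss);
```

Note that escaping alone does not validate the URL scheme, so a stricter reader might also want to accept only `http` and `https` links.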

answered Sep 12 '14 at 17:08

Vladimir Lukyanov

Clear and simple solution! Works nicely. – John T Nov 30 '19 at 04:28
rather than using $xml = simplexml_load_string($feed), this works pretty simple, in printing the data too ... – Srinivas08 Oct 06 '20 at 05:53

score 13 · Answer 5 · answered Nov 27 '08 at 13:30

13

If a feed isn't well-formed XML, you're supposed to reject it, no exceptions. You're entitled to call the feed's creator a bozo.

Otherwise you're paving the way to the mess that HTML ended up in.

answered Nov 27 '08 at 13:30

Kornel

3

+1, you should not try to work around any XML that is not well-formed. We've had bad experiences with them, trust me, it was big pain :( – Helen Neely Oct 10 '09 at 23:00
36

However, programmers do not get to choose business partners and have to parse what they are given. – Edmond Meinfelder Jun 03 '11 at 00:21
2

What if you're building a universal RSS/Atom feed reader? If any ill-formed XML file can "mess" your HTML, who is the bozo? ;) Be liberal in what you receive. – yPhil Sep 25 '13 at 11:40

score 6 · Answer 6 · answered Oct 30 '08 at 17:16

6

The HTML Tidy library is able to fix some malformed XML files. Running your feeds through that before passing them on to the parser may help.
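A minimal sketch of that idea, assuming PHP's optional tidy extension is installed. `repairFeedXml()` and `loadFeedWithRepair()` are names invented for this example, and how much Tidy can actually fix depends on how the feed is broken, so treat the second parse as a best effort.

```php
<?php
// Hypothetical helper: run the source through Tidy in XML mode.
// Returns null when the tidy extension is missing or Tidy produces nothing.
function repairFeedXml(string $source): ?string
{
    if (!function_exists('tidy_repair_string')) {
        return null;
    }

    $config = [
        'input-xml'  => true, // treat the input as XML rather than HTML
        'output-xml' => true, // write the result as XML
        'wrap'       => 0,    // do not re-wrap long lines
    ];

    $repaired = tidy_repair_string($source, $config, 'utf8');

    return is_string($repaired) && $repaired !== '' ? $repaired : null;
}

// Hypothetical helper: strict parse first, Tidy only as a fallback.
function loadFeedWithRepair(string $source): ?SimpleXMLElement
{
    $previous = libxml_use_internal_errors(true);

    $xml = simplexml_load_string($source);
    if ($xml === false) {
        // Strict parse failed: try once more on the Tidy output, if any.
        $repaired = repairFeedXml($source);
        $xml = $repaired === null ? false : simplexml_load_string($repaired);
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $xml === false ? null : $xml;
}

// Made-up sample input with an unescaped ampersand.
$feed = loadFeedWithRepair('<rss><channel><title>A & B</title></channel></rss>');
echo $feed === null ? "Still not parseable\n" : "Parsed after cleanup\n";
```

Trying the strict parse first means well-formed feeds never pay the cost of the cleanup pass.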

answered Oct 30 '08 at 17:16

ceejayoz

score 1 · Answer 7 · answered Oct 30 '08 at 15:55

I use SimplePie to parse a Google Reader feed and it works pretty well and has a decent feature set.

Of course, I haven't tested it with non-well-formed RSS / Atom feeds so I don't know how it copes with those, I'm assuming Google's are fairly standards compliant! :)

score 1 · Answer 8 · answered Sep 27 '14 at 16:08

1

The PHP RSS reader - http://www.scriptol.com/rss/rss-reader.php - is a complete but simple parser used by thousands of users...

answered Sep 27 '14 at 16:08

Thinol

score 1 · Answer 9 · answered Apr 18 '10 at 12:34

1

Personally, I use BNC Advanced Feed Parser. I like its template system, which is very easy to use.

answered Apr 18 '10 at 12:34

Adam

score -2 · Answer 10 · answered Feb 17 '14 at 10:02

-2

Another great free parser - http://bncscripts.com/free-php-rss-parser/ It's very light ( only 3kb ) and simple to use!

answered Feb 17 '14 at 10:02

Lucas

Can't say it's "great": it uses gzinflate and base64_decode, which are typically disabled for security. – Will Bowman Sep 18 '14 at 14:20
It's a dead link for marketing purposes. – Sagive Jul 11 '20 at 11:39
