2

I'm currently building a new online feed reader in PHP. One of the features I'm working on is feed auto-discovery. If a user enters a website URL, the script will detect that it's not a feed and look for the real feed URL by parsing the HTML for the proper <link> tag.

The problem is that the way I'm currently detecting whether the URL is a feed or a website only works part of the time, and I know it can't be the best solution. Right now I'm taking the cURL response and running it through simplexml_load_string; if it can't be parsed, I treat it as a website. Here is the code:

$xml = @simplexml_load_string( $site_found['content'] );

if( !$xml ) // this is a website, not a feed
{
    // handle website
}
else
{
    // parse feed
}

Obviously, this isn't ideal. Also, when it runs into an HTML website that it can parse (well-formed XHTML, for example), it thinks it's a feed.

Any suggestions on a good way of detecting the difference between a feed and a non-feed in PHP?

halfer
Pepper

4 Answers

8

I would sniff for the various unique identifiers those formats have:

Atom: Source

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

RSS 0.90: Source

<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns="http://my.netscape.com/rdf/simple/0.9/">

Netscape RSS 0.91

<rss version="0.91">

etc. etc. (See the 2nd source link for a full overview).

As far as I can see, separating Atom and RSS should be pretty easy by looking for <feed> and <rss> tags, respectively. Plus you won't find those in a valid HTML document.

You could make an initial check to tell HTML and feeds apart by looking for <html> and <body> elements first. To avoid problems with invalid input, this may be a case where using regular expressions (over a parser) is finally justified for once :)

If it doesn't match the HTML test, run the Atom / RSS tests on it. If it is not recognized as a feed, or the XML parser chokes on invalid input, fall back to HTML again.

What that looks like in the wild (whether feed providers always conform to those rules) is a different question, but you should already be able to recognize a lot of feeds this way.
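
A minimal sketch of the sniffing approach described above, assuming the response body is already in a string. The function name sniff_document_type and the 4096-byte window are illustrative choices, not part of any library. Instead of strictly running the HTML test first, it lets whichever of the known tags appears first decide, so that HTML embedded inside a feed entry is less likely to cause a misclassification.

```php
<?php
// Hypothetical helper: classify a response body as 'html', 'atom', 'rss' or 'unknown'
// by looking for the first tell-tale opening tag near the start of the document.
function sniff_document_type(string $content): string
{
    // Root elements appear early, so only the beginning is inspected (assumed window size).
    $head = substr($content, 0, 4096);

    // preg_match returns the leftmost match, i.e. the first of these tags in the document.
    if (!preg_match('/<(html|body|feed|rss|rdf:RDF)[\s>]/i', $head, $m)) {
        return 'unknown';
    }

    switch (strtolower($m[1])) {
        case 'html':
        case 'body':
            return 'html';
        case 'feed':
            return 'atom';   // Atom root element
        default:
            return 'rss';    // <rss> (0.91 and later) or <rdf:RDF> (0.90 / 1.0)
    }
}
```

An 'unknown' result can be treated as the fallback case from the answer: hand the content to the HTML handler and look for a <link> tag there.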

Pekka
  • Yep, they are supposed to have those tag identifiers. But there are so many badly formed feeds and different versions out there that I can't rely on it. Looking for the <html> or <body> tag is interesting. I'll test that out. – Pepper Mar 14 '10 at 17:35
  • @Pepper yes, maybe compile lists of tags to sniff for? `html` and `body` for HTML, `rdf` and `item` (IIRC) for RSS, `feed` for Atom.... – Pekka Mar 14 '10 at 17:57
3

I think your best choice is to check the Content-Type header, as I assume that's how Firefox (or any other browser) does it. Besides, if you think about it, the Content-Type is the way a server tells user agents how to process the response content. Almost any decent HTTP server sends a correct Content-Type header.

Nevertheless, you could try to identify RSS/Atom in the content as a second choice if the first one "fails" (this criterion is up to you).

An additional benefit is that you only need to request the header instead of the entire document, thus saving you bandwidth, time, etc. You can do this with curl like this:

<?php
 $ch = curl_init("http://sample.com/feed");
 // Send a HEAD request instead of the default GET, so the server returns only the HTTP headers (no body).
 curl_setopt($ch, CURLOPT_NOBODY, true);
 curl_exec($ch);
 $conType = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);

 if (is_rss($conType)) { // You need to implement is_rss($conType)
    // TODO
 } elseif (is_html($conType)) { // You need to implement is_html($conType)
    // Search the HTML for a feed
 } else {
    // Error: the page has no RSS/Atom feed
 }
?>
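
A minimal sketch of the two helpers the code above leaves unimplemented. The helper media_type and the exact list of accepted types are assumptions: feeds are often served with generic XML types, so text/xml and application/xml are treated here as "possibly a feed", and the body would still need to be checked in that case.

```php
<?php
// Hypothetical helper: drop parameters such as "; charset=utf-8" and normalise case.
function media_type($contentType): string
{
    $parts = explode(';', (string) $contentType);
    return strtolower(trim($parts[0]));
}

// True for feed-specific media types and for generic XML types (assumed list).
function is_rss($contentType): bool
{
    $feedTypes = [
        'application/rss+xml',
        'application/atom+xml',
        'application/rdf+xml',
        'application/xml',
        'text/xml',
    ];
    return in_array(media_type($contentType), $feedTypes, true);
}

// True for HTML and XHTML media types.
function is_html($contentType): bool
{
    return in_array(media_type($contentType), ['text/html', 'application/xhtml+xml'], true);
}
```

Because curl_getinfo() can return null when the server sends no usable Content-Type header, both helpers accept null and simply return false for it, which lands in the final else branch.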
rohit89
Abraham
2

Why not try to parse your data with a component built specifically to parse RSS/Atom feeds, like Zend_Feed_Reader?

With that, if the parsing succeeds, you'll be pretty sure that the URL you used is indeed a valid RSS/Atom feed.


And I should add that you could use such a component to parse feeds in order to extract their information, too: no need to re-invent the wheel by parsing the XML "by hand" and dealing with special cases yourself.
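
If pulling in a feed library is not an option, the same "let a parser decide" idea can be approximated with SimpleXML alone. The sketch below is not Zend_Feed_Reader and does not validate the feed; it is a stand-in that only checks that the content is well-formed XML and that its root element is a known feed root. The function name detect_feed_by_parsing is hypothetical.

```php
<?php
// Hypothetical helper: returns 'atom' or 'rss', or null when the content is not
// well-formed XML or its root element is not a known feed root.
function detect_feed_by_parsing(string $content): ?string
{
    // Collect libxml errors internally instead of emitting warnings
    // (an alternative to silencing the call with the @ operator).
    $previous = libxml_use_internal_errors(true);
    $xml = simplexml_load_string($content);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($xml === false) {
        return null; // not well-formed XML
    }

    // getName() gives the root element's name without its namespace prefix,
    // so <rdf:RDF> is reported as "RDF".
    switch (strtolower($xml->getName())) {
        case 'feed':
            return 'atom';
        case 'rss':
        case 'rdf':
            return 'rss';
        default:
            return null; // well-formed XML (XHTML, for example), but not a feed
    }
}
```

Checking the root element name is what keeps a parseable XHTML page from being mistaken for a feed; badly formed feeds will still be rejected, which is where a dedicated library or tag sniffing has the advantage.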

Pascal MARTIN
  • Using simplexml_load_string and parsing by hand is working for me; it's detecting the difference between a website and a feed that's the issue. Thanks though ;) – Pepper Mar 14 '10 at 17:28
  • What if the feed is badly formed XML? Are you able to parse all of the extensions to feeds like tags and enclosures? Maybe you don't care about these things, but my experience is that feeds are not as standardized as you might expect and using an existing library will keep you from reinventing the wheel. – Jackson Miller Mar 14 '10 at 17:51
  • I'll give Zend_Feed_Reader a try. I tried SimplePie early in the project and I had a higher success rate parsing it myself. You're right about feeds not being standardized; it's a mess out there. – Pepper Mar 14 '10 at 17:55
0

Use the Content-Type HTTP response header to dispatch to the right handler.

halfer
Jan Algermissen
  • I think his problem goes deeper: he needs to work with many RSS sources, many of which don't even serve valid markup in their chosen format, let alone send the correct Content-Type header. – Pekka Mar 14 '10 at 18:25