
When using feedparser or some other Python library to download and parse RSS feeds, how can I reliably detect new items and modified items?

So far I have seen new items appear in feeds with publication dates earlier than that of the latest item. I have also seen feed readers display the same item, republished with slightly different content, as separate items. I am not implementing a feed reader application; I just want a sane strategy for archiving feed data.

Raj
muhuk

2 Answers


It depends on how much you trust the feed source. feedparser exposes an .id attribute on feed items, and this attribute should be unique for both RSS and Atom sources; see, for example, feedparser's Atom docs. Although .id will cover most cases, it's conceivable that a source might publish multiple items with the same id. In that case, you don't have much choice but to hash the item's content.
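As a rough sketch of that idea (not feedparser functionality, just one way to organise the bookkeeping), the snippet below keys each entry by its id, falls back to the link or a content hash when no id is present, and stores a hash per key so that a changed hash marks the item as modified. The helper names `entry_key`, `entry_fingerprint`, and `classify` are hypothetical, and hashing `title` plus `summary` is an assumption; pick whichever fields you consider fundamental.

```python
import hashlib


def entry_key(entry):
    """Return a stable key: the entry's id if present, else its link (may be None)."""
    return entry.get("id") or entry.get("link")


def entry_fingerprint(entry):
    """Hash the fields treated as fundamental (assumed here: title and summary)."""
    payload = "\x00".join([entry.get("title", ""), entry.get("summary", "")])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def classify(entries, seen):
    """Split entries into (new, modified) and update `seen`, a dict of key -> hash."""
    new, modified = [], []
    for entry in entries:
        fingerprint = entry_fingerprint(entry)
        # With neither id nor link, the content hash itself becomes the key.
        key = entry_key(entry) or fingerprint
        if key not in seen:
            new.append(entry)
        elif seen[key] != fingerprint:
            modified.append(entry)
        seen[key] = fingerprint
    return new, modified
```

Here `entries` could be `feedparser.parse(url).entries`, and `seen` would need to be persisted between runs (a JSON file or a small database table, for instance). Publication dates are deliberately ignored, so late-arriving items with old dates are still reported as new.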

lt_kije
  • hashing the contents can be feasible in my case. Would item.title & item.content be enough? – muhuk Apr 01 '09 at 07:25
  • Probably. Some feeds I follow change the title on identical items without changing the content; in those cases, I might only care about hashing by content. It depends on what you consider 'fundamental' about each item. – lt_kije Apr 01 '09 at 11:03
  • In any case, the solution would be to keep track of all "old" data on the receiving end, right? Either I keep track of the IDs I've processed or the hash values of the entries I've already processed. There's no way to identify a new entry without checking every entry in the RSS feed or trusting the feed's timestamps? – Alan Plum May 09 '11 at 23:40

The feedparser documentation describes two HTTP features that can help with this:

1. Using ETags to reduce bandwidth

The basic concept is that a feed publisher may provide a special HTTP header, called an ETag, when it publishes a feed. You should send this ETag back to the server on subsequent requests. If the feed has not changed since the last time you requested it, the server will return a special HTTP status code (304) and no feed data.

    >>> import feedparser
    >>> d = feedparser.parse('http://feedparser.org/docs/examples/atom10.xml')
    >>> d.etag
    '"6c132-941-ad7e3080"'
    >>> d2 = feedparser.parse('http://feedparser.org/docs/examples/atom10.xml', etag=d.etag)
    >>> d2.status
    304
    >>> d2.feed
    {}
    >>> d2.entries
    []
    >>> d2.debug_message
    'The feed has not changed since you last checked, so the server sent no data.  This is a feature, not a bug!'

2. Using Last-Modified headers to reduce bandwidth

In this case, the server publishes the last-modified date of the feed in the HTTP header. You can send this back to the server on subsequent requests, and if the feed has not changed, the server will return HTTP status code 304 and no feed data.

    >>> import feedparser
    >>> d = feedparser.parse('http://feedparser.org/docs/examples/atom10.xml')
    >>> d.modified
    (2004, 6, 11, 23, 0, 34, 4, 163, 0)
    >>> d2 = feedparser.parse('http://feedparser.org/docs/examples/atom10.xml', modified=d.modified)
    >>> d2.status
    304
    >>> d2.feed
    {}
    >>> d2.entries
    []
    >>> d2.debug_message
    'The feed has not changed since you last checked, so the server sent no data.  This is a feature, not a bug!'
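
For an archiving job, both validators can be remembered between runs and sent back together. The following is a minimal sketch under that assumption: `fetch_if_changed` and the `state` dict are hypothetical names, and the `parse` parameter exists only so the function can be exercised without network access. Note that a non-304 response only says the feed document changed; it does not say which entries are new or modified, so a per-entry comparison is still needed afterwards.

```python
def fetch_if_changed(url, state, parse=None):
    """Fetch `url`, sending back any validators remembered in `state`.

    Returns the parsed result, or None when the server answers 304.
    `state` is a plain dict that the caller persists between runs.
    """
    if parse is None:
        import feedparser  # third-party; imported lazily so a stub can be injected
        parse = feedparser.parse

    result = parse(url, etag=state.get("etag"), modified=state.get("modified"))

    if result.get("status") == 304:
        return None  # nothing changed since the last fetch

    # Remember whichever validators the server supplied for the next request.
    for name in ("etag", "modified"):
        if result.get(name):
            state[name] = result[name]
    return result
```

Not every server sends an ETag or a Last-Modified header, so the function degrades to a plain fetch when `state` is empty.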
Emma
Ron Hudson