
I am using Python and the RSS feedparser module to retrieve RSS entries. However, I only want to retrieve a news item if it is no more than x days old.

For example, if x=4 then my Python code should not fetch anything more than four days older than the current date.

Feedparser lets you read the 'published' date for each entry, but it is of type unicode and I don't know how to convert it into a datetime object.

Here is some example input:

date = 'Thu, 29 May 2014 20:39:20 +0000'

Here is what I have tried:

from datetime import datetime
date_object = datetime.strptime(date, '%a, %d %b %Y %H:%M:%S %z')

This is the error I get:

ValueError: 'z' is a bad directive in format '%a, %d %b %Y %H:%M:%S %z'

This is what I hope to do with it:

from datetime import datetime
a = datetime(today)
b = datetime(RSS_feed_entry_date)
>>> a-b
datetime.timedelta(6, 1)
(a-b).days
6
Cœur
timebandit

2 Answers

2

For this, you already have a time.struct_time: look at f.entries[0].published_parsed (where f is the result of feedparser.parse).

You can use time.mktime to convert this to a timestamp and compare it with time.time() to see how far in the past it is.

An example:

>>> import feedparser
>>> import time

>>> f = feedparser.parse("http://feeds.bbci.co.uk/news/rss.xml")
>>> f.entries[0].published_parsed
time.struct_time(tm_year=2014, tm_mon=5, tm_mday=30, tm_hour=14, tm_min=6, tm_sec=8, tm_wday=4, tm_yday=150, tm_isdst=0)

>>> time.time() - time.mktime(f.entries[0].published_parsed)
4985.511506080627

Obviously this will be a different value for you, but if it is less than 86400 * 4 (the number of seconds in 4 days, in your case), the entry is recent enough to keep.

So, concisely,

[entry for entry in f.entries if time.time() - time.mktime(entry.published_parsed) < (86400*4)]

would give you your list.
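The same filter can be written as a small helper that also skips entries without a usable date. This is only a sketch: the names recent_entries, max_age_days and now are made up for illustration and are not part of feedparser. It uses calendar.timegm instead of time.mktime, on the assumption that feedparser's parsed dates are in UTC (time.mktime interprets the tuple as local time, which can shift the result by your UTC offset).

```python
import calendar
import time

SECONDS_PER_DAY = 86400


def recent_entries(entries, max_age_days, now=None):
    """Return the entries published no more than max_age_days ago.

    Hypothetical helper: `entries` is expected to be something like
    f.entries, where each entry is dict-like and may carry a
    `published_parsed` time tuple (assumed to be in UTC).
    """
    if now is None:
        now = time.time()
    cutoff = now - max_age_days * SECONDS_PER_DAY

    recent = []
    for entry in entries:
        parsed = entry.get("published_parsed")
        if parsed is None:
            # Missing or unparseable date: skip rather than guess.
            continue
        # calendar.timegm treats the tuple as UTC and returns a Unix timestamp.
        if calendar.timegm(parsed) >= cutoff:
            recent.append(entry)
    return recent
```

With the feed from the example above, recent_entries(f.entries, 4) should give the entries from the last four days.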

Chris Clarke
  • very good thanks: I am ashamed to say that I did not fully understand the feedparser documentation, otherwise I would have picked up on this one. Thanks again!! – timebandit May 30 '14 at 15:42
  • This should work pretty well, but be careful: date fields in feeds are often dirty and may trigger parsing errors. – Julien Genestoux Jun 01 '14 at 20:00
1
from datetime import datetime

date = 'Thu, 29 May 2014 20:39:20 +0000'
# strptime rejects %z here, so drop the '+hhmm' offset before parsing.
# Note that the offset itself is discarded, not applied.
restOfDate = date
if '+' in date:
    restOfDate = date.split('+')[0]
date_object = datetime.strptime(restOfDate.strip(), '%a, %d %b %Y %H:%M:%S')
print(date_object)

This yields 2014-05-29 20:39:20. While researching your timezone error I came across another SO question that says strptime has trouble with time zones (link to question).
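If the offset should be applied rather than dropped, one alternative is the standard library's email.utils, which understands this RFC 822 style of date. The sketch below is not from either answer; parse_rss_date is a hypothetical name, and the function returns a naive datetime expressed in UTC, or None when the string cannot be parsed.

```python
from datetime import datetime, timedelta
from email.utils import mktime_tz, parsedate_tz

EPOCH = datetime(1970, 1, 1)


def parse_rss_date(text):
    """Convert an RFC 822 date string to a naive UTC datetime, or None.

    Hypothetical helper for illustration only.
    """
    parsed = parsedate_tz(text)
    if parsed is None:
        # parsedate_tz returns None for strings it cannot interpret.
        return None
    # mktime_tz applies the offset and returns a UTC Unix timestamp.
    return EPOCH + timedelta(seconds=mktime_tz(parsed))


print(parse_rss_date('Thu, 29 May 2014 20:39:20 +0000'))
```

Subtracting the result from the current UTC time gives a timedelta, and its .days attribute is the age in whole days that the question asks for.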

Community
heinst