3

I am using feedparser to get RSS data. Here is my code:

>>> import datetime
>>> import time
>>> import feedparser

>>> d=feedparser.parse("http://.../rss.xml")

>>> datetimee_rss = d.entries[0].published_parsed

>>> datetimee_rss
time.struct_time(tm_year=2015, tm_mon=5, tm_mday=8, tm_hour=16, tm_min=57, tm_sec=39, tm_wday=4, tm_yday=128, tm_isdst=0)

>>> datetime.datetime.fromtimestamp(time.mktime(datetimee_rss))
datetime.datetime(2015, 5, 8, 17, 57, 39)

In my timezone (France), the actual publication time is May 8th, 2015, 18:57.

In the RSS XML, the value is <pubDate>Fri, 08 May 2015 18:57:39 +0200</pubDate>

When I parse it into a datetime, I get 2015, 5, 8, 17, 57, 39.

How can I get 2015, 5, 8, 18, 57, 39 without a dirty hack, simply by configuring the correct timezone?

EDIT:

By doing :

>>> from pytz import timezone

>>> datetime.datetime.fromtimestamp(time.mktime(datetimee_rss),tz=timezone('Europe/Paris'))
datetime.datetime(2015, 5, 8, 17, 57, 39, tzinfo=<DstTzInfo 'Europe/Paris' CEST+2:00:00 DST>)

This gives something nicer, but it doesn't work with the rest of the script: I get plenty of "TypeError: can't compare offset-naive and offset-aware datetimes" errors.

Blusky
  • I don't know how feed parser handles those dates, but the resulting datetimes and time tuples aren't actually tz aware at all. – jwilner May 08 '15 at 18:39
  • 1
    Aside from your Python problem, you should note that timestamps in RSS feeds are generally very messy and you should probably not "trust" them by default. Several services cheat by using their "discovery" date for news items. – Julien Genestoux May 09 '15 at 19:13
  • @JulienGenestoux I've thought of that. I'll try it if things get too complicated in the live environment :-) Thx ! – Blusky May 10 '15 at 09:26

3 Answers

2

feedparser does provide the original datetime string (just remove the _parsed suffix from the attribute name), so if you know the format of the string, you can parse it into a tz-aware datetime object yourself.

For example, with your code, you can get the tz-aware object as such:

datetime.datetime.strptime(d.entries[0].published, '%a, %d %b %Y %H:%M:%S %z')

For more on strptime(), see https://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior

EDIT: Python 2.x doesn't support the %z directive in strptime(), so use python-dateutil instead:

pip install python-dateutil

then

from dateutil import parser
datetime_rss = parser.parse(d.entries[0].published)

Documentation: https://dateutil.readthedocs.org/en/latest/
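On Python 3, where strptime() does accept %z, the first approach can be sketched as follows. This is a minimal illustration rather than part of the original answer: the `published` string is the pubDate value from the question, hard-coded so the snippet runs without a feed, and it assumes the default C/English locale, since %a and %b match locale-dependent day and month names.

```python
from datetime import datetime, timedelta

# pubDate value from the question, hard-coded here for illustration
published = "Fri, 08 May 2015 18:57:39 +0200"

# %z parses the "+0200" offset, so the result is a tz-aware datetime
aware = datetime.strptime(published, "%a, %d %b %Y %H:%M:%S %z")

# Dropping tzinfo keeps the wall-clock time as written in the feed (18:57:39).
# This is only meaningful if the feed's offset matches your local timezone.
naive_wall_clock = aware.replace(tzinfo=None)

print(aware.isoformat())             # 2015-05-08T18:57:39+02:00
print(naive_wall_clock.isoformat())  # 2015-05-08T18:57:39
```

The naive value can be compared with datetime.now(), but only under the same-timezone assumption stated in the comment.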

oxymor0n
  • I got the following error :`ValueError: 'z' is a bad directive in format '%a, %d %b %Y %H:%M:%S %z'` – Blusky May 08 '15 at 19:24
  • If you are using Python 2.x, it doesn't support `%z`. Please check the updated answer above. – oxymor0n May 08 '15 at 19:46
  • It's working in the console, but the resulting datetime raises an exception when compared with other datetimes such as `datetime.now()`: `TypeError: can't compare offset-naive and offset-aware datetimes` (same as in my edited question). Is it possible to convert "offset-aware" to "offset-naive" ? – Blusky May 08 '15 at 21:46
  • @Blusky you can't compare a tz-aware datetime object with a naive one, since the latter doesn't have any timezone information. Do you have any idea what the timezone of your naive datetime object is? – oxymor0n May 08 '15 at 22:21
  • local tz is 'Europe/Paris', the datetime is provided by datetime.datetime.now(). Isn't it strange that now doesn't have tz ? – Blusky May 08 '15 at 22:22
  • Ok, that did it, I used `datetime_rss = parser.parse(d.entries[0].published).replace(tzinfo=None)`. Thank you very much !!! – Blusky May 08 '15 at 22:51
  • @Blusky your solution is perfect if you can be confident that the timezone of the RSS feed is the same as your local timezone. If that's not the case (for example, you might be parsing an American feed whose timestamps are in EDT i.e. -0400), then you should account for the different timezones by converting everything to UTC. – oxymor0n May 09 '15 at 02:14
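The "convert everything to UTC" suggestion in the last comment could look roughly like the sketch below. It assumes Python 3.3+ and uses the standard library's email.utils.parsedate_to_datetime instead of dateutil. The helper name `to_utc` and the second (New York) date string are hypothetical, added for illustration, and each pubDate is assumed to carry an explicit numeric offset.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def to_utc(pub_date):
    """Parse an RFC 822 style pubDate string and normalise it to UTC."""
    # Assumes the string has an explicit offset such as +0200 or -0400
    return parsedate_to_datetime(pub_date).astimezone(timezone.utc)


# Same instant, written with two different offsets
paris = to_utc("Fri, 08 May 2015 18:57:39 +0200")
new_york = to_utc("Fri, 08 May 2015 12:57:39 -0400")

# An aware "now" in UTC can be compared with the parsed values
now_utc = datetime.now(timezone.utc)

print(paris.isoformat())  # 2015-05-08T16:57:39+00:00
print(paris == new_york)  # True
print(paris < now_utc)    # True
```

Keeping every datetime aware and in UTC avoids the naive/aware TypeError without relying on the feed and the local machine sharing a timezone.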
1

feedparser returns the parsed time in UTC. It is incorrect to apply time.mktime() to it, because mktime() interprets the tuple as local time (that is only correct if your local timezone is UTC, which it isn't). You should use calendar.timegm() instead:

import calendar
from datetime import datetime

utc_tuple = d.entries[0].published_parsed
posix_timestamp = calendar.timegm(utc_tuple)
local_time_as_naive_datetime_object = datetime.fromtimestamp(posix_timestamp) # assume non-"right" timezone

RSS feeds may use many different date formats; I would leave the date parsing to the feedparser module.

If you want to get the local time as an aware datetime object:

from tzlocal import get_localzone # $ pip install tzlocal

local_timezone = get_localzone()
local_time = datetime.fromtimestamp(posix_timestamp, local_timezone) # assume non-"right" timezone
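To make the calendar.timegm() step concrete, here is a small self-contained sketch using the struct_time shown in the question. It assumes Python 3 for datetime.timezone. The fixed +02:00 offset stands in for the feed's timezone and is an illustration only; a real script would use a proper zone such as the one returned by tzlocal.

```python
import calendar
import time
from datetime import datetime, timedelta, timezone

# The UTC time tuple from the question (16:57:39 UTC), rebuilt by hand
utc_tuple = time.struct_time((2015, 5, 8, 16, 57, 39, 4, 128, 0))

# timegm() treats the tuple as UTC; mktime() would treat it as local time
posix_timestamp = calendar.timegm(utc_tuple)

# Aware datetime in UTC, then shifted to a fixed +02:00 offset (illustrative)
as_utc = datetime.fromtimestamp(posix_timestamp, tz=timezone.utc)
as_feed_time = as_utc.astimezone(timezone(timedelta(hours=2)))

print(posix_timestamp)           # 1431104259
print(as_utc.isoformat())        # 2015-05-08T16:57:39+00:00
print(as_feed_time.isoformat())  # 2015-05-08T18:57:39+02:00
```

The 18:57:39 result matches the pubDate in the question, which is what the mktime() version failed to produce.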
jfs
0

Try this:

>>> import os
>>> os.environ['TZ'] = 'Europe/Paris'
>>> time.tzset()
>>> time.tzname
('CET', 'CEST')
Tuan Anh Hoang-Vu
  • I got the following error :`AttributeError: 'module' object has no attribute 'tzset'`. Apparently, tzset is only available for Linux. I am running on Windows, and would like the script to be multiplatform. Any other idea ? – Blusky May 08 '15 at 19:27