7

I am having some difficulty working with times/timezones. I have raw JSON data of the form

{
  "Date": "28 Sep 2009 00:00:00",
  ....
}

This data is then loaded into MongoDB, and this string representation of the date is transformed into a JavaScript Date object. This conversion to UTC time results in the following date

{
  "_id": ObjectId("577a788f4439e17afd4e21f7"),
  "Date": ISODate("2009-09-27T23:00:00Z")
}

It "looks" as though the date has actually been moved forward a day, I'm assuming (perhaps incorrectly) that this is because my machine is set to Irish Standard Time.

I then read this data from MongoDB and use it to create a pandas DatetimeIndex

idx =  pd.DatetimeIndex([x['Date'] for x in test_docs], freq='D')

which gives me

enter image description here

which is incorrect since the time has not been converted back correctly from UTC to local time. So I followed the solution given in this answer

idx =  pd.DatetimeIndex([x['Date'] for x in test_docs], freq='D')
idx = idx.tz_localize(tz=tz.tzutc())
idx = idx.tz_convert(tz=tz.tzlocal())
frame = DataFrame(test_docs, index=idx)
frame = frame.drop('Date', 1)

which gives me the right day back

enter image description here

I then normalize the DatetimeIndex so the hours are removed, allowing me to group all entries by day.

frame.groupby(idx).sum()

At this point, however, something strange happens. The dates end up getting grouped as follows

enter image description here

but this doesn't reflect the dates in the frame

enter image description here

Can anyone shed some light on where I might be going wrong?


Response to @ptrj

Explicitly using my timezone as a string

idx =  pd.DatetimeIndex([x['Date'] for x in test_docs], freq='D')
idx = idx.tz_localize(tz=tz.tzutc())
idx = idx.tz_convert(tz='Europe/Dublin')
idx = idx.normalize()
frame = DataFrame(test_docs, index=idx)
...
...
aggregate = frame.groupby(idx).sum()
aggregate.plot()

this doesn't work for me, it results in the following plot

enter image description here

For some reason the groupby is not properly grouping for 2014, as shown below

enter image description here

If instead, I use

idx = idx.tz_convert(tz.gettz('Europe/Dublin'))

I get the same problem

Convert to an object

idx =  pd.DatetimeIndex([x['Date'] for x in test_docs], freq='D')
idx = idx.tz_localize(tz=tz.tzutc())
idx = idx.tz_convert(tz=tz.tzlocal())
idx = idx.normalize()
frame = DataFrame(test_docs, index=idx)
aggregate = frame.groupby(idx.astype(object)).sum()

This approach seems to work correctly for me

enter image description here

Community
  • 1
  • 1
Philip O'Brien
  • 4,146
  • 10
  • 46
  • 96
  • Is `idx` a column in `frame` or a separate index? It looks like `idx` is not aligned with `frame.index` (`idx[0]` doesn't match the index in `frame` you pasted - if it's the real data). If you could paste a small example with `frame` and `idx` that causes this error, it would be helpful. – ptrj Jul 08 '16 at 18:26
  • Sorry, I should have made it more clear. `idx` is the index of `frame`. I updated the question to reflect this – Philip O'Brien Jul 09 '16 at 15:11
  • Hmm, the error with timezones is strange. I can't reproduce it. Maybe it's another bug. – ptrj Jul 12 '16 at 19:24

2 Answers2

3

I was able to reproduce the error with the following data:

idx0 = pd.date_range('2011-11-11', periods=4)
idx1 = idx0.tz_localize(tz.tzutc())
idx2 = idx1.tz_convert(tz.tzlocal())
df = pd.DataFrame([1, 2, 3, 4])

df.groupby(idx2).sum()
Out[20]: 
                           0
1970-01-01 00:00:00-05:00  9
2011-11-10 19:00:00-05:00  1

It's a bug deep in the pandas code, related exclusively to tz.tzlocal(). It manifests itself also in:

idx2.tz_localize(None)
Out[27]: 
DatetimeIndex(['2011-11-10 19:00:00', '1970-01-01 00:00:00',
               '1970-01-01 00:00:00', '1970-01-01 00:00:00'],
              dtype='datetime64[ns]', freq='D')

You can use any of the following solutions:

  • use explicitly your timezone as a string:

    idx2 = idx1.tz_convert(tz='Europe/Dublin')
    df.groupby(idx2).sum()
    Out[29]: 
                               0
    2011-11-11 00:00:00+00:00  1
    2011-11-12 00:00:00+00:00  2
    2011-11-13 00:00:00+00:00  3
    2011-11-14 00:00:00+00:00  4
    

    or if it doesn't work:

    idx2 = idx1.tz_convert(tz.gettz('Europe/Dublin'))
    
  • convert it to an object:

    df.groupby(idx2.astype(object)).sum()
    Out[32]: 
                               0
    2011-11-10 19:00:00-05:00  1
    2011-11-11 19:00:00-05:00  2
    2011-11-12 19:00:00-05:00  3
    2011-11-13 19:00:00-05:00  4
    

Basically, converting to anything else than DatetimeIndex with tz=tz.local() should work.


EDIT: This bug has been just fixed on pandas github. The fix will be available in pandas 0.19 release.

ptrj
  • 5,152
  • 18
  • 31
0

I have managed to get around this for now by changing my groupby to the following

frame.groupby([pd.DatetimeIndex([x.date() for x in frame.index])]).sum()

so where I was originally trying to groupby

idx =  pd.DatetimeIndex([x['Date'] for x in test_docs], freq='D')
idx = idx.tz_localize(tz=tz.tzutc())
idx = idx.tz_convert(tz=tz.tzlocal())
frame.groupby(idx).sum()

I am now calling the date method on each element of the index prior to performing the groupby operation.

I'm posting this as an answer in case nobody replies, but I am hoping for someone to answer and explain what is happening, as my 'solution' seems too hacky for my tastes.

Philip O'Brien
  • 4,146
  • 10
  • 46
  • 96