2

I have a list of data that has been read from MongoDB. A subset of the data can be found in this gist. I am creating a DataFrame from this list, using the Date fields to create a DatetimeIndex. The dates were recorded originally in my local timezone, but in Mongo they have no timezone information attached, so I correct for DST as advised here.

from datetime import datetime
from dateutil import tz

# data is the list from the gist
dates = [x['Date'] for x in data]
idx =  pd.DatetimeIndex(dates, freq='D')
idx = idx.tz_localize(tz=tz.tzutc())
idx = idx.tz_convert(tz='Europe/Dublin')
idx = idx.normalize()
frame = DataFrame(data, index=idx)
frame = frame.drop('Date', 1)

everything seems to work fine, and my frame looks like this

                           Events         ID
2008-03-31 00:00:00+01:00     0.0  116927302
2008-03-30 00:00:00+00:00  2401.0  116927302
2008-03-31 00:00:00+01:00     0.0  116927307
2008-03-30 00:00:00+00:00     0.0  116927307
2008-03-31 00:00:00+01:00     0.0  121126919
2008-03-30 00:00:00+00:00  1019.0  121126919
2008-03-30 00:00:00+00:00     0.0  121126922
2008-03-31 00:00:00+01:00     0.0  121126922
2008-03-30 00:00:00+00:00     0.0  121127133
2008-03-31 00:00:00+01:00     0.0  121127133
2008-03-31 00:00:00+01:00     0.0  131677370
2008-03-30 00:00:00+00:00     0.0  131677370
2008-03-30 00:00:00+00:00     0.0  131677416
2008-03-31 00:00:00+01:00     0.0  131677416

Now I want to use both the original DatetimeIndex and the ID column to create a MultiIndex as shown here. When I try this, however, I get an error that wasn't raised when originally creating the DatetimeIndex

frame.set_index([frame.ID, idx])

NonExistentTimeError: 2008-03-30 01:00:00

If I just do frame.set_index(idx) without the MultiIndex, it raises no error

Versions

  • Python 2.7.11
  • Pandas 0.18.0
Community
  • 1
  • 1
Philip O'Brien
  • 4,146
  • 10
  • 46
  • 96

1 Answers1

1

You need sort_index first and then append column ID to index:

frame = frame.sort_index()
frame.set_index('ID', append=True, inplace=True)
print (frame)
                                     Events
                          ID               
2008-03-30 00:00:00+00:00 168445814     0.0
                          168445633     0.0
                          168445653     0.0
                          245514429     0.0
                          168445739     0.0
                          168445810     0.0
                          332955940     0.0
                          168445875     0.0
                          168445628     0.0
                          217596128  1779.0
                          177336685     0.0
                          180799848     0.0
                          215797757     0.0
                          180800351  1657.0
                          183192871     0.0
...
...     

If need another ordering of levels use DataFrame.swaplevel:

frame = frame.sort_index()
frame.set_index('ID', append=True, inplace=True)
frame = frame.swaplevel(0,1)
print (frame)
                                     Events
ID                                         
168445814 2008-03-30 00:00:00+00:00     0.0
168445633 2008-03-30 00:00:00+00:00     0.0
168445653 2008-03-30 00:00:00+00:00     0.0
245514429 2008-03-30 00:00:00+00:00     0.0
168445739 2008-03-30 00:00:00+00:00     0.0
168445810 2008-03-30 00:00:00+00:00     0.0
332955940 2008-03-30 00:00:00+00:00     0.0
168445875 2008-03-30 00:00:00+00:00     0.0
168445628 2008-03-30 00:00:00+00:00     0.0
217596128 2008-03-30 00:00:00+00:00  1779.0
177336685 2008-03-30 00:00:00+00:00     0.0
180799848 2008-03-30 00:00:00+00:00     0.0
215797757 2008-03-30 00:00:00+00:00     0.0
180800351 2008-03-30 00:00:00+00:00  1657.0
183192871 2008-03-30 00:00:00+00:00     0.0
186439064 2008-03-30 00:00:00+00:00     0.0
199856024 2008-03-30 00:00:00+00:00     0.0
...
...

If need copy column to index use set_index(frame.ID, ...:

frame = frame.sort_index()
frame.set_index(frame.ID, append=True, inplace=True)
frame = frame.swaplevel(0,1)
print (frame)
                                     Events         ID
ID                                                    
168445814 2008-03-30 00:00:00+00:00     0.0  168445814
168445633 2008-03-30 00:00:00+00:00     0.0  168445633
168445653 2008-03-30 00:00:00+00:00     0.0  168445653
245514429 2008-03-30 00:00:00+00:00     0.0  245514429
168445739 2008-03-30 00:00:00+00:00     0.0  168445739
168445810 2008-03-30 00:00:00+00:00     0.0  168445810
332955940 2008-03-30 00:00:00+00:00     0.0  332955940
168445875 2008-03-30 00:00:00+00:00     0.0  168445875
168445628 2008-03-30 00:00:00+00:00     0.0  168445628
217596128 2008-03-30 00:00:00+00:00  1779.0  217596128
177336685 2008-03-30 00:00:00+00:00     0.0  177336685
180799848 2008-03-30 00:00:00+00:00     0.0  180799848
215797757 2008-03-30 00:00:00+00:00     0.0  215797757
180800351 2008-03-30 00:00:00+00:00  1657.0  180800351
183192871 2008-03-30 00:00:00+00:00     0.0  183192871
186439064 2008-03-30 00:00:00+00:00     0.0  186439064
...
...                     
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • That's great thanks. How come the `sort_index` is necessary? – Philip O'Brien Jul 21 '16 at 11:34
  • 1
    Hmmm, I think many function in pandas need it. You can check [docs](http://pandas.pydata.org/pandas-docs/stable/timeseries.html#datetimeindex) - `While pandas does not force you to have a sorted date index, some of these methods may have unexpected or incorrect behavior if the dates are unsorted. So please be careful.` – jezrael Jul 21 '16 at 11:38
  • That's interesting, thanks for the great answer and insightful comment (not to mention the speed of reply!) – Philip O'Brien Jul 21 '16 at 11:39