1

I want to select data from a dataframe for a particular day of the year. Here is what I have so far as a minimal example.

import pandas as pd
from datetime import datetime 
from datetime import timedelta
import numpy.random as npr 
rng = pd.date_range('1/1/1990', periods=365*10, freq='D')
df1 = pd.DataFrame(npr.randn(len(rng)), index=rng)
print df1

That generates:

                   0
1990-01-01 -0.032601
1990-01-02 -0.496401
1990-01-03  0.444490

etc. Now I make a list of dates that I want to extract. I have used this before in pandas, but I suspect this is not the best way to get values for a particular date. Anyway,

td = timedelta(days=31)
dr = pd.date_range(datetime(1990,12,31)+td,datetime(2000,12,31),
                    freq=pd.DateOffset(months=12, days=0))
print dr

This, of course, generates:

DatetimeIndex(['1991-01-31', '1992-01-31', '1993-01-31', '1994-01-31',
               '1995-01-31', '1996-01-31', '1997-01-31', '1998-01-31',
               '1999-01-31', '2000-01-31'],
dtype='datetime64[ns]', freq='<DateOffset: kwds={'months': 12, 'days': 0}>', tz=None)

When I try to slice the dataframe by the list of dates, I generate an error:

monthly_df1 = df1[dr]

Output:

KeyError: "['1991-01-30T16:00:00.000000000-0800' '1992-01-30T16:00:00.000000000-0800'\n 
'1993-01-30T16:00:00.000000000-0800' '1994-01-30T16:00:00.000000000-0800'\n 
'1995-01-30T16:00:00.000000000-0800' '1996-01-30T16:00:00.000000000-0800'\n 
'1997-01-30T16:00:00.000000000-0800' '1998-01-30T16:00:00.000000000-0800'\n 
'1999-01-30T16:00:00.000000000-0800' '2000-01-30T16:00:00.000000000-0800'] 
not in index"

I think that I have two fundamental problems here: (1) there is a better way to extract yearly data for a particular date; and (2) the time series in the dataframe and date_range list are different. I would appreciate information on both problems. Thanks, community.

2 Answers2

3

You could use .ix to filter dr dates from df1

In [107]: df1.ix[dr]
Out[107]:
                   0
1991-01-31 -1.239096
1992-01-31  0.153730
1993-01-31 -0.685778
1994-01-31  0.132170
1995-01-31  0.154965
1996-01-31  1.800437
1997-01-31  2.725209
1998-01-31 -0.084751
1999-01-31  1.604511
2000-01-31       NaN

Even df1.loc[dr] works.


Also, for this case, you can just pass these conditions to extract the dates

In [108]: df1[(df1.index.month==1) & (df1.index.day==31)]
Out[108]:
                   0
1990-01-31 -0.362652
1991-01-31 -1.239096
1992-01-31  0.153730
1993-01-31 -0.685778
1994-01-31  0.132170
1995-01-31  0.154965
1996-01-31  1.800437
1997-01-31  2.725209
1998-01-31 -0.084751
1999-01-31  1.604511
Zero
  • 74,117
  • 18
  • 147
  • 154
  • Both of these work for my original post. For some reason, only the second works for the problem in my real code (though not for an arbitrary day of year). Perhaps I need to write a little function to generate the mask parameters for an arbitrary day of year. – marathonman4202 May 23 '15 at 16:55
  • Thanks @JohnGalt. This does work, and it is easy enough to create a mask with something like datetime.datetime(year, 1, 1) + datetime.timedelta(days - 1) (see [post here](http://stackoverflow.com/questions/2427555/python-question-year-and-day-of-year-to-date)) – marathonman4202 May 23 '15 at 17:29
1

The easiest method that you can use is to get the year's data with partial string indexing, then resample annually

df1['1990-12-31':].resample('A', how='first')

The string slicing is described in the pandas documentation under 17.4.1 DatetimeIndex Partial String Indexing. With this method, you can cut out the creation of the timedelta, the second date_range, and the complex and erroneous slicing. The resample method is standard, using 'A' as a signifier for "annual" frequency and how='first' to just grab the first matching item.

            0
1990-12-31 -0.600904
1991-12-31 -1.083462
1992-12-31  0.469949
1993-12-31 -0.809852
1994-12-31 -0.165877
1995-12-31  1.460035
1996-12-31 -0.332960
1997-12-31 -0.140873
1998-12-31  1.088687
1999-12-31  0.190218

Check out the pandas documentation, look into 17.4.1 DatetimeIndex Partial String Indexing Here is the TimeSeries documentation mentioning resampling And finally, the API doc for the resample() method

OYRM
  • 1,395
  • 10
  • 29
  • Ah, doesn't answer what OP has asked for. – Zero May 23 '15 at 14:32
  • 1
    It surely does @John Galt (1) there is a better way to extract yearly data for a particular date ? Yes, with partal string indexing, described above and (2) the time series in the dataframe and date_range list are different. I would appreciate information on both problems. My method avoids both problems – OYRM May 23 '15 at 14:34
  • May be you're right, I'll leave it to OP =) on a side note, year index is a good find, I didn't know about it's existence, Thanks. – Zero May 23 '15 at 14:36
  • 1
    I updated with evidence to show that the index is properly sliced with partial string indexing, hopefully that'll make the case more clearly – OYRM May 23 '15 at 14:38
  • I'm not sure if my post was clear. I am trying to get data out of the DataFrame at an annual level on a particular date. I do not want the entire year's data. I want data for a particular day of the year, every year. – marathonman4202 May 23 '15 at 16:43
  • I did not understand that, I'll need to rework it. Hold tight – OYRM May 23 '15 at 20:47