I need to pick one value per 30 day period for the entire dataframe. For instance if I have the following dataframe:
df:
Date Value
0 2015-09-25 e
1 2015-11-11 b
2 2015-11-24 c
3 2015-12-02 d
4 2015-12-14 a
5 2016-02-01 b
6 2016-03-23 c
7 2016-05-02 d
8 2016-05-25 a
9 2016-06-15 a
10 2016-06-28 a
I need to pick the first entry and then filter out any entry within the next 30 days of that entry and then proceed along the dataframe. For instance, indexes, 0 and 1 should stay since they are at least 30 days apart, but 2 and 3 are within 30 days of 1 so they should be removed. This should continue chronologically until we have 1 entry per 30 day period:
Date Value
0 2015-09-25 e
1 2015-11-11 b
4 2015-12-14 a
5 2016-02-01 b
6 2016-03-23 c
7 2016-05-02 d
9 2016-06-15 a
The end result should have only 1 entry per 30 day period. Any advice or assistance would be greatly appreciated!
I have tried df.groupby(pd.Grouper(freq='M')).first()
but that picks the first entry in each month rather than each entry that is at least 30 days from the previous entry.
I came up with a simple iterative solution which uses the fact that the DF is sorted, but its fairly slow:
index = df.index.values
dates = df['Date'].tolist()
index_to_keep = []
curr_date = None
for i in range(len(dates)):
if not curr_date or (dates[i] - curr_date).days > 30:
index_to_keep.append(index[i])
curr_date = dates[i]
df_out = df.loc[index_to_keep, :]
return df_out
Any ideas on how to speed it up?