
I need to pick one value per 30-day period for the entire dataframe. For instance, if I have the following dataframe:

df:

   Date        Value
0  2015-09-25   e
1  2015-11-11   b
2  2015-11-24   c
3  2015-12-02   d
4  2015-12-14   a
5  2016-02-01   b
6  2016-03-23   c 
7  2016-05-02   d 
8  2016-05-25   a  
9  2016-06-15   a  
10  2016-06-28  a  

I need to pick the first entry, filter out any entry within the next 30 days of it, and then proceed along the dataframe. For instance, indexes 0 and 1 should stay since they are at least 30 days apart, but 2 and 3 are within 30 days of 1, so they should be removed. This should continue chronologically until we have one entry per 30-day period:

   Date        Value
0  2015-09-25   e
1  2015-11-11   b
4  2015-12-14   a
5  2016-02-01   b
6  2016-03-23   c 
7  2016-05-02   d  
9  2016-06-15   a  

The end result should have only one entry per 30-day period. Any advice or assistance would be greatly appreciated!

I have tried df.groupby(pd.Grouper(freq='M')).first(), but that picks the first entry in each calendar month rather than each entry that is at least 30 days from the previous kept entry.

I came up with a simple iterative solution which uses the fact that the DF is sorted by date, but it's fairly slow:

index = df.index.values
dates = df['Date'].tolist()  # 'Date' is already datetime64, so subtraction gives Timedeltas
index_to_keep = []

curr_date = None
for i in range(len(dates)):
    if curr_date is None or (dates[i] - curr_date).days > 30:
        index_to_keep.append(index[i])
        curr_date = dates[i]

df_out = df.loc[index_to_keep, :]
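For what it's worth, the same greedy pass runs noticeably faster if the per-row Timestamp arithmetic is replaced with raw int64 day counts; the loop itself is then plain-number work that a JIT such as numba's `@njit` could compile for the multi-million-row case. A minimal self-contained sketch (the function name `filter_min_gap` is mine, not from the original code):

```python
import numpy as np
import pandas as pd

def filter_min_gap(df, min_days=30):
    # Greedy pass over the sorted dates, but on raw int64 day counts
    # (days since epoch) instead of per-row Timestamp subtraction.
    days = df['Date'].values.astype('datetime64[D]').astype(np.int64)
    keep = np.zeros(len(days), dtype=bool)
    last = -10**9  # sentinel far in the past so the first row is always kept
    for i in range(len(days)):
        if days[i] - last > min_days:
            keep[i] = True
            last = days[i]
    return df[keep]

df = pd.DataFrame({
    'Date': pd.to_datetime(['2015-09-25', '2015-11-11', '2015-11-24',
                            '2015-12-02', '2015-12-14', '2016-02-01',
                            '2016-03-23', '2016-05-02', '2016-05-25',
                            '2016-06-15', '2016-06-28']),
    'Value': list('ebcdabcdaaa'),
})
print(filter_min_gap(df).index.tolist())  # → [0, 1, 4, 5, 6, 7, 9]
```

On the example data this keeps exactly the rows from the desired output above; the loop is still O(n), so the only gains here are the cheaper arithmetic and the option to JIT-compile it.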

Any ideas on how to speed it up?

  • What have you tried so far, and what went wrong with your attempts? Please [edit] your question to include a [mcve] with _code_ so that we can better understand how to help – G. Anderson Feb 23 '21 at 21:39
  • @G.Anderson, thanks for the comment. I guess my issue is that I don't really even know where to start. I know that I could probably do it with looping, but that would be very slow given that this dataframe has 3 million+ rows. – Mac Jacobson Feb 23 '21 at 21:58
  • This is basically a cumsum() with a conditional reset every 30 days. [And the experts (I would really trust these three users) say there is no vectorized solution.](https://stackoverflow.com/q/56904390/8881141) – Mr. T Mar 01 '21 at 12:18

1 Answer


I think this should be what you are looking for.

You need to convert your Date column to a datetime type so it is not interpreted as a string.

Here is what that looks like:

df = pd.DataFrame({'Date': ['2015-09-25', '2015-11-11','2015-11-24', '2015-12-02','2015-12-14'],
                   'Value' : ['e', 'b', 'c','d','a']})
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')

df = df.groupby(pd.Grouper(freq='30D')).nth(0)

And here is the result:

           Value
Date            
2015-09-25     e
2015-10-25     b
2015-11-24     c
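One caveat worth spelling out (my own observation; I use `.first()` here rather than `.nth(0)`, since recent pandas versions changed `nth` to return rows with their original index): the result's index holds the fixed 30-day bin labels, anchored to the first date in the frame, rather than dates actually present in the data:

```python
import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(['2015-09-25', '2015-11-11',
                                           '2015-11-24', '2015-12-02',
                                           '2015-12-14']),
                   'Value': ['e', 'b', 'c', 'd', 'a']}).set_index('Date')

# Bins start at the first date and step in fixed 30-day increments,
# so the label 2015-10-25 appears even though no such row exists in df.
out = df.groupby(pd.Grouper(freq='30D')).first()
print(out.index.tolist())
```

That is why the second row above is labelled 2015-10-25: it is the start of the second 30-day bin, and the value shown is the first row falling inside that bin.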
Murcielago
  • Unfortunately, I don't think this is what I need, or at least not completely. I need to keep the entries that are already in the dataframe. It seems like your solution took the first date and added 30 days to it over and over. For instance, the second entry in your result is ```2015-10-25```; however, that entry does not exist in the original dataframe. Also, as you can see from your "Value" column, the third entry in your result is different from the third entry in the desired result. – Mac Jacobson Feb 23 '21 at 23:54
  • You are right, Grouper is not the solution here. I don't know what's built into pandas that could achieve what you want. – Murcielago Feb 24 '21 at 08:38