Python Pandas Dataframe - Groupby and Average based on Condition

Question

I have a dataframe that looks like this:

id  start       end         diff mindiff
1   2015-01-02  2015-07-01  180 57
2   2015-02-03  2015-05-12  98  56
3   2015-01-15  2015-01-20  5   5
4   2015-02-04  2015-04-15  70  55
5   2015-03-15  2015-05-01  47  46
6   2015-02-22  2015-03-01  7   7
7   2015-03-21  2015-04-12  22  22
8   2015-04-11  2015-06-15  65  50
9   2015-04-11  2015-05-01  20  20
10  2015-03-30  2015-04-01  2   2
11  2015-04-28  2015-06-15  48  33
12  2015-05-01  2015-06-01  31  31
13  2015-05-10  2015-06-09  30  30
14  2015-05-19  2015-07-01  43  42
15  2015-06-01  2015-06-06  5   5
16  2015-06-02  2015-06-29  27  27
17  2015-04-29  2015-05-21  22  22
18  2015-05-25  2015-07-01  37  36
19  2015-06-04  2015-06-26  22  22
20  2015-06-21  2015-07-01  10  10
21  2015-05-30  2015-06-06  7   7
22  2015-06-30  2015-07-01  1   1

The fields are id, start (date), end (date), diff(the number of days between start and end), mindiff(the min(diff and the last day x months from start).

x in this case is 1 (so one month "later than" the start date)

What I am trying to accomplish is to find the Average(mean) of mindiff, grouped by the year/month of 'end', but ONLY averaging for each group the records which have a 'start' year/month x (defined above) months back through the groupedby month. Example from the data set above, id 1 would only be averaged in year/month 2015/1 and 2015/1+x (2015/2).

Here is a table flagging each record and which month I would like to average in:

    Months                      
id  1   2   3   4   5   6   7
1   1   1                   
2       1   1               
3   1                       
4       1   1               
5           1   1           
6       1   1               
7           1   1           
8               1   1       
9               1   1       
10          1   1           
11              1   1       
12                  1   1   
13                  1   1   
14                  1   1   
15                      1   
16                      1   
17              1   1       
18                  1   1   
19                      1   
20                      1   1
21                  1   1   
22                      1   1

Here are the mindiffs and resulting AVG/month I am looking for:

    Months                      
id  1   2   3   4   5   6   7
1   57  57                  
2       56  56              
3   5                       
4       55  55              
5           46  46          
6       7   7               
7           22  22          
8               50  50      
9               20  20      
10          2   2           
11              33  33      
12                  31  31  
13                  30  30  
14                  42  42  
15                      5   
16                      27  
17              22  22      
18                  36  36  
19                      22  
20                      10  10
21                  7   7   
22                      1   1
AVG 31  43.8    31.3    27.9    30.1    21.1    5.5

Finally, here is the dataframe I am looking for as a result:

Month   Avg Diff Trailing x months
2015-01 31
2015-02 43.75
2015-03 31.33333333
2015-05 27.85714286
2015-05 30.11111111
2015-06 21.1
2015-07 5.5

I know this is possible with a loop, but my gut says that a GROUPBY is more pythonic and likely more efficient. But how do I get only the specific rolling mindiff values for months of 'start' to be averaged within the groupby of the 'end' year/month. Thanks for the help.

jezrael · Accepted Answer · 2015-11-02T19:48:54.807

First I created testing data with different years and start in last row is set to December. Then I convert start and end columns to periods - periodS and periodE cols.

I use function groupby by column month and count mean from column Avg:

g = df1.groupby('months')['Avg'].mean().reset_index()

import pandas as pd
import numpy as np
import io

temp=u"""id;start;end
1;2014-01-02;2014-07-01
2;2014-02-03;2014-05-12
3;2014-01-15;2014-01-20
4;2014-02-04;2014-04-15
5;2014-03-15;2014-05-01
6;2014-02-22;2014-03-01
7;2015-03-21;2015-04-12
8;2015-04-11;2015-06-15
9;2015-04-11;2015-05-01
10;2015-03-30;2015-04-01
11;2015-04-28;2015-06-15
12;2015-05-01;2015-06-01
13;2015-05-10;2015-06-09
14;2016-05-19;2016-07-01
15;2016-06-01;2016-06-06
16;2016-06-02;2016-06-29
17;2016-04-29;2016-05-21
18;2016-05-25;2016-07-01
19;2017-06-04;2017-06-26
20;2017-06-21;2017-07-01
21;2017-05-30;2017-06-06
22;2017-12-30;2018-02-01"""

df = pd.read_csv(io.StringIO(temp), sep=";", index_col=[0])
print df
def last_day_of_next_month(any_day):
    next_month = any_day.replace(day=28) + pd.Timedelta(days=36)  # this will never fail
    return next_month - pd.Timedelta(days=next_month.day)

df['mindiff'] = (pd.to_datetime(df['start']).apply(last_day_of_next_month) - pd.to_datetime(df['start'])).astype('timedelta64[D]')
df['diff'] = (pd.to_datetime(df['end']) - pd.to_datetime(df['start'])).astype('timedelta64[D]')
df['mindiff'] = df[['mindiff', 'diff']].apply(lambda x: min(x), axis=1)
#print df

#set day of start and end to periodindex
df['periodS'] =  pd.to_datetime(df['start']).dt.to_period('M')
df['periodE'] =  pd.to_datetime(df['end']).dt.to_period('M')

#if period end is higher as period start, add one month else NaN
df['period'] = np.where(df['periodE'] > df['periodS'],df['periodS'] + 1, np.nan)
#print df
#df from subset
df1 = df[['mindiff', 'periodS', 'period']]
#pivot data (from rows to columns)
df1 = df1.set_index('mindiff').stack().reset_index()
#rename columns names
df1.columns = ['Avg', 'tmp', 'months']
#groupby by column month and count mean from column Avg
g = df1.groupby('months')['Avg'].mean().reset_index()
print g
#     months        Avg
#0   2014-01  31.000000
#1   2014-02  43.750000
#2   2014-03  41.000000
#3   2014-04  46.000000
#4   2015-03  12.000000
#5   2015-04  25.400000
#6   2015-05  32.800000
#7   2015-06  30.500000
#8   2016-04  22.000000
#9   2016-05  33.333333
#10  2016-06  27.500000
#11  2017-05   7.000000
#12  2017-06  13.000000
#13  2017-07  10.000000
#14  2017-12  32.000000
#15  2018-01  32.000000

@ jezreal - Thanks. This looks promising. I will go through it and come back. Would it be more pythonic to use resample for the months? (As opposed to dt.month columns — clg4, Oct 19 '15 at 22:59
are you saying your way is an improvement over resample('M')? Also, this will not work if the dates span multiple years as your assumption to add 1 to month will blow up going from December to January. Any thoughts on that as well? THanks for the help. — clg4, Oct 31 '15 at 18:06
Awesome. Like the handling of the next month...You could also add 2 months and subtract a day? — clg4, Nov 07 '15 at 16:39
Yes, it is function for testing data, you think `next_month = any_day.replace(day=28) + pd.Timedelta(days=34)` ? I don't know exactly, what you mean add 2 months, sorry. I try to modify [this](http://stackoverflow.com/a/13565185/2901002), but it isn't tested. — jezrael, Nov 07 '15 at 17:12

Python Pandas Dataframe - Groupby and Average based on Condition

1 Answers1