I have a dataframe that looks like this:
id start end diff mindiff
1 2015-01-02 2015-07-01 180 57
2 2015-02-03 2015-05-12 98 56
3 2015-01-15 2015-01-20 5 5
4 2015-02-04 2015-04-15 70 55
5 2015-03-15 2015-05-01 47 46
6 2015-02-22 2015-03-01 7 7
7 2015-03-21 2015-04-12 22 22
8 2015-04-11 2015-06-15 65 50
9 2015-04-11 2015-05-01 20 20
10 2015-03-30 2015-04-01 2 2
11 2015-04-28 2015-06-15 48 33
12 2015-05-01 2015-06-01 31 31
13 2015-05-10 2015-06-09 30 30
14 2015-05-19 2015-07-01 43 42
15 2015-06-01 2015-06-06 5 5
16 2015-06-02 2015-06-29 27 27
17 2015-04-29 2015-05-21 22 22
18 2015-05-25 2015-07-01 37 36
19 2015-06-04 2015-06-26 22 22
20 2015-06-21 2015-07-01 10 10
21 2015-05-30 2015-06-06 7 7
22 2015-06-30 2015-07-01 1 1
The fields are id, start (date), end (date), diff(the number of days between start and end), mindiff(the min(diff and the last day x months from start).
x in this case is 1 (so one month "later than" the start date)
What I am trying to accomplish is to find the Average(mean) of mindiff, grouped by the year/month of 'end', but ONLY averaging for each group the records which have a 'start' year/month x (defined above) months back through the groupedby
month. Example from the data set above, id 1 would only be averaged in year/month 2015/1 and 2015/1+x (2015/2).
Here is a table flagging each record and which month I would like to average in:
Months
id 1 2 3 4 5 6 7
1 1 1
2 1 1
3 1
4 1 1
5 1 1
6 1 1
7 1 1
8 1 1
9 1 1
10 1 1
11 1 1
12 1 1
13 1 1
14 1 1
15 1
16 1
17 1 1
18 1 1
19 1
20 1 1
21 1 1
22 1 1
Here are the mindiffs and resulting AVG/month I am looking for:
Months
id 1 2 3 4 5 6 7
1 57 57
2 56 56
3 5
4 55 55
5 46 46
6 7 7
7 22 22
8 50 50
9 20 20
10 2 2
11 33 33
12 31 31
13 30 30
14 42 42
15 5
16 27
17 22 22
18 36 36
19 22
20 10 10
21 7 7
22 1 1
AVG 31 43.8 31.3 27.9 30.1 21.1 5.5
Finally, here is the dataframe I am looking for as a result:
Month Avg Diff Trailing x months
2015-01 31
2015-02 43.75
2015-03 31.33333333
2015-05 27.85714286
2015-05 30.11111111
2015-06 21.1
2015-07 5.5
I know this is possible with a loop, but my gut says that a GROUPBY is more pythonic and likely more efficient. But how do I get only the specific rolling mindiff values for months of 'start' to be averaged within the groupby of the 'end' year/month. Thanks for the help.