
Consider the following data frame,

import pandas as pd
import numpy as np

np.random.seed(666)
dd=pd.DataFrame({'v1': np.random.choice(range(30), 20),
                 'v2': np.random.choice(pd.date_range(
                       '5/3/2016', periods=365, freq='D'),
                     20, replace=False)
                 })
dd=dd.sort_values('v2')

#    v1         v2
#5    4 2016-05-03
#11  14 2016-05-26
#19  12 2016-06-26
#15   8 2016-07-06
#7   27 2016-08-04
#4    9 2016-08-28
#17   5 2016-09-08
#13  16 2016-10-04
#14  14 2016-10-10
#18  18 2016-11-25
#3    6 2016-12-03
#8   19 2016-12-04
#12   1 2016-12-12
#10  28 2017-01-14
#1    2 2017-02-12
#0   12 2017-02-15
#9   28 2017-03-11
#6   29 2017-03-18
#16   7 2017-03-21
#2   13 2017-04-29

I want to create groups based on the following two conditions:

  1. cumulative sum of v1 <= 40
  2. Or time difference of v2 <= 61 days

In other words, a group is complete once it reaches either a v1 sum of 40 or a 2-month time span. So if 61 days go by but the sum of 40 is not reached, close the group anyway; if the sum of 40 is reached in, say, 1 day, again close the group.

In the end the flag would be,

dd['expected_flag']=[1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9]
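
(A quick way to eyeball the rule on this toy example is the same per-group aggregation I use in EDIT 3 below; each expected group stays within both limits, and the next row would break one of them.)

dd.groupby('expected_flag', as_index=False).agg({'v1': 'sum', 'v2': lambda x: max(x) - min(x)})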

I have asked a very similar question in R here, but there is a new requirement now (the date) that I can't quite get my head around.

NOTE: I will be running this on huge data sets, so the more efficient the better.

EDIT: I found this question, which basically takes care of the first condition but not the date condition.

EDIT 2: The 61-day time difference is merely to indicate the time constraint. In reality that constraint will be in minutes.

EDIT 3: Using the function provided by @Maarten, I get the following (first 40 rows), where group 1 should also include the first 2 rows from group 2 (i.e. the rows where v1=6 and v1=6).

Out[330]: 
    index                  v2  v1  max_limit       group
0       2 2017-04-01 00:00:02  14      335.0        1
1       3 2017-04-01 00:00:03   8      335.0        1
2      13 2017-04-01 00:00:13  11      335.0        1
3      14 2017-04-01 00:00:14  11      335.0        1
4      29 2017-04-01 00:00:29   4      335.0        1
5      44 2017-04-01 00:00:44  16      335.0        1
6      52 2017-04-01 00:00:52  10      335.0        1
7      58 2017-04-01 00:00:58  11      335.0        1
8      65 2017-04-01 00:01:05  15      335.0        1
9      68 2017-04-01 00:01:08   8      335.0        1
10     81 2017-04-01 00:01:21  12      335.0        1
11     98 2017-04-01 00:01:38   9      335.0        1
12    102 2017-04-01 00:01:42   7      335.0        1
13    107 2017-04-01 00:01:47  12      335.0        1
14    113 2017-04-01 00:01:53   6      335.0        1
15    116 2017-04-01 00:01:56   6      335.0        1
16    121 2017-04-01 00:02:01   4      335.0        1
17    128 2017-04-01 00:02:08  16      335.0        1
18    143 2017-04-01 00:02:23   7      335.0        1
19    149 2017-04-01 00:02:29  11      335.0        1
20    163 2017-04-01 00:02:43   4      335.0        1
21    185 2017-04-01 00:03:05   9      335.0        1
22    239 2017-04-01 00:03:59   6      335.0        1
23    242 2017-04-01 00:04:02  13      335.0        1
24    272 2017-04-01 00:04:32   4      335.0        1
25    293 2017-04-01 00:04:53   8      335.0        1
26    301 2017-04-01 00:05:01  10      335.0        1
27    302 2017-04-01 00:05:02   7      335.0        1
28    305 2017-04-01 00:05:05  12      335.0        1
29    323 2017-04-01 00:05:23   5      335.0        1
30    326 2017-04-01 00:05:26  13      335.0        1
31    329 2017-04-01 00:05:29  10      335.0        1
32    365 2017-04-01 00:06:05  10      335.0        1
33    368 2017-04-01 00:06:08  11      335.0        1
34    411 2017-04-01 00:06:51   6      335.0        2
35    439 2017-04-01 00:07:19   6      335.0        2
36    440 2017-04-01 00:07:20   8      335.0        2
37    466 2017-04-01 00:07:46   7      335.0        2
38    475 2017-04-01 00:07:55   4      335.0        2
39    489 2017-04-01 00:08:09   4      335.0        2 

So to make it clear, when I sum v1 and calculate the time difference per group I get:

dd.groupby('group', as_index=False).agg({'v1': 'sum', 'v2': lambda x: max(x)-min(x)})
Out[332]: 
#      group   v1       v2
#0         1  320 00:06:06
#1         2  326 00:07:34
#2         3  330 00:06:53
#...
Sotos

2 Answers


Setup:

dd['days'] = dd['v2'].diff().dt.days.fillna(0).astype(int)
dd = dd[['v1', 'v2', 'days']]  # the order of the columns matters

Initialize:

increment = pd.Series(False, index=dd.index)
v1_cum = 0
days_cum = 0

Loop:

for row in dd.itertuples(name=None):  # faster than iterrows
    v1_cum += row[1]
    days_cum += row[3]
    if v1_cum > 40 or days_cum > 61:
        increment[row[0]] = True  # first element of tuple is index
        # notice the different re-initialization
        v1_cum = row[1]
        days_cum = 0

Assign:

dd['flag'] = increment.cumsum() + 1

Output:

[1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9]
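
Since you mention huge data sets and, later, configurable limits, the same loop can be wrapped in a function with the two thresholds as parameters. This is only a sketch of the steps above (the name assign_flags and its arguments are illustrative):

def assign_flags(df, v1_limit=40, day_limit=61):
    # same logic as above: start a new group once either running total passes its limit
    days = df['v2'].diff().dt.days.fillna(0).astype(int)
    increment = pd.Series(False, index=df.index)
    v1_cum = days_cum = 0
    for idx, v1, d in zip(df.index, df['v1'], days):
        v1_cum += v1
        days_cum += d
        if v1_cum > v1_limit or days_cum > day_limit:
            increment[idx] = True
            v1_cum = v1   # the current row starts the new group
            days_cum = 0
    return increment.cumsum() + 1

dd['flag'] = assign_flags(dd)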
IanS

Slightly different approach from @IanS. I don't know which will be faster. This one actually calculates the difference in months

def diff_in_months(date1, date2):
    import itertools
    import operator
    x, y = max(date1, date2), min(date1, date2)
    coefficients = 12, 100, 24, 100, 100, 1
    coefficients = list(reversed([i for i in itertools.accumulate(reversed(coefficients), operator.mul)]))

    return (sum(i * j for i, j in zip(coefficients, x.timetuple()))
            - sum(i * j for i, j in zip(coefficients, y.timetuple()))) // coefficients[1]

This can be sped up slightly by calculating the coefficients only once (e.g. as a module-level constant) instead of on every call of the function.
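
A sketch of that speed-up, with the coefficient list hoisted to a module-level constant (the name _COEFFICIENTS is just illustrative; the maths is unchanged):

import itertools
import operator

# cumulative products of (12, 100, 24, 100, 100, 1), computed once at import time
_COEFFICIENTS = list(reversed(list(
    itertools.accumulate(reversed((12, 100, 24, 100, 100, 1)), operator.mul))))

def diff_in_months(date1, date2):
    x, y = max(date1, date2), min(date1, date2)
    return (sum(i * j for i, j in zip(_COEFFICIENTS, x.timetuple()))
            - sum(i * j for i, j in zip(_COEFFICIENTS, y.timetuple()))) // _COEFFICIENTS[1]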

def my_grouping(df):
    i = 1
    v1 = 0
    v2 = df['v2'].iloc[0]
    for row in df.itertuples():
#         print(row)
        if diff_in_months(v2, row.v2) >= 2 or (v1 + row.v1 >= 41):
            i += 1
            v1 = row.v1
            v2 = row.v2
        else:
            v1 += row.v1
        yield i

flag_series = pd.Series(my_grouping(dd), index=dd.index)
dd.assign(flag=flag_series, expected_flag=[1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9])
    v1  v2  expected_flag   flag
5   4   2016-05-03  1   1
11  14  2016-05-26  1   1
19  12  2016-06-26  1   1
15  8   2016-07-06  2   2
7   27  2016-08-04  2   2
4   9   2016-08-28  3   3
17  5   2016-09-08  3   3
13  16  2016-10-04  3   3
14  14  2016-10-10  4   4
18  18  2016-11-25  4   4
3   6   2016-12-03  4   4
8   19  2016-12-04  5   5
12  1   2016-12-12  5   5
10  28  2017-01-14  6   6
1   2   2017-02-12  6   6
0   12  2017-02-15  7   7
9   28  2017-03-11  7   7
6   29  2017-03-18  8   8
16  7   2017-03-21  8   8
2   13  2017-04-29  9   9

Arbitrary interval

def my_grouping_arbitrary_interval(df, diff_v1 = 41, interval = pd.Timedelta(61, 'D')):
    i = 1
    v1 = 0
    v2 = df['v2'].iloc[0]
    for row in df.itertuples():
#         print(row)
        if max(v2, row.v2) - min(v2, row.v2) >= interval or (v1 + row.v1 >= diff_v1):
            i += 1
            v1 = row.v1
            v2 = row.v2
        else:
            v1 += row.v1
        yield i

The problem with this is that pd.Timedelta only accepts units such as D, h, m, s, ms, us or ns as input, so no months or years. For those you will have to adapt my diff_in_months.
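
A usage sketch with a minutes-based window, along the lines of the f(df, count=40, time=10mins) call discussed in the comments below (since this is a generator, it is wrapped in pd.Series to get one flag per row):

# diff_v1=41 closes a group once the v1 sum would exceed 40
flag_series = pd.Series(
    my_grouping_arbitrary_interval(dd, diff_v1=41, interval=pd.Timedelta(10, 'm')),
    index=dd.index)
dd = dd.assign(flag=flag_series)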

Maarten Fabré
  • Thank you @Maarten. Can this be scaled for, say, seconds instead of months? In the end a function with arguments `data frame`, `sum of v1` and `time interval` would be ideal...something like `f(df, count = 40, time = 10mins)` – Sotos Sep 08 '17 at 09:28
  • Yes!!! That would be the function. However, it does not return anything; it just gives `Out[290]: `. Sorry if I sound a bit retarded. I'm very new to Python (I come from an R background) – Sotos Sep 08 '17 at 09:44
  • that is because it is a generator. It does not return a dump of data in one shot, but one value at a time. You need to use it slightly differently: `flag_series = pd.Series(my_grouping_arbitrary_interval(dd), index=dd.index)` – Maarten Fabré Sep 08 '17 at 09:47
  • Seems to be working. Thanks! I'll run it on the entire dataset and double check before I accept your answer. Thanks again – Sotos Sep 08 '17 at 11:37
  • I'm trying to make the example that failed reproducible and edit my question. I basically specified a v1 limit of 335 and a 15-minute window, and the first group stopped at 320 (the next value was 6) and at 6 minutes (the next row was a few seconds later) – Sotos Sep 08 '17 at 12:15
  • I updated my question with part of the real data set where it fails. – Sotos Sep 08 '17 at 12:26
  • it should be `v1 = 0` – Maarten Fabré Sep 08 '17 at 14:57
  • Yup. That did it. Thank you @Maarten – Sotos Sep 11 '17 at 06:26