
Consider the following data frame,

import pandas as pd
import numpy as np

np.random.seed(666)
dd=pd.DataFrame({'v1': np.random.choice(range(30), 20),
                 'v2': np.random.choice(pd.date_range(
                       '5/3/2016', periods=365, freq='D'),
                     20, replace=False)
                 })
dd=dd.sort_values('v2')

#    v1         v2
#5    4 2016-05-03
#11  14 2016-05-26
#19  12 2016-06-26
#15   8 2016-07-06
#7   27 2016-08-04
#4    9 2016-08-28
#17   5 2016-09-08
#13  16 2016-10-04
#14  14 2016-10-10
#18  18 2016-11-25
#3    6 2016-12-03
#8   19 2016-12-04
#12   1 2016-12-12
#10  28 2017-01-14
#1    2 2017-02-12
#0   12 2017-02-15
#9   28 2017-03-11
#6   29 2017-03-18
#16   7 2017-03-21
#2   13 2017-04-29

I want to create groups based on the following two conditions:

  1. cumulative sum of v1 <= 40
  2. Or time difference of v2 <= 61 days

In other words, a group is complete once it reaches either a v1 sum of 40 or a 2-month time span. So if 61 days go by but the sum of 40 is not reached, close the group anyway; if the sum of 40 is reached in, say, 1 day, again close the group.

In the end the flag would be,

dd['expected_flag']=[1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9]
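
(A quick way to eyeball the rule on this toy example is the same per-group aggregation I use in EDIT 3 below; each expected group stays within both limits, and the next row would break one of them.)

dd.groupby('expected_flag', as_index=False).agg({'v1': 'sum', 'v2': lambda x: max(x) - min(x)})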

I have asked a very similar question in R here, but there is a new requirement now (the date) that I can't quite get my head around.

NOTE: I will be running this on huge data sets, so the more efficient the better.

EDIT: I found this question, which basically takes care of the first condition but not the date condition.

EDIT 2: The 61-day time difference is merely to indicate the time constraint. In reality that constraint will be in minutes.

EDIT 3: Using the function provided by @Maarten, I get the following (first 40 rows), where group 1 should also include the first 2 rows from group 2 (i.e. the rows where v1=6 and v1=6).

Out[330]: 
    index                  v2  v1  max_limit       group
0       2 2017-04-01 00:00:02  14      335.0        1
1       3 2017-04-01 00:00:03   8      335.0        1
2      13 2017-04-01 00:00:13  11      335.0        1
3      14 2017-04-01 00:00:14  11      335.0        1
4      29 2017-04-01 00:00:29   4      335.0        1
5      44 2017-04-01 00:00:44  16      335.0        1
6      52 2017-04-01 00:00:52  10      335.0        1
7      58 2017-04-01 00:00:58  11      335.0        1
8      65 2017-04-01 00:01:05  15      335.0        1
9      68 2017-04-01 00:01:08   8      335.0        1
10     81 2017-04-01 00:01:21  12      335.0        1
11     98 2017-04-01 00:01:38   9      335.0        1
12    102 2017-04-01 00:01:42   7      335.0        1
13    107 2017-04-01 00:01:47  12      335.0        1
14    113 2017-04-01 00:01:53   6      335.0        1
15    116 2017-04-01 00:01:56   6      335.0        1
16    121 2017-04-01 00:02:01   4      335.0        1
17    128 2017-04-01 00:02:08  16      335.0        1
18    143 2017-04-01 00:02:23   7      335.0        1
19    149 2017-04-01 00:02:29  11      335.0        1
20    163 2017-04-01 00:02:43   4      335.0        1
21    185 2017-04-01 00:03:05   9      335.0        1
22    239 2017-04-01 00:03:59   6      335.0        1
23    242 2017-04-01 00:04:02  13      335.0        1
24    272 2017-04-01 00:04:32   4      335.0        1
25    293 2017-04-01 00:04:53   8      335.0        1
26    301 2017-04-01 00:05:01  10      335.0        1
27    302 2017-04-01 00:05:02   7      335.0        1
28    305 2017-04-01 00:05:05  12      335.0        1
29    323 2017-04-01 00:05:23   5      335.0        1
30    326 2017-04-01 00:05:26  13      335.0        1
31    329 2017-04-01 00:05:29  10      335.0        1
32    365 2017-04-01 00:06:05  10      335.0        1
33    368 2017-04-01 00:06:08  11      335.0        1
34    411 2017-04-01 00:06:51   6      335.0        2
35    439 2017-04-01 00:07:19   6      335.0        2
36    440 2017-04-01 00:07:20   8      335.0        2
37    466 2017-04-01 00:07:46   7      335.0        2
38    475 2017-04-01 00:07:55   4      335.0        2
39    489 2017-04-01 00:08:09   4      335.0        2 

So to make it clear, when I sum v1 and calculate the time difference per group I get:

dd.groupby('group', as_index=False).agg({'v1': 'sum', 'v2': lambda x: max(x)-min(x)})
Out[332]: 
#      group   v1       v2
#0         1  320 00:06:06
#1         2  326 00:07:34
#2         3  330 00:06:53
#...
Sotos

2 Answers


Setup:

dd['days'] = dd['v2'].diff().dt.days.fillna(0).astype(int)
dd = dd[['v1', 'v2', 'days']]  # the order of the columns matters

Initialize:

increment = pd.Series(False, index=dd.index)
v1_cum = 0
days_cum = 0

Loop:

for row in dd.itertuples(name=None):  # faster than iterrows
    v1_cum += row[1]
    days_cum += row[3]
    if v1_cum > 40 or days_cum > 61:
        increment[row[0]] = True  # first element of tuple is index
        # notice the different re-initialization
        v1_cum = row[1]
        days_cum = 0

Assign:

dd['flag'] = increment.cumsum() + 1

Output:

[1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9]
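
Since you mention huge data sets and, later, configurable limits, the same loop can be wrapped in a function with the two thresholds as parameters. This is only a sketch of the steps above (the name assign_flags and its arguments are illustrative):

def assign_flags(df, v1_limit=40, day_limit=61):
    # same logic as above: start a new group once either running total passes its limit
    days = df['v2'].diff().dt.days.fillna(0).astype(int)
    increment = pd.Series(False, index=df.index)
    v1_cum = days_cum = 0
    for idx, v1, d in zip(df.index, df['v1'], days):
        v1_cum += v1
        days_cum += d
        if v1_cum > v1_limit or days_cum > day_limit:
            increment[idx] = True
            v1_cum = v1   # the current row starts the new group
            days_cum = 0
    return increment.cumsum() + 1

dd['flag'] = assign_flags(dd)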
IanS

Slightly different approach from @IanS. I don't know which will be faster. This one actually calculates the difference in months

def diff_in_months(date1, date2):
    import itertools
    import operator
    x, y = max(date1, date2), min(date1, date2)
    coefficients = 12, 100, 24, 100, 100, 1
    coefficients = list(reversed([i for i in itertools.accumulate(reversed(coefficients), operator.mul)]))

    return (sum(i * j for i, j in zip(coefficients, x.timetuple()))
            - sum(i * j for i, j in zip(coefficients, y.timetuple()))) // coefficients[1]

This can be sped up slightly by calculating the coefficients only once (e.g. as a module-level constant) instead of on every call of the function.
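
A sketch of that speed-up, with the coefficient list hoisted to a module-level constant (the name _COEFFICIENTS is just illustrative; the maths is unchanged):

import itertools
import operator

# cumulative products of (12, 100, 24, 100, 100, 1), computed once at import time
_COEFFICIENTS = list(reversed(list(
    itertools.accumulate(reversed((12, 100, 24, 100, 100, 1)), operator.mul))))

def diff_in_months(date1, date2):
    x, y = max(date1, date2), min(date1, date2)
    return (sum(i * j for i, j in zip(_COEFFICIENTS, x.timetuple()))
            - sum(i * j for i, j in zip(_COEFFICIENTS, y.timetuple()))) // _COEFFICIENTS[1]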

def my_grouping(df):
    i = 1
    v1 = 0
    v2 = df['v2'].iloc[0]
    for row in df.itertuples():
#         print(row)
        if diff_in_months(v2, row.v2) >= 2 or (v1 + row.v1 >= 41):
            i += 1
            v1 = row.v1
            v2 = row.v2
        else:
            v1 += row.v1
        yield i

flag_series = pd.Series(my_grouping(dd), index=dd.index)
dd.assign(flag=flag_series, expected_flag=[1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9])
    v1  v2  expected_flag   flag
5   4   2016-05-03  1   1
11  14  2016-05-26  1   1
19  12  2016-06-26  1   1
15  8   2016-07-06  2   2
7   27  2016-08-04  2   2
4   9   2016-08-28  3   3
17  5   2016-09-08  3   3
13  16  2016-10-04  3   3
14  14  2016-10-10  4   4
18  18  2016-11-25  4   4
3   6   2016-12-03  4   4
8   19  2016-12-04  5   5
12  1   2016-12-12  5   5
10  28  2017-01-14  6   6
1   2   2017-02-12  6   6
0   12  2017-02-15  7   7
9   28  2017-03-11  7   7
6   29  2017-03-18  8   8
16  7   2017-03-21  8   8
2   13  2017-04-29  9   9

Arbitrary interval

def my_grouping_arbitrary_interval(df, diff_v1 = 41, interval = pd.Timedelta(61, 'D')):
    i = 1
    v1 = 0
    v2 = df['v2'].iloc[0]
    for row in df.itertuples():
#         print(row)
        if max(v2, row.v2) - min(v2, row.v2) >= interval or (v1 + row.v1 >= diff_v1):
            i += 1
            v1 = row.v1
            v2 = row.v2
        else:
            v1 += row.v1
        yield i

The problem with this is that pd.Timedelta only accepts units such as D, h, m, s, ms, us or ns as input, so no months or years. For those you will have to adapt my diff_in_months.
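
A usage sketch with a minutes-based window, along the lines of the f(df, count=40, time=10mins) call discussed in the comments below (since this is a generator, it is wrapped in pd.Series to get one flag per row):

# diff_v1=41 closes a group once the v1 sum would exceed 40
flag_series = pd.Series(
    my_grouping_arbitrary_interval(dd, diff_v1=41, interval=pd.Timedelta(10, 'm')),
    index=dd.index)
dd = dd.assign(flag=flag_series)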

Maarten Fabré
  • Thank you @Maarten. Can this be scaled for, say, seconds instead of months? In the end a function with arguments `data frame`, `sum of v1` and `time interval` would be ideal...something like `f(df, count = 40, time = 10mins)` – Sotos Sep 08 '17 at 09:28
  • Yes!!! That would be the function. However, it does not return anything; it just gives `Out[290]: `. Sorry if I sound a bit retarded. I'm very new to Python (I come from an R background) – Sotos Sep 08 '17 at 09:44
  • that is because it is a generator. It does not return a dump of data in one shot, but one value at a time. You need to use it slightly differently: `flag_series = pd.Series(my_grouping_arbitrary_interval(dd), index=dd.index)` – Maarten Fabré Sep 08 '17 at 09:47
  • Seems to be working. Thanks! I'll run it on the entire dataset and double check before I accept your answer. Thanks again – Sotos Sep 08 '17 at 11:37
  • I'm trying to make the example that failed reproducible and edit my question. I basically specified a v1 limit of 335 and a 15-minute window, and the first group stopped at 320 (the next value was 6) and at 6 minutes (the next row was a few seconds later) – Sotos Sep 08 '17 at 12:15
  • I updated my question with part of the real data set where it fails. – Sotos Sep 08 '17 at 12:26
  • it should be `v1 = 0` – Maarten Fabré Sep 08 '17 at 14:57
  • Yup. That did it. Thank you @Maarten – Sotos Sep 11 '17 at 06:26