
I am looking to find the largest window within a list, given a condition, and return those specific records from my dataframe. I have a person and the activities they have done; I sort chronologically and find the difference in seconds between consecutive activities. Then I want to find the largest consecutive section where every gap is under a given number of seconds.

I've created a function, return_max_segment, to compute the length of that section:

def return_max_segment(list_, max_seconds):
    # list_ holds the gaps (in seconds) between consecutive activities.
    # Count the length of the current run of gaps <= max_seconds and
    # record every run length seen.
    counter = 0
    results = [0]
    for x in list_:
        if x <= max_seconds:
            counter += 1
            results.append(counter)
        else:
            counter = 0
    # results always contains the initial 0, so max() is safe even if no gap qualifies
    return max(results)
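
For example, with made-up gap values, three consecutive gaps at or under 50 seconds give a segment length of 3:

gaps = [10, 20, 70, 5, 5, 5]
print(return_max_segment(gaps, 50))  # 3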

Initialize the Data:

import datetime
import random

import pandas as pd

def rand_date(x, y):
    return datetime.datetime(2018, 9, 20, 5, x, y)

random.seed(123)
dates = [rand_date(random.randint(0, 59), random.randint(0, 59)) for x in range(100)]
person = [random.choice(['005', '088']) for x in range(100)]
ids = range(100)

df = pd.DataFrame({'id': ids,
                   'person': person,
                   'activitydate': dates})

Look at head of df:

   id person        activitydate
0   0    088 2018-09-20 05:03:17
1   1    005 2018-09-20 05:05:49
2   2    005 2018-09-20 05:26:17
3   3    088 2018-09-20 05:06:53
4   4    005 2018-09-20 05:57:55

So far I have managed to group the dataframe by person and return the length of the max segment. I'm getting stuck on how to then get the specific records that this max segment refers to, because I need the record id.

dfg = df.groupby('person')['activitydate'] \
         .apply(lambda x: return_max_segment([y.seconds for y in x.sort_values().diff()[1:]], 50)) \
         .reset_index(drop=False)

  person  activitydate
0    005             4
1    088             8

There is a segment of 4 consecutive records within 005's records that are each within 50 seconds of one another. I'd like to return those 4 records. Same for 088.
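
For reference, the gap computation inside the apply can also be written with .dt.total_seconds() (a sketch; it gives the same numbers as the .seconds attribute here because all gaps are under a day):

gaps_in_seconds = df.groupby('person')['activitydate'] \
                    .apply(lambda x: x.sort_values().diff().dt.total_seconds())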


1 Answer


Here's a way to do that in your case. First we'll do it for one person to illustrate the method.

gb = df.groupby('person')
group5 = gb.get_group('005')

# sort one person's records chronologically and compute the gap to the previous record
sorted_by_date = group5.sort_values('activitydate')
sorted_by_date['date_diff'] = sorted_by_date.activitydate.diff()

# label each run of short gaps: the label only increments at gaps >= 50 seconds
run_groups = sorted_by_date.groupby(
   (sorted_by_date.date_diff >= pd.Timedelta('50 seconds')).cumsum()
)

# pull out the run with the most records
run_groups.get_group(run_groups.id.count().idxmax())

which gives

    id person        activitydate date_diff
99  99    005 2018-09-20 05:40:10  00:01:30
68  68    005 2018-09-20 05:40:18  00:00:08
92  92    005 2018-09-20 05:40:25  00:00:07
82  82    005 2018-09-20 05:40:26  00:00:01
48  48    005 2018-09-20 05:41:11  00:00:45

The tricky part is sorted_by_date.groupby((sorted_by_date.date_diff >= pd.Timedelta('50 seconds')).cumsum()), which is the "compare-cumsum-groupby" pattern I found in this great answer. In words: it builds a boolean series where False means the gap to the previous activity is short, takes the cumulative sum of that series (so within a run of short gaps the sum doesn't change), and then groups by that sum. Finally we just find the group with the biggest count.
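
A tiny standalone sketch of that pattern on made-up gap values, just to show how the labels behave:

import pandas as pd

# hypothetical gaps in seconds between consecutive activities
gaps = pd.Series([5, 8, 120, 3, 4, 90, 2])

# True marks a "long" gap; the cumulative sum only increases at long gaps,
# so every run of short gaps shares one label
labels = (gaps >= 50).cumsum()        # 0 0 1 1 1 2 2

print(gaps.groupby(labels).size())    # run sizes: 2, 3, 2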

Now to apply this to all people at once:

def get_longest_short_activity_run_length(group):
    # same steps as above for one person's group; returns the records
    # making up that person's longest run of short gaps
    sorted_by_date = group.sort_values('activitydate')
    sorted_by_date['date_diff'] = sorted_by_date.activitydate.diff()
    run_groups = sorted_by_date.groupby(
       (sorted_by_date.date_diff >= pd.Timedelta('50 seconds')).cumsum()
    )
    return run_groups.get_group(run_groups.id.count().idxmax())

gb.apply(get_longest_short_activity_run_length)

which returns

           id person        activitydate date_diff
person                                            
005    99  99    005 2018-09-20 05:40:10  00:01:30
       68  68    005 2018-09-20 05:40:18  00:00:08
       92  92    005 2018-09-20 05:40:25  00:00:07
       82  82    005 2018-09-20 05:40:26  00:00:01
       48  48    005 2018-09-20 05:41:11  00:00:45
088    36  36    088 2018-09-20 05:00:25       NaT
       59  59    088 2018-09-20 05:01:12  00:00:47
       24  24    088 2018-09-20 05:01:18  00:00:06
       93  93    088 2018-09-20 05:01:20  00:00:02
       98  98    088 2018-09-20 05:01:40  00:00:20
       76  76    088 2018-09-20 05:02:00  00:00:20
       95  95    088 2018-09-20 05:02:48  00:00:48
       0    0    088 2018-09-20 05:03:17  00:00:29
       17  17    088 2018-09-20 05:04:00  00:00:43
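
If you'd rather have a flat frame without the extra person level in the index (person is already a column), one option (just a sketch, not part of the steps above) is to drop that index level afterwards:

result = gb.apply(get_longest_short_activity_run_length)
result = result.reset_index(level=0, drop=True)  # drop the added 'person' index level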