I am looking to find the largest window within a list, given a condition, and return those specific records from my dataframe. I have a person, and then an activity that they have done, and I sort chronologically and find the difference in seconds between their activities. Then I want to find the largest consecutive section where every activity is under a given amount of seconds
I've created a function return max segment:
def return_max_segment(list_, max_seconds):
counter = 0
results = [0]
for x in list_:
if x <= max_seconds:
counter += 1
results.append(counter)
else:
counter = 0
if len(results) == 0:
final = 0
else:
final = max(results)
return final
Initialize the Data:
def rand_date(x, y):
return datetime.datetime(2018, 9, 20, 5, x, y)
random.seed(123)
dates = [rand_date(random.randint(0, 59), random.randint(0, 59)) for x in range(100)]
person = [random.choice(['005', '088']) for x in range(100)]
ids = range(100)
df = pd.DataFrame({'id':ids,
'person':person,
'activitydate':dates})
Look at head of df:
id person activitydate
0 0 088 2018-09-20 05:03:17
1 1 005 2018-09-20 05:05:49
2 2 005 2018-09-20 05:26:17
3 3 088 2018-09-20 05:06:53
4 4 005 2018-09-20 05:57:55
So far I have made it thus far and to groupby the dataframe to return the max segment. I'm getting stuck on how to then get the specific records that this max segment refers to because I need the record id
.
dfg = df.groupby('person')['activitydate'] \
.apply(lambda x: return_max_segment([y.seconds for y in x.sort_values().diff()[1:]], 50)) \
.reset_index(drop=False)
person activitydate
0 005 4
1 088 8
There is a segment of 4 consecutive records within 005
's records that are each within 50 seconds of one another. I'd like to return those 4 records. Same for 008