
I have a table like the following, but with approximately 7 million rows. What I am trying to find out is how many cases each user is working on simultaneously. I would like to group by Username and then get an average count of how many references are open concurrently between the start and stop times.

Reference  starttime                stoptime                 Username
1          2020-07-28 06:41:56.000  2020-07-28 07:11:25.000  Arthur
2          2020-07-18 13:24:02.000  2020-07-18 13:38:42.000  Arthur
3          2020-07-03 09:27:03.000  2020-07-03 10:35:24.000  Arthur
4          2020-07-05 19:42:38.000  2020-07-05 20:07:52.000  Bob
5          2020-07-04 10:22:48.000  2020-07-04 10:24:32.000  Bob

Any ideas?

rdr101
  • What part are you having trouble with? – wwii Feb 18 '21 at 15:07
  • Does [Efficient date range overlap calculation in python?](https://stackoverflow.com/questions/9044084/efficient-date-range-overlap-calculation-in-python) answer your question? Or [Determine Whether Two Date Ranges Overlap](https://stackoverflow.com/questions/325933/determine-whether-two-date-ranges-overlap), [Algorithm to detect overlapping periods](https://stackoverflow.com/questions/13513932/algorithm-to-detect-overlapping-periods), [How to find rows with overlapping date ranges?](https://stackoverflow.com/questions/44288858/how-to-find-rows-with-overlapping-date-ranges) – wwii Feb 18 '21 at 15:15
  • More Pandas: [Efficiently finding overlap between many date ranges](https://stackoverflow.com/questions/35266424/efficiently-finding-overlap-between-many-date-ranges), – wwii Feb 18 '21 at 15:22

2 Answers


Someone asked a similar question just yesterday so here it is:

ends = df['starttime'].values < df['stoptime'].values[:, None]
starts = df['starttime'].values > df['starttime'].values[:, None]
same_name = (df['Username'].values == df['Username'].values[:, None])

# a case j overlaps case i if it starts after i starts but before i stops,
# and belongs to the same user; count the matches by summing across axis=1
df['overlap'] = (ends & starts & same_name).sum(1)

df

To answer your final question, you would then take the mean:

df['overlap'].mean()
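Putting the snippet together with the sample data from the question gives a quick end-to-end check. On the five sample rows no cases overlap, so every count comes out zero; note also that the pairwise comparison builds an n×n boolean matrix, so it will not fit in memory for the full 7 million rows:

```python
import pandas as pd

# Sample rows from the question
df = pd.DataFrame({
    'Reference': [1, 2, 3, 4, 5],
    'starttime': pd.to_datetime([
        '2020-07-28 06:41:56', '2020-07-18 13:24:02', '2020-07-03 09:27:03',
        '2020-07-05 19:42:38', '2020-07-04 10:22:48']),
    'stoptime': pd.to_datetime([
        '2020-07-28 07:11:25', '2020-07-18 13:38:42', '2020-07-03 10:35:24',
        '2020-07-05 20:07:52', '2020-07-04 10:24:32']),
    'Username': ['Arthur', 'Arthur', 'Arthur', 'Bob', 'Bob'],
})

# Case j overlaps case i if it starts while i is open, for the same user
ends = df['starttime'].values < df['stoptime'].values[:, None]
starts = df['starttime'].values > df['starttime'].values[:, None]
same_name = df['Username'].values == df['Username'].values[:, None]
df['overlap'] = (ends & starts & same_name).sum(1)

print(df['overlap'].tolist())  # all zeros: no sample cases overlap
print(df['overlap'].mean())    # 0.0 for this sample
```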
Matt

I would use the pandas groupby function, as your tags already suggest, grouping by username. The general workflow per grouped user is:

  • Collect all start and stop times as 'moments of change' in activity.
  • Loop over all of them in your grouped dataframe.
  • Use e.g. pandas.DataFrame.loc to check how many cases are 'active' at each moment of change.
  • Save these counts in a list to compute the average number of open cases.

I don't have your code, but in pseudo-code it would look something like:

import numpy as np

df = ...  # your raw df
grouped = df.groupby(by='Username')

for user, user_df in grouped:
    active_cases = []
    user_starts_cases = user_df['starttime'].to_numpy()
    user_stops_cases = user_df['stoptime'].to_numpy()
    # np.concatenate takes a sequence of arrays
    times_of_activity_changes = np.concatenate([user_starts_cases, user_stops_cases])
    
    for xs in times_of_activity_changes:
        num_activities = len(user_df.loc[(user_df['starttime'] <= xs) & (user_df['stoptime'] >= xs)])  # mind the brackets
        active_cases.append(num_activities)
    
    print(user, sum(active_cases) / len(active_cases))

It depends a bit on what you would call 'on average', but with this you can sample the number of active cases at the times of interest and compute an average.
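If 'on average' should instead be weighted by how long each concurrency level lasts, a sweep over the sorted start/stop events avoids sampling altogether. This is a minimal sketch, assuming the column names from the question; the two-case dataframe is made up for illustration:

```python
import pandas as pd

def time_weighted_concurrency(g):
    # +1 at every start, -1 at every stop; the running sum is the
    # number of cases open between consecutive events.
    events = pd.concat([
        pd.Series(1, index=pd.DatetimeIndex(g['starttime'])),
        pd.Series(-1, index=pd.DatetimeIndex(g['stoptime'])),
    ]).sort_index()
    open_cases = events.cumsum()
    # Seconds from each event to the next one (last entry is NaN)
    spans = events.index.to_series().diff().shift(-1).dt.total_seconds()
    held = open_cases.to_numpy()[:-1]   # level held during each span
    spans = spans.to_numpy()[:-1]
    busy = held > 0                     # ignore idle gaps between cases
    return (held[busy] * spans[busy]).sum() / spans[busy].sum()

# Two overlapping cases for one user: 2 hours of work in total,
# with both cases open for the middle half hour
demo = pd.DataFrame({
    'starttime': pd.to_datetime(['2020-07-03 09:00', '2020-07-03 09:30']),
    'stoptime':  pd.to_datetime(['2020-07-03 10:00', '2020-07-03 11:00']),
    'Username':  ['Arthur', 'Arthur'],
})

for user, g in demo.groupby('Username'):
    print(user, time_weighted_concurrency(g))  # Arthur 1.25
```

The 1.25 comes from one case being open for 90 minutes and two cases for 30 minutes: (1·5400 + 2·1800) / 7200.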