
I have a table like the following, but with approximately 7 million rows. What I am trying to find out is how many cases each user is working on simultaneously. I would like to group by Username and then get an average count of how many references are open concurrently between the start and stop times.

Reference  starttime                stoptime                 Username
1          2020-07-28 06:41:56.000  2020-07-28 07:11:25.000  Arthur
2          2020-07-18 13:24:02.000  2020-07-18 13:38:42.000  Arthur
3          2020-07-03 09:27:03.000  2020-07-03 10:35:24.000  Arthur
4          2020-07-05 19:42:38.000  2020-07-05 20:07:52.000  Bob
5          2020-07-04 10:22:48.000  2020-07-04 10:24:32.000  Bob

Any ideas?

rdr101
  • What part are you having trouble with? – wwii Feb 18 '21 at 15:07
  • Does [Efficient date range overlap calculation in python?](https://stackoverflow.com/questions/9044084/efficient-date-range-overlap-calculation-in-python) answer your question? Or [Determine Whether Two Date Ranges Overlap](https://stackoverflow.com/questions/325933/determine-whether-two-date-ranges-overlap), [Algorithm to detect overlapping periods](https://stackoverflow.com/questions/13513932/algorithm-to-detect-overlapping-periods), [How to find rows with overlapping date ranges?](https://stackoverflow.com/questions/44288858/how-to-find-rows-with-overlapping-date-ranges) – wwii Feb 18 '21 at 15:15
  • More Pandas: [Efficiently finding overlap between many date ranges](https://stackoverflow.com/questions/35266424/efficiently-finding-overlap-between-many-date-ranges), – wwii Feb 18 '21 at 15:22

2 Answers


Someone asked a similar question just yesterday so here it is:

ends = df['starttime'].values < df['stoptime'].values[:, None]
starts = df['starttime'].values > df['starttime'].values[:, None]
same_name = (df['Username'].values == df['Username'].values[:, None])

# a case j overlaps case i if it starts after i starts but before i stops,
# and belongs to the same user; count the matches by summing across axis=1
df['overlap'] = (ends & starts & same_name).sum(1)

df

To answer your final question, you would then take the mean:

df['overlap'].mean()
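Putting the snippet together with the sample data from the question gives a quick end-to-end check. On the five sample rows no cases overlap, so every count comes out zero; note also that the pairwise comparison builds an n×n boolean matrix, so it will not fit in memory for the full 7 million rows:

```python
import pandas as pd

# Sample rows from the question
df = pd.DataFrame({
    'Reference': [1, 2, 3, 4, 5],
    'starttime': pd.to_datetime([
        '2020-07-28 06:41:56', '2020-07-18 13:24:02', '2020-07-03 09:27:03',
        '2020-07-05 19:42:38', '2020-07-04 10:22:48']),
    'stoptime': pd.to_datetime([
        '2020-07-28 07:11:25', '2020-07-18 13:38:42', '2020-07-03 10:35:24',
        '2020-07-05 20:07:52', '2020-07-04 10:24:32']),
    'Username': ['Arthur', 'Arthur', 'Arthur', 'Bob', 'Bob'],
})

# Case j overlaps case i if it starts while i is open, for the same user
ends = df['starttime'].values < df['stoptime'].values[:, None]
starts = df['starttime'].values > df['starttime'].values[:, None]
same_name = df['Username'].values == df['Username'].values[:, None]
df['overlap'] = (ends & starts & same_name).sum(1)

print(df['overlap'].tolist())  # all zeros: no sample cases overlap
print(df['overlap'].mean())    # 0.0 for this sample
```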
Matt

I would use the pandas groupby function, as your tags already suggest, grouping by username. The general workflow per grouped user is:

  • Collect all start and stop times as 'moments of change' in activity.
  • Loop over all of them in your grouped dataframe.
  • Use e.g. pandas.DataFrame.loc to check how many cases are 'active' at each moment of change.
  • Save these counts in a list to compute the average number of open cases.

I don't have your code, but in pseudo-code it would look something like:

import numpy as np

df = ...  # your raw df
grouped = df.groupby(by='Username')

for user, user_df in grouped:
    active_cases = []
    user_starts_cases = user_df['starttime'].to_numpy()
    user_stops_cases = user_df['stoptime'].to_numpy()
    # np.concatenate takes a sequence of arrays
    times_of_activity_changes = np.concatenate([user_starts_cases, user_stops_cases])
    
    for xs in times_of_activity_changes:
        num_activities = len(user_df.loc[(user_df['starttime'] <= xs) & (user_df['stoptime'] >= xs)])  # mind the brackets
        active_cases.append(num_activities)
    
    print(user, sum(active_cases) / len(active_cases))

It depends a bit on what you would call 'on average', but with this you can sample the number of active cases at the times of interest and compute an average.
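If 'on average' should instead be weighted by how long each concurrency level lasts, a sweep over the sorted start/stop events avoids sampling altogether. This is a minimal sketch, assuming the column names from the question; the two-case dataframe is made up for illustration:

```python
import pandas as pd

def time_weighted_concurrency(g):
    # +1 at every start, -1 at every stop; the running sum is the
    # number of cases open between consecutive events.
    events = pd.concat([
        pd.Series(1, index=pd.DatetimeIndex(g['starttime'])),
        pd.Series(-1, index=pd.DatetimeIndex(g['stoptime'])),
    ]).sort_index()
    open_cases = events.cumsum()
    # Seconds from each event to the next one (last entry is NaN)
    spans = events.index.to_series().diff().shift(-1).dt.total_seconds()
    held = open_cases.to_numpy()[:-1]   # level held during each span
    spans = spans.to_numpy()[:-1]
    busy = held > 0                     # ignore idle gaps between cases
    return (held[busy] * spans[busy]).sum() / spans[busy].sum()

# Two overlapping cases for one user: 2 hours of work in total,
# with both cases open for the middle half hour
demo = pd.DataFrame({
    'starttime': pd.to_datetime(['2020-07-03 09:00', '2020-07-03 09:30']),
    'stoptime':  pd.to_datetime(['2020-07-03 10:00', '2020-07-03 11:00']),
    'Username':  ['Arthur', 'Arthur'],
})

for user, g in demo.groupby('Username'):
    print(user, time_weighted_concurrency(g))  # Arthur 1.25
```

The 1.25 comes from one case being open for 90 minutes and two cases for 30 minutes: (1·5400 + 2·1800) / 7200.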