This question builds on one that has already been answered here: Pandas Cumulative Sum using Current Row as Condition.
As an extension of that question, I want to find the maximum number of concurrent unique users, so I need to ignore (not count) any case where the same user is connected more than once during the period (row) in question.
I used the answer to the question referenced above to count the concurrent sessions for each row, but that count does not take the user into account.
One additional point: a single user's sessions can overlap with each other. For example, the following is possible for one user:
START_TIME           END_TIME             USER_NAME
29.03.2012 20:18:00  29.03.2012 20:54:13  abc
29.03.2012 20:19:42  29.03.2012 20:40:24  abc
29.03.2012 20:23:03  29.03.2012 20:40:24  abc
I have created the code below to get the max concurrent users:
import pandas as pd
import sys
filename = sys.argv[1]
# Read the csv into the pandas dataframe
df = pd.read_csv(filename)
# Convert start time and end time columns to datetime format
# The timestamps are day-first (dd.mm.yyyy), so say so explicitly
df["START_TIME"] = pd.to_datetime(df["START_TIME"], dayfirst=True)
df["END_TIME"] = pd.to_datetime(df["END_TIME"], dayfirst=True)
active_events = {}
# For each row, count the sessions that started at or before this row's start time
# and ended after it
for i in df.index:
    active_events[i] = len(df[(df["START_TIME"] <= df.loc[i, "START_TIME"]) & (df["END_TIME"] > df.loc[i, "START_TIME"])])
last_columns = pd.DataFrame({'CONCURRENT_EVENTS' : pd.Series(active_events)})
# Add the newly created column to the original file
total = df.join(last_columns)
# Save the output to a new csv file using part of the original filename
total.to_csv("Output" + " - " + filename)
I have tried adding a check against the user name, but I was only able to filter out sessions belonging to the same user as the current row, not duplicates among all concurrent users.
This is an example of the data from the csv input:
SESSION_ID  START_TIME        END_TIME          USER_NAME
45030       29.03.2020 14:37  29.03.2020 19:01  jkk
45033       29.03.2020 14:46  29.03.2020 16:23  ady
45035       29.03.2020 14:54  29.03.2020 18:27  ady
45036       29.03.2020 15:51  29.03.2020 17:34  drm
45040       29.03.2020 17:38  29.03.2020 22:07  ady
45042       29.03.2020 18:58  29.03.2020 20:25  djx
The code currently returns the following output csv:
   SESSION_ID  START_TIME        END_TIME          USER_NAME  CONCURRENT_EVENTS
0  45030       29.03.2020 14:37  29.03.2020 19:01  jkk        1
1  45033       29.03.2020 14:46  29.03.2020 16:23  ady        2
2  45035       29.03.2020 14:54  29.03.2020 18:27  ady        3
3  45036       29.03.2020 15:51  29.03.2020 17:34  drm        4
4  45040       29.03.2020 17:38  29.03.2020 22:07  ady        3
5  45042       29.03.2020 18:58  29.03.2020 20:25  djx        3
What I want to return is the output below, where the concurrent count is adjusted so that a user who has already been counted for the row in question is not counted again. Here the user ady is counted only once in each concurrent event calculation:
   SESSION_ID  START_TIME        END_TIME          USER_NAME  CONCURRENT_EVENTS
0  45030       29.03.2020 14:37  29.03.2020 19:01  jkk        1
1  45033       29.03.2020 14:46  29.03.2020 16:23  ady        2
2  45035       29.03.2020 14:54  29.03.2020 18:27  ady        2
3  45036       29.03.2020 15:51  29.03.2020 17:34  drm        3
4  45040       29.03.2020 17:38  29.03.2020 22:07  ady        2
5  45042       29.03.2020 18:58  29.03.2020 20:25  djx        3
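A minimal sketch of one way the desired count might be computed, assuming the goal is the number of distinct USER_NAME values among the sessions active at each row's start time. It keeps the same row-by-row filter as the code above and swaps len(...) for nunique() on the USER_NAME column. The function name concurrent_unique_users is hypothetical, and the inline sample simply repeats the six input rows shown earlier in comma-separated form.

```python
import io

import pandas as pd

# The six sample rows from the question, inlined so the sketch is self-contained
raw = """SESSION_ID,START_TIME,END_TIME,USER_NAME
45030,29.03.2020 14:37,29.03.2020 19:01,jkk
45033,29.03.2020 14:46,29.03.2020 16:23,ady
45035,29.03.2020 14:54,29.03.2020 18:27,ady
45036,29.03.2020 15:51,29.03.2020 17:34,drm
45040,29.03.2020 17:38,29.03.2020 22:07,ady
45042,29.03.2020 18:58,29.03.2020 20:25,djx
"""

df = pd.read_csv(io.StringIO(raw))
# Timestamps are day-first (dd.mm.yyyy), so the format is given explicitly
df["START_TIME"] = pd.to_datetime(df["START_TIME"], format="%d.%m.%Y %H:%M")
df["END_TIME"] = pd.to_datetime(df["END_TIME"], format="%d.%m.%Y %H:%M")


def concurrent_unique_users(frame):
    """Hypothetical helper: distinct users active at each row's start time."""
    counts = {}
    for i in frame.index:
        ref = frame.loc[i, "START_TIME"]
        # Same filter as the original code: started at or before ref, ended after ref
        active = frame[(frame["START_TIME"] <= ref) & (frame["END_TIME"] > ref)]
        # Count each user once, however many of their sessions overlap
        counts[i] = active["USER_NAME"].nunique()
    return pd.Series(counts)


df["CONCURRENT_EVENTS"] = concurrent_unique_users(df)
print(df)
```

On the sample rows this gives 1, 2, 2, 3, 2, 3 in CONCURRENT_EVENTS, matching the desired output above. Because nunique() counts each name once, a user whose own sessions overlap is still counted a single time. The loop is quadratic in the number of rows, so this is an illustration of the logic and may be slow on large files.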
Any help or ideas are much appreciated, thanks in advance.