Count max concurrent unique users for a log file

Question

This question builds on a question already answered here - Pandas Cumulative Sum using Current Row as Condition

I want to extend it to find the maximum number of concurrent unique users, so a user who is connected more than once during the period (row) in question should be counted only once.

Using the answer to the question referenced above, I can count the concurrent sessions at the start time of each row, but that count doesn't take the user into account.

One additional point: a user's sessions may overlap with each other. For example, the following is possible for a single user:

START_TIME           END_TIME               USER_NAME
29.03.2012 20:18:00  29.03.2012 20:54:13    abc
29.03.2012 20:19:42  29.03.2012 20:40:24    abc
29.03.2012 20:23:03  29.03.2012 20:40:24    abc

I have created the code below to get the max concurrent users:

import pandas as pd
import sys

filename = sys.argv[1]

# Read the csv into the pandas dataframe
df = pd.read_csv(filename)

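# Convert start time and end time columns to datetime format
# (the timestamps are day-first, e.g. 29.03.2020, so state that explicitly)
df["START_TIME"] = pd.to_datetime(df["START_TIME"], dayfirst=True)
df["END_TIME"] = pd.to_datetime(df["END_TIME"], dayfirst=True)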

active_events = {}
# Count instances where the start time was less than or equal to reference start and end time is greater than reference start time
for i in df.index:
    active_events[i] = len(df[(df["START_TIME"] <= df.loc[i, "START_TIME"]) & (df["END_TIME"] > df.loc[i, "START_TIME"])])
last_columns = pd.DataFrame({'CONCURRENT_EVENTS': pd.Series(active_events)})

# Add the newly created column to the original file
total = df.join(last_columns)

# Save the output to a new csv file using part of the original filename
total.to_csv("Output" + " - " + filename)

I have tried including a check against the user name but I was only able to filter out users that were the same as the user in the current row, not duplicates for all concurrent users.

This is an example of the data from the csv input:

SESSION_ID  START_TIME          END_TIME            USER_NAME
45030       29.03.2020 14:37    29.03.2020 19:01    jkk
45033       29.03.2020 14:46    29.03.2020 16:23    ady
45035       29.03.2020 14:54    29.03.2020 18:27    ady
45036       29.03.2020 15:51    29.03.2020 17:34    drm
45040       29.03.2020 17:38    29.03.2020 22:07    ady
45042       29.03.2020 18:58    29.03.2020 20:25    djx

The code currently returns the following output csv:

    SESSION_ID  START_TIME          END_TIME            USER_NAME   CONCURRENT_EVENTS
0   45030       29.03.2020 14:37    29.03.2020 19:01    jkk         1
1   45033       29.03.2020 14:46    29.03.2020 16:23    ady         2
2   45035       29.03.2020 14:54    29.03.2020 18:27    ady         3
3   45036       29.03.2020 15:51    29.03.2020 17:34    drm         4
4   45040       29.03.2020 17:38    29.03.2020 22:07    ady         3
5   45042       29.03.2020 18:58    29.03.2020 20:25    djx         3

What I want to return is the following, where the concurrent count is adjusted so that a user already counted for the row in question is not counted again. In this output the user ady is counted only once in each concurrent event calculation:

    SESSION_ID  START_TIME          END_TIME            USER_NAME   CONCURRENT_EVENTS
0   45030       29.03.2020 14:37    29.03.2020 19:01    jkk         1
1   45033       29.03.2020 14:46    29.03.2020 16:23    ady         2
2   45035       29.03.2020 14:54    29.03.2020 18:27    ady         2
3   45036       29.03.2020 15:51    29.03.2020 17:34    drm         3
4   45040       29.03.2020 17:38    29.03.2020 22:07    ady         2
5   45042       29.03.2020 18:58    29.03.2020 20:25    djx         3

Any help or ideas are much appreciated, thanks in advance.

why `CONCURRENT_EVENTS` 3 for `drm` and 3 for `djx`? looks like should be 4 for `djx` — Danila Ganchar, Feb 09 '21 at 15:21
For index 5 `djx` it is only index 0 `jkk` and index 4 `ady` that are being accessed at the same time so the output is 3. Btw thanks to everyone for the answers, this enabled me to solve the problem that I had been stuck on for some time! — snowdog, Feb 10 '21 at 13:12
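
To check the count discussed in the comments above, here is a minimal sketch that rebuilds the sample rows from the question and lists which sessions are active at the start of index 5. The DataFrame literal, the explicit `format` string, and the variable names are assumptions made for illustration; they are not part of the original script.

```python
import pandas as pd

# Sample rows copied from the question's CSV excerpt.
df = pd.DataFrame({
    "SESSION_ID": [45030, 45033, 45035, 45036, 45040, 45042],
    "START_TIME": ["29.03.2020 14:37", "29.03.2020 14:46", "29.03.2020 14:54",
                   "29.03.2020 15:51", "29.03.2020 17:38", "29.03.2020 18:58"],
    "END_TIME": ["29.03.2020 19:01", "29.03.2020 16:23", "29.03.2020 18:27",
                 "29.03.2020 17:34", "29.03.2020 22:07", "29.03.2020 20:25"],
    "USER_NAME": ["jkk", "ady", "ady", "drm", "ady", "djx"],
})

# Assumed timestamp layout: day.month.year hour:minute.
fmt = "%d.%m.%Y %H:%M"
df["START_TIME"] = pd.to_datetime(df["START_TIME"], format=fmt)
df["END_TIME"] = pd.to_datetime(df["END_TIME"], format=fmt)

# Sessions that started at or before row 5's start and end after it.
ref = df.loc[5, "START_TIME"]
active = df[(df["START_TIME"] <= ref) & (df["END_TIME"] > ref)]

active_ids = active["SESSION_ID"].tolist()
active_users = sorted(active["USER_NAME"].unique())
print(active_ids)    # [45030, 45040, 45042]
print(active_users)  # ['ady', 'djx', 'jkk']
```

Only three sessions are open at 18:58, each from a different user, which is why the expected value for `djx` is 3.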

score -1 · Accepted Answer · answered Feb 09 '21 at 15:23

Before assigning the len to active_events[i], you can group the filtered rows by USER_NAME. The length of the resulting groupby object is the number of groups, i.e. the number of distinct users.

for i in df.index:
    active = df[(df["START_TIME"] <= df.loc[i, "START_TIME"]) & (df["END_TIME"] > df.loc[i, "START_TIME"])]
    active_events[i] = len(active.groupby('USER_NAME'))
last_columns = pd.DataFrame({'CONCURRENT_EVENTS': pd.Series(active_events)})

Output:

   SESSION_ID          START_TIME            END_TIME USER_NAME  CONCURRENT_EVENTS
0       45030 2020-03-29 14:37:00 2020-03-29 19:01:00       jkk                  1
1       45033 2020-03-29 14:46:00 2020-03-29 16:23:00       ady                  2
2       45035 2020-03-29 14:54:00 2020-03-29 18:27:00       ady                  2
3       45036 2020-03-29 15:51:00 2020-03-29 17:34:00       drm                  3
4       45040 2020-03-29 17:38:00 2020-03-29 22:07:00       ady                  2
5       45042 2020-03-29 18:58:00 2020-03-29 20:25:00       djx                  3
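
As a small illustration of why taking `len` of the groupby object counts users rather than sessions, the sketch below compares it with the plain row count on a toy frame. The four-row frame and the variable names are made up for the example.

```python
import pandas as pd

# Hypothetical set of sessions active at one moment; ady appears twice.
sessions = pd.DataFrame({"USER_NAME": ["jkk", "ady", "ady", "drm"]})

n_rows = len(sessions)                          # counts sessions
n_groups = len(sessions.groupby("USER_NAME"))   # counts distinct users
n_unique = sessions["USER_NAME"].nunique()      # same count, without grouping

print(n_rows, n_groups, n_unique)  # 4 3 3
```

Both `len(groupby)` and `nunique()` ignore missing user names by default, so the two should agree on data like this.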

supercooler8 · Answer 2 · 2021-02-11T07:24:00.860

-1

I also answered your previous question. In general, I believe it may be more pythonic to use apply than for loops. But the simple answer to your question, if I understand it correctly, is to change one line in the for loop:

active_events[i] = df[
    (df["START_TIME"] <= df.loc[i, "START_TIME"])
    & (df["END_TIME"] > df.loc[i, "START_TIME"])
]["USER_NAME"].nunique()

edited Feb 11 '21 at 07:24

answered Feb 09 '21 at 17:03
