
I have a table of logins and logouts by user.

The table looks like this, but has a few hundred thousand rows:

data = [['aa', '2020-05-31 00:00:01', '2020-05-31 00:00:31'],
        ['bb','2020-05-31 00:01:01', '2020-05-31 00:02:01'],
        ['aa','2020-05-31 00:02:01', '2020-05-31 00:06:03'],
        ['cc','2020-05-31 00:03:01', '2020-05-31 00:04:01'],
        ['dd','2020-05-31 00:04:01', '2020-05-31 00:34:01'],
        ['aa', '2020-05-31 00:05:01', '2020-05-31 00:07:31'],
        ['bb','2020-05-31 00:05:01', '2020-05-31 00:06:01'],
        ['aa','2020-05-31 00:05:01', '2020-05-31 00:08:03'],
        ['cc','2020-05-31 00:10:01', '2020-05-31 00:40:01'],
        ['dd','2020-05-31 00:20:01', '2020-05-31 00:35:01']]


import pandas as pd

df_test = pd.DataFrame(data, columns=['user_id', 'login', 'logout'])
df_test[['login', 'logout']] = df_test[['login', 'logout']].apply(pd.to_datetime)

I was able to solve this problem in a hacky way using a for loop. It works fine on a smaller dataset, but takes hours on 300k rows.

Basically, this code calculates how many users were logged in at the same time for each session (a session being one row).

Here is my solution. It gives the result I need. I was also able to do the same by writing a lambda with apply, but that takes even longer.

import time

# create a new column for the simultaneous-session counts
df_test['simultaneous'] = 0

start_time = time.time()

# loop through dataframe and check condition
for i in df_test.index:
    login, logout = df_test.loc[i,'login'], df_test.loc[i,'logout']
    this_index = df_test.index.isin([i])
    df_test.loc[i, 'simultaneous'] = int(sum(
        (df_test[~this_index]['login'] <= logout) & (df_test[~this_index]['logout'] >= login)
    ))
print("--- %s seconds ---" % (time.time() - start_time))

Could you please take a look and let me know if there is a much better way of getting to the same result? Maybe I'm missing something obvious.

Thanks in advance!

Nikita Voevodin

3 Answers


This algorithm takes a streaming approach, relying on the fact that the data is sorted by login time. For each session, it keeps a count of all sessions whose logout time hasn't yet passed, by storing the logout times in a list and removing stale entries each time a new session is checked. I decided to count sess1.logout == sess2.login as simultaneous, but you can change the >= to > if you disagree.

The algorithm is in the calculate function.

#!/usr/bin/python

import datetime
import random
import time
from statistics import mean, stdev


def calculate(data):
    active_sessions = []        # logout times of sessions that are still active
    simultaneous_sessions = []
    for user_id, login, logout in data:
        # drop sessions whose logout time has already passed
        active_sessions = [ts for ts in active_sessions if ts >= login]
        simultaneous_sessions.append(len(active_sessions))
        active_sessions.append(logout)
    return simultaneous_sessions


def generate_data(numsessions):
    start_time = datetime.datetime(2020, 5, 13, 0, 0, 1)
    data = []
    while len(data) < numsessions:
        for cnt in range(random.choice([0, 0, 0, 1, 1, 2, 3])):
            user_id = chr(ord("a") + cnt) * 2
            duration = random.choice([30, 30, 60, 90, 90, 900, 1800])
            logout_time = start_time + datetime.timedelta(seconds=duration)
            data.append(
                (
                    user_id,
                    start_time.strftime("%Y-%m-%d %H:%M:%S"),
                    logout_time.strftime("%Y-%m-%d %H:%M:%S"),
                )
            )

        start_time += datetime.timedelta(minutes=1)
    return data


start_time = time.time()
num_sessions = 300_000
print(f"generating data for {num_sessions:.0f} sessions")
data = generate_data(num_sessions)
print(f"sample data=[{data[0]}]")
print("--- %.2f seconds ---" % (time.time() - start_time))
start_time = time.time()
print("calculating simultaneous sessions")
simultaneous_sessions = calculate(data)
print(
    "for {} sessions have max={} min={}, mean={:.2f} stdev={:.2f}".format(
        len(simultaneous_sessions),
        max(simultaneous_sessions),
        min(simultaneous_sessions),
        mean(simultaneous_sessions),
        stdev(simultaneous_sessions),
    )
)
print("--- %.2f seconds ---" % (time.time() - start_time))

From a performance perspective, I walk the list once, and while I constantly recreate the active_sessions list, that will be quick as long as active_sessions stays small. There are other optimizations you could make with a more efficient active_sessions structure, but this should be much faster than searching all of the data for every session. Even if the data weren't sorted by login time, I think it would be more efficient to sort by login time and then use this algorithm than to scan all sessions for each session.
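As one example of a more efficient active_sessions structure (a sketch of mine, not part of the original script), you could keep the logout times in a min-heap via heapq, so stale sessions are popped in O(log K) time instead of rebuilding the whole list on every iteration:

import heapq

def calculate_heap(data):
    # same logic as calculate(), but active logout times live in a min-heap
    active = []                 # min-heap of logout timestamps
    simultaneous_sessions = []
    for user_id, login, logout in data:
        # pop sessions that ended before this login (heap root is the earliest logout)
        while active and active[0] < login:
            heapq.heappop(active)
        simultaneous_sessions.append(len(active))
        heapq.heappush(active, logout)
    return simultaneous_sessions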

UPDATE: I've added a synthetic data generator, which creates a bunch of sessions based on some random variables. It shows that this algorithm takes less than a second for 300k rows.

for 300k sessions it takes ~0.4 seconds:

generating data for 300000 sessions
sample data=[('aa', '2020-05-13 00:02:01', '2020-05-13 00:03:31')]
--- 1.99 seconds ---
calculating simultaneous sessions
for 300001 sessions have max=26 min=0, mean=7.42 stdev=2.76
--- 0.35 seconds ---

for 3 million sessions it takes ~4 seconds:

generating data for 3000000 sessions
sample data=[('aa', '2020-05-13 00:00:01', '2020-05-13 00:01:31')]
--- 19.35 seconds ---
calculating simultaneous sessions
for 3000001 sessions have max=26 min=0, mean=7.43 stdev=2.77
--- 3.93 seconds ---

Overall, the approach is O(N) in the number of sessions, as long as the number of concurrently active sessions stays small.

toppk

Try this solution. On your data replicated 30,000 times (300k rows), it took ~1900 seconds to compute the result (AMD 3700X/Python 3.9.7), but I'm not sure how it will perform on real data:

mn = df_test["login"].min()
mx = df_test["logout"].max()

# one big-int bitmask per second of the covered range; bit i will be set
# for every second during which the session in row i is logged in
tmp = pd.Series(0, index=pd.date_range(mn, mx, freq="S"), dtype=object)


def fn1(x):
    # set this session's row bit for every second of its duration
    tmp[x["login"] : x["logout"]] = [
        v | (1 << x.name) for v in tmp[x["login"] : x["logout"]]
    ]


def fn2(x):
    # OR together the per-second masks over this session's span, then count
    # the set bits; subtract 1 to exclude the session itself
    out = 0
    for v in tmp[x["login"] : x["logout"]]:
        out |= v

    # If you use Python 3.10+ you can use this answer
    # https://stackoverflow.com/a/64848298/10035985
    # which should be ~6x faster instead of this:
    return bin(out).count("1") - 1


df_test.apply(fn1, axis=1)
df_test["sim"] = df_test.apply(fn2, axis=1)
Andrej Kesely

If you restructure your data, you can do a single pass through it. This is a good application for pandas.melt:

# use a session id, rather than the user id, since a single user can log in multiple times:
df_test['sid'] = df_test.user_id + "-" + df_test.index.astype(str)

#df_test 
#  user_id               login              logout   sid
#0      aa 2020-05-31 00:00:01 2020-05-31 00:00:31  aa-0
#1      bb 2020-05-31 00:01:01 2020-05-31 00:02:01  bb-1
#2      aa 2020-05-31 00:02:01 2020-05-31 00:06:03  aa-2
#3      cc 2020-05-31 00:03:01 2020-05-31 00:04:01  cc-3
#4      dd 2020-05-31 00:04:01 2020-05-31 00:34:01  dd-4
#5      aa 2020-05-31 00:05:01 2020-05-31 00:07:31  aa-5
#6      bb 2020-05-31 00:05:01 2020-05-31 00:06:01  bb-6
#7      aa 2020-05-31 00:05:01 2020-05-31 00:08:03  aa-7
#8      cc 2020-05-31 00:10:01 2020-05-31 00:40:01  cc-8
#9      dd 2020-05-31 00:20:01 2020-05-31 00:35:01  dd-9

# restructure the data, and sort it
df_chrono = pd.melt(df_test.set_index('sid'), value_vars=['login', 'logout'], ignore_index=False)

df_chrono = df_chrono.sort_values(by='value').reset_index()
#df_chrono:
#     sid variable               value
#0   aa-0    login 2020-05-31 00:00:01
#1   aa-0   logout 2020-05-31 00:00:31
#2   bb-1    login 2020-05-31 00:01:01
#3   aa-2    login 2020-05-31 00:02:01
#4   bb-1   logout 2020-05-31 00:02:01
#5   cc-3    login 2020-05-31 00:03:01
#6   dd-4    login 2020-05-31 00:04:01
#7   cc-3   logout 2020-05-31 00:04:01
#8   aa-5    login 2020-05-31 00:05:01
#9   bb-6    login 2020-05-31 00:05:01
#10  aa-7    login 2020-05-31 00:05:01
#11  bb-6   logout 2020-05-31 00:06:01
#12  aa-2   logout 2020-05-31 00:06:03
#13  aa-5   logout 2020-05-31 00:07:31
#14  aa-7   logout 2020-05-31 00:08:03
#15  cc-8    login 2020-05-31 00:10:01
#16  dd-9    login 2020-05-31 00:20:01
#17  dd-4   logout 2020-05-31 00:34:01
#18  dd-9   logout 2020-05-31 00:35:01
#19  cc-8   logout 2020-05-31 00:40:01

With the chronological data, we can pass through it once and easily track who's logged in at each iteration. (Note: see the update below for a more optimized version of the following loop.)

# keep track of the current logins in simul_track, allowing for a single pass through the data

simul_track = {}
results = {"sid": [], "simul":[]}

for i, row in df_chrono.iterrows():

    if row.variable == 'login':
        # a new login: every session currently logged in overlaps one more session
        for sid in simul_track:
            simul_track[sid] += 1

        if row.sid not in simul_track:
            simul_track[row.sid] = len(simul_track)  # number of current logins

    else:
        # a logout: this session is done; record its final overlap count
        results['simul'].append(simul_track.pop(row.sid))
        results['sid'].append(row.sid)

#results 
#{'sid': ['aa-0',
#  'bb-1',
#  'cc-3',
#  'bb-6',
#  'aa-2',
#  'aa-5',
#  'aa-7',
#  'dd-4',
#  'dd-9',
#  'cc-8'],
# 'simul': [0, 1, 2, 4, 6, 4, 4, 7, 2, 2]}

You can update the original dataframe using the results dict (note the results key 'sid' is critical for alignment)

pd.merge(df_test, pd.DataFrame(results), on='sid') 
#  user_id               login              logout   sid  simul
#0      aa 2020-05-31 00:00:01 2020-05-31 00:00:31  aa-0      0
#1      bb 2020-05-31 00:01:01 2020-05-31 00:02:01  bb-1      1
#2      aa 2020-05-31 00:02:01 2020-05-31 00:06:03  aa-2      6
#3      cc 2020-05-31 00:03:01 2020-05-31 00:04:01  cc-3      2
#4      dd 2020-05-31 00:04:01 2020-05-31 00:34:01  dd-4      7
#5      aa 2020-05-31 00:05:01 2020-05-31 00:07:31  aa-5      4
#6      bb 2020-05-31 00:05:01 2020-05-31 00:06:01  bb-6      4
#7      aa 2020-05-31 00:05:01 2020-05-31 00:08:03  aa-7      4
#8      cc 2020-05-31 00:10:01 2020-05-31 00:40:01  cc-8      2
#9      dd 2020-05-31 00:20:01 2020-05-31 00:35:01  dd-9      2

Update

If there are a large number of users logged in at the same time, the dictionary update above (for sid in simul_track: simul_track[sid] += 1) can become a bottleneck. To get around that, one can use the following scheme:

import numpy as np
import time

t = time.time()
results = {"sid": [], "simul":[]}
n_records = len(df_chrono)
n_active = 0  # we will track the number of active logins here

# create an array for quick incremental updates
# Each session id gets a unique element in tracker
n_session = len(df_test)
tracker = np.zeros(n_session, dtype=np.uint)
# we create a 1-to-1 mapping from session id to the tracker array
idx_from_sid = {sid:i for i,sid in zip(df_test.index, df_test.sid)}

for i, row in df_chrono.iterrows():
    idx = idx_from_sid[row.sid]  # position in the tracker array corresponding to this session id

    # print progress
    if i % 100 == 0:
        perc_done = i / n_records * 100.
        print("prog=%.2f%% (rt=%.3fsec)." % (perc_done, time.time() - t), flush=True, end='\r')
    
    if row.variable == 'login':
        # We track two quantities.
        # The first is how many additional users log in after the current sid starts:
        tracker += 1  # never mind that we increment every element here; the next line overrides this particular session's value

        # The second is how many users are active when this session id logs in:
        tracker[idx] = n_active
        n_active += 1
    else:
        n_active -= 1
        count = tracker[idx]
        results['simul'].append(count)
        results['sid'].append(row.sid)
print("")

Similar to one of the other answers, I timed this on data * 30,000 to simulate scaling to 300,000 rows, and was able to compute the simultaneous values in ~110 sec.
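For reference, a minimal sketch of how such a scaled-up test frame could be built (this is my assumption about the benchmark setup, not necessarily the exact code that was timed):

# hypothetical benchmark setup: tile the 10-row sample up to 300k rows
df_big = pd.concat([df_test[['user_id', 'login', 'logout']]] * 30_000, ignore_index=True)
df_big['sid'] = df_big.user_id + "-" + df_big.index.astype(str)
df_chrono_big = (pd.melt(df_big.set_index('sid'), value_vars=['login', 'logout'], ignore_index=False)
                   .sort_values(by='value')
                   .reset_index())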


Now, given my answer, you might still be interested in your original solution; there are several optimizations you can make to it as well. In particular, df_test.loc[~this_index] need only be computed once per iteration. Further, df_test.loc[this_index] is a single row of the dataframe, and (df_test[this_index]['login'] <= logout) & (df_test[this_index]['logout'] >= login) will always be True for it, so there is no need to do the slicing at all:

df_test = df_test.reset_index(drop=True)  # just in case
for i, row in df_test.iterrows():

    df_test.loc[i, 'simultaneous'] = int(np.sum(
        (df_test.login <= row.logout) & (df_test.logout >= row.login)
    )) - 1  # subtracting one (the row itself) saves us from having to do df.loc[~this_index]

    # alternatively, you can try numexpr to speed up the element-wise comparisons:
    # in_lt_out = pd.eval('df_test.login <= row.logout', engine='numexpr')
    # out_gt_in = pd.eval('df_test.logout >= row.login', engine='numexpr')
    # simul = np.sum(pd.eval('in_lt_out & out_gt_in', engine='numexpr'))
    # df_test.loc[i, 'simultaneous'] = int(simul - 1)
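If you want to drop the Python loop entirely, here is a fully vectorized sketch (my addition, not part of the original answer): two binary searches over the sorted login/logout arrays give the overlap count for every session at once.

import numpy as np

logins_sorted = np.sort(df_test['login'].values)
logouts_sorted = np.sort(df_test['logout'].values)

# number of sessions j with login_j <= logout_i
started = np.searchsorted(logins_sorted, df_test['logout'].values, side='right')
# number of sessions j with logout_j < login_i (these cannot overlap session i)
ended = np.searchsorted(logouts_sorted, df_test['login'].values, side='left')

df_test['simultaneous'] = started - ended - 1  # -1 excludes the session itself

On the sample data this reproduces the counts above (0, 1, 6, 2, 7, 4, 4, 4, 2, 2), and it runs in O(N log N) overall.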


Note: I'm curious what you were trying to do with the .isin call; it makes me think your definition of simultaneous was perhaps meant to count unique users. However, here, and in your solution, that's not the case. That's probably something you want to make clearer. I believe that in the solution I posted you can simply replace 'sid' with 'user_id' if you want simultaneous to reflect unique logins, but I haven't tested it. Good luck, fun problem.

dermen