If you restructure your data, you can do a single pass through it. This is a good application for pandas.melt:
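For a self-contained run, df_test from the question can be reconstructed from the output shown below (the timestamps are copied verbatim; only the variable name df_test is taken from the question):

```python
import pandas as pd

# the question's sample data, rebuilt from the printed output below
df_test = pd.DataFrame({
    'user_id': ['aa', 'bb', 'aa', 'cc', 'dd', 'aa', 'bb', 'aa', 'cc', 'dd'],
    'login':  pd.to_datetime(['2020-05-31 00:00:01', '2020-05-31 00:01:01',
                              '2020-05-31 00:02:01', '2020-05-31 00:03:01',
                              '2020-05-31 00:04:01', '2020-05-31 00:05:01',
                              '2020-05-31 00:05:01', '2020-05-31 00:05:01',
                              '2020-05-31 00:10:01', '2020-05-31 00:20:01']),
    'logout': pd.to_datetime(['2020-05-31 00:00:31', '2020-05-31 00:02:01',
                              '2020-05-31 00:06:03', '2020-05-31 00:04:01',
                              '2020-05-31 00:34:01', '2020-05-31 00:07:31',
                              '2020-05-31 00:06:01', '2020-05-31 00:08:03',
                              '2020-05-31 00:40:01', '2020-05-31 00:35:01']),
})
```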
# use a session id, as opposed to a user id, as a single user can log in multiple times:
df_test['sid'] = df_test.user_id + "-" + df_test.index.astype(str)
#df_test
# user_id login logout sid
#0 aa 2020-05-31 00:00:01 2020-05-31 00:00:31 aa-0
#1 bb 2020-05-31 00:01:01 2020-05-31 00:02:01 bb-1
#2 aa 2020-05-31 00:02:01 2020-05-31 00:06:03 aa-2
#3 cc 2020-05-31 00:03:01 2020-05-31 00:04:01 cc-3
#4 dd 2020-05-31 00:04:01 2020-05-31 00:34:01 dd-4
#5 aa 2020-05-31 00:05:01 2020-05-31 00:07:31 aa-5
#6 bb 2020-05-31 00:05:01 2020-05-31 00:06:01 bb-6
#7 aa 2020-05-31 00:05:01 2020-05-31 00:08:03 aa-7
#8 cc 2020-05-31 00:10:01 2020-05-31 00:40:01 cc-8
#9 dd 2020-05-31 00:20:01 2020-05-31 00:35:01 dd-9
# restructure the data, and sort it
df_chrono = pd.melt(df_test.set_index('sid'), value_vars=['login', 'logout'], ignore_index=False)
df_chrono = df_chrono.sort_values(by='value').reset_index()
#df_chrono:
# sid variable value
#0 aa-0 login 2020-05-31 00:00:01
#1 aa-0 logout 2020-05-31 00:00:31
#2 bb-1 login 2020-05-31 00:01:01
#3 aa-2 login 2020-05-31 00:02:01
#4 bb-1 logout 2020-05-31 00:02:01
#5 cc-3 login 2020-05-31 00:03:01
#6 dd-4 login 2020-05-31 00:04:01
#7 cc-3 logout 2020-05-31 00:04:01
#8 aa-5 login 2020-05-31 00:05:01
#9 bb-6 login 2020-05-31 00:05:01
#10 aa-7 login 2020-05-31 00:05:01
#11 bb-6 logout 2020-05-31 00:06:01
#12 aa-2 logout 2020-05-31 00:06:03
#13 aa-5 logout 2020-05-31 00:07:31
#14 aa-7 logout 2020-05-31 00:08:03
#15 cc-8 login 2020-05-31 00:10:01
#16 dd-9 login 2020-05-31 00:20:01
#17 dd-4 logout 2020-05-31 00:34:01
#18 dd-9 logout 2020-05-31 00:35:01
#19 cc-8 logout 2020-05-31 00:40:01
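To see what melt with ignore_index=False does here in isolation, a minimal sketch on toy data of my own (the strings 't1'..'t4' stand in for timestamps): each session keeps its sid as the index and contributes one login row and one logout row.

```python
import pandas as pd

df = pd.DataFrame({
    'sid': ['a-0', 'b-1'],
    'login': ['t1', 't3'],
    'logout': ['t2', 't4'],
})
# ignore_index=False keeps the sid index, so every event stays tied to its session
long = pd.melt(df.set_index('sid'), value_vars=['login', 'logout'], ignore_index=False)
long = long.sort_values(by='value').reset_index()
# long has columns sid, variable ('login'/'logout'), value -- two rows per session
```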
With the data in chronological order, we can make a single pass through it and easily track who's logged in at each iteration. (Note: see the update below for a more optimized version of the following loop.)
# keep track of the current logins in simul_track, allowing a single pass through the data
simul_track = {}
results = {"sid": [], "simul": []}
for i, row in df_chrono.iterrows():
    if row.variable == 'login':
        for sid in simul_track:
            simul_track[sid] += 1
        if row.sid not in simul_track:
            simul_track[row.sid] = len(simul_track)  # number of current logins
    else:
        results['simul'].append(simul_track.pop(row.sid))
        results['sid'].append(row.sid)
#results
#{'sid': ['aa-0',
# 'bb-1',
# 'cc-3',
# 'bb-6',
# 'aa-2',
# 'aa-5',
# 'aa-7',
# 'dd-4',
# 'dd-9',
# 'cc-8'],
# 'simul': [0, 1, 2, 4, 6, 4, 4, 7, 2, 2]}
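As a sanity check on the loop's logic, here is a self-contained run on toy data of my own: two sessions, a-0 and b-1, that fully overlap, so each should report one simultaneous session.

```python
import pandas as pd

# toy chronological events: a-0 logs in, b-1 logs in, b-1 logs out, a-0 logs out
df_chrono = pd.DataFrame({
    'sid': ['a-0', 'b-1', 'b-1', 'a-0'],
    'variable': ['login', 'login', 'logout', 'logout'],
})

# same loop as above
simul_track = {}
results = {"sid": [], "simul": []}
for i, row in df_chrono.iterrows():
    if row.variable == 'login':
        for sid in simul_track:
            simul_track[sid] += 1
        if row.sid not in simul_track:
            simul_track[row.sid] = len(simul_track)
    else:
        results['simul'].append(simul_track.pop(row.sid))
        results['sid'].append(row.sid)
# results == {'sid': ['b-1', 'a-0'], 'simul': [1, 1]}
```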
You can update the original dataframe using the results dict (note that the 'sid' key in results is critical for alignment):
pd.merge(df_test, pd.DataFrame(results), on='sid')
# user_id login logout sid simul
#0 aa 2020-05-31 00:00:01 2020-05-31 00:00:31 aa-0 0
#1 bb 2020-05-31 00:01:01 2020-05-31 00:02:01 bb-1 1
#2 aa 2020-05-31 00:02:01 2020-05-31 00:06:03 aa-2 6
#3 cc 2020-05-31 00:03:01 2020-05-31 00:04:01 cc-3 2
#4 dd 2020-05-31 00:04:01 2020-05-31 00:34:01 dd-4 7
#5 aa 2020-05-31 00:05:01 2020-05-31 00:07:31 aa-5 4
#6 bb 2020-05-31 00:05:01 2020-05-31 00:06:01 bb-6 4
#7 aa 2020-05-31 00:05:01 2020-05-31 00:08:03 aa-7 4
#8 cc 2020-05-31 00:10:01 2020-05-31 00:40:01 cc-8 2
#9 dd 2020-05-31 00:20:01 2020-05-31 00:35:01 dd-9 2
Update
If there are a large number of users logged in at the same time, the dictionary update above (for sid in simul_track: simul_track[sid] += 1) can become a bottleneck. To get around that, one can use the following scheme:
import numpy as np
import time

t = time.time()
results = {"sid": [], "simul": []}
n_records = len(df_chrono)
n_active = 0  # we will track the number of active logins here

# create an array for quick incremental updates;
# each session id gets a unique element in tracker
n_session = len(df_test)
tracker = np.zeros(n_session, dtype=np.uint)
# create a 1-to-1 mapping from session id to a position in the tracker array
idx_from_sid = {sid: i for i, sid in zip(df_test.index, df_test.sid)}

for i, row in df_chrono.iterrows():
    idx = idx_from_sid[row.sid]  # position in tracker corresponding to this particular session id
    # print progress
    if i % 100 == 0:
        perc_done = i / n_records * 100.
        print("prog=%.2f%% (rt=%.3fsec)." % (perc_done, time.time() - t), flush=True, end='\r')
    if row.variable == 'login':
        # We track two quantities:
        # (1) how many additional users log in after the current sid starts
        tracker += 1  # never mind that we increment all values here; the next line overrides this particular session's value
        # (2) how many active users there are when this session id logs in
        tracker[idx] = n_active
        n_active += 1
    else:
        n_active -= 1
        count = tracker[idx]
        results['simul'].append(count)
        results['sid'].append(row.sid)
print("")
Similar to one of the other answers, I timed this on data*30000 to simulate scaling to 300,000 rows, and was able to compute the simultaneous values in ~110 sec.
Now, given my answer, you might still be interested in your original solution, and there are several optimizations you can make there as well. In particular, df_test.loc[~this_index] need only be computed once per iteration. Further, df_test.loc[this_index] is a single row of the dataframe, and for that row (df_test[this_index]['login'] <= logout) & (df_test[this_index]['logout'] >= login) is always True, hence there is no need to do the slicing:
df_test = df_test.reset_index(drop=True)  # just in case; note the assignment, since reset_index is not in-place
for i, row in df_test.iterrows():
    df_test.loc[i, 'simultaneous'] = int(np.sum(
        (df_test.login <= row.logout) & (df_test.logout >= row.login)
    )) - 1  # the subtraction by one saves us from having to do df.loc[~this_index]
    # alternatively, you can try numexpr to speed up the element-wise comparisons:
    # in_lt_out = pd.eval('df_test.login <= row.logout', engine='numexpr')
    # out_gt_in = pd.eval('df_test.logout >= row.login', engine='numexpr')
    # simul = np.sum(pd.eval('in_lt_out & out_gt_in', engine='numexpr'))
    # df_test.loc[i, 'simultaneous'] = int(simul - 1)
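The comparison in that loop is just the standard interval-intersection test: two intervals overlap iff each starts before the other ends. On plain integers (toy data of my own):

```python
import numpy as np

# toy intervals (start, end); interval i overlaps a query (qs, qe) iff start <= qe and end >= qs
starts = np.array([0, 5, 10])
ends = np.array([4, 9, 14])
qs, qe = 3, 6  # query interval
overlaps = (starts <= qe) & (ends >= qs)
# overlaps -> [True, True, False]: (0,4) and (5,9) intersect (3,6); (10,14) does not
```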
Note, I'm curious what you were trying to do with the .isin call; it makes me think your definition of simultaneous was perhaps about unique users. However, here and in your solution, that's not what is computed, which is probably something you want to make clearer. In the solution I posted, I believe you can simply replace 'sid' with 'user_id' if you want simultaneous to reflect unique logins, but I haven't tested it. Good luck, fun problem.
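If unique users are what you're after, one untested sketch of my own (a brute-force variant, not the single-pass solution above) is to deduplicate on user_id inside the overlap mask; the -1 excludes the row's own user:

```python
import pandas as pd

# toy data of my own (not the question's): two users, all three sessions overlap
df = pd.DataFrame({
    'user_id': ['aa', 'aa', 'bb'],
    'login':  pd.to_datetime(['2020-05-31 00:00', '2020-05-31 00:01', '2020-05-31 00:02']),
    'logout': pd.to_datetime(['2020-05-31 00:10', '2020-05-31 00:11', '2020-05-31 00:03']),
})
# count distinct *other* users with an overlapping session
df['simul_users'] = [
    df.loc[(df.login <= row.logout) & (df.logout >= row.login), 'user_id'].nunique() - 1
    for _, row in df.iterrows()
]
# every session overlaps both users, so each row sees 1 other distinct user
```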