How to count concurrent events in a dataframe in one line?

Question

I have a dataset with phone calls. I want to count how many active calls there are for each record. I found this question but I'd like to avoid loops and functions.

Each call has a date, a start time and a end time.

The dataframe:

      start       end        date
0  09:17:12  09:18:20  2016-08-10
1  09:15:58  09:17:42  2016-08-11
2  09:16:40  09:17:49  2016-08-11
3  09:17:05  09:18:03  2016-08-11
4  09:18:22  09:18:30  2016-08-11

What I want:

      start       end        date  activecalls
0  09:17:12  09:18:20  2016-08-10            1
1  09:15:58  09:17:42  2016-08-11            1
2  09:16:40  09:17:49  2016-08-11            2
3  09:17:05  09:18:03  2016-08-11            3
4  09:18:22  09:18:30  2016-08-11            1

My code:

import pandas as pd

df = pd.read_clipboard(sep='\s\s+')

df['activecalls'] = df[(df['start'] <= df.loc[df.index]['start']) & \
                       (df['end'] > df.loc[df.index]['start']) & \
                       (df['date'] == df.loc[df.index]['date'])].count()

print(df)

What I get:

      start       end        date  activecalls
0  09:17:12  09:18:20  2016-08-10          NaN
1  09:15:58  09:17:42  2016-08-11          NaN
2  09:16:40  09:17:49  2016-08-11          NaN
3  09:17:05  09:18:03  2016-08-11          NaN
4  09:18:22  09:18:30  2016-08-11          NaN

Please check [How to make good reproducible pandas examples](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) — jezrael, Sep 13 '16 at 10:13

jezrael · Accepted Answer · 2016-09-14T10:23:17.993

You can use:

#convert time and date to datetime
df['date_start'] = pd.to_datetime(df.start + ' ' + df.date)
df['date_end'] = pd.to_datetime(df.end + ' ' + df.date)
#remove columns
df = df.drop(['start','end','date'], axis=1)

Solution with loop:

active_events= []
for i in df.index:
    active_events.append(len(df[(df["date_start"]<=df.loc[i,"date_start"]) & 
                                (df["date_end"]> df.loc[i,"date_start"])]))
df['activecalls'] = pd.Series(active_events)
print (df)
           date_start            date_end  activecalls
0 2016-08-10 09:17:12 2016-08-10 09:18:20            1
1 2016-08-11 09:15:58 2016-08-11 09:17:42            1
2 2016-08-11 09:16:40 2016-08-11 09:17:49            2
3 2016-08-11 09:17:05 2016-08-11 09:18:03            3
4 2016-08-11 09:18:22 2016-08-11 09:18:30            1

Solution with merge

#cross join
df['tmp'] = 1
df1 = pd.merge(df,df.reset_index(),on=['tmp'])
df = df.drop('tmp', axis=1)
#print (df1)

#filtering by conditions
df1 = df1[(df1["date_start_x"]<=df1["date_start_y"])  
          (df1["date_end_x"]> df1["date_start_y"])]
print (df1)
          date_start_x          date_end_x  activecalls_x  tmp  index  \
0  2016-08-10 09:17:12 2016-08-10 09:18:20              1    1      0   
6  2016-08-11 09:15:58 2016-08-11 09:17:42              1    1      1   
7  2016-08-11 09:15:58 2016-08-11 09:17:42              1    1      2   
8  2016-08-11 09:15:58 2016-08-11 09:17:42              1    1      3   
12 2016-08-11 09:16:40 2016-08-11 09:17:49              2    1      2   
13 2016-08-11 09:16:40 2016-08-11 09:17:49              2    1      3   
18 2016-08-11 09:17:05 2016-08-11 09:18:03              3    1      3   
24 2016-08-11 09:18:22 2016-08-11 09:18:30              1    1      4   

          date_start_y          date_end_y  activecalls_y  
0  2016-08-10 09:17:12 2016-08-10 09:18:20              1  
6  2016-08-11 09:15:58 2016-08-11 09:17:42              1  
7  2016-08-11 09:16:40 2016-08-11 09:17:49              2  
8  2016-08-11 09:17:05 2016-08-11 09:18:03              3  
12 2016-08-11 09:16:40 2016-08-11 09:17:49              2  
13 2016-08-11 09:17:05 2016-08-11 09:18:03              3  
18 2016-08-11 09:17:05 2016-08-11 09:18:03              3  
24 2016-08-11 09:18:22 2016-08-11 09:18:30              1

#get size - active calls
print (df1.groupby(['index'], sort=False).size())
index
0    1
1    1
2    2
3    3
4    1
dtype: int64

df['activecalls'] = df1.groupby('index').size()
print (df)
           date_start            date_end  activecalls
0 2016-08-10 09:17:12 2016-08-10 09:18:20            1
1 2016-08-11 09:15:58 2016-08-11 09:17:42            1
2 2016-08-11 09:16:40 2016-08-11 09:17:49            2
3 2016-08-11 09:17:05 2016-08-11 09:18:03            3
4 2016-08-11 09:18:22 2016-08-11 09:18:30            1

Timings:

def a(df):
    active_events= []
    for i in df.index:
        active_events.append(len(df[(df["date_start"]<=df.loc[i,"date_start"]) & (df["date_end"]> df.loc[i,"date_start"])]))
    df['activecalls'] = pd.Series(active_events)
    return (df)

def b(df):
    df['tmp'] = 1
    df1 = pd.merge(df,df.reset_index(),on=['tmp'])
    df = df.drop('tmp', axis=1)
    df1 = df1[(df1["date_start_x"]<=df1["date_start_y"])  & (df1["date_end_x"]> df1["date_start_y"])]
    df['activecalls'] = df1.groupby('index').size()
    return (df)

print (a(df))
print (b(df))

In [160]: %timeit (a(df))
100 loops, best of 3: 6.76 ms per loop

In [161]: %timeit (b(df))
The slowest run took 4.42 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 4.61 ms per loop

I was hoping to be able to do it in one line. I tried the solution with the loop and it's working. Thank you for your help. Do you know why my code is not working? I also tried `df['activecalls'] = len(df[...])` but it didn't work. — Guillaume, Sep 14 '16 at 10:05
I think it is complicated problem, so solution is complicated too. And `df[(df['start'] <= df.loc[df.index]['start']) & \ (df['end'] > df.loc[df.index]['start']) & \ (df['date'] == df.loc[df.index]['date'])].count()` doesnt work, because you need compare each value of `start` column with all values of `start` and all values of `end`. So instead loops I use merge, but sometimes solution with looping is simplier and faster. — jezrael, Sep 14 '16 at 10:18
There's obviously something I misunderstand about pandas dataframes. I have to search about this. Thank you for your help. — Guillaume, Sep 19 '16 at 05:45

score 1 · Answer 2 · answered May 24 '22 at 20:00

As in jezrael's answer, first convert to datetime:

#convert time and date to datetime
df['date_start'] = pd.to_datetime(df.start + ' ' + df.date)
df['date_end'] = pd.to_datetime(df.end + ' ' + df.date)
#remove columns
df = df.drop(['start','end','date'], axis=1)

Then you can do a one-liner using apply:

df['activecalls'] = df.apply(
    lambda x: len(df[
        (df['date_start'] <= x['date_start']) & \
        (df['date_end'] > x['date_start'])]), 
    axis=1)

Which gives the required result

print(df)

    date_start            date_end         activecalls
0 2016-08-10 09:17:12 2016-08-10 09:18:20            1
1 2016-08-11 09:15:58 2016-08-11 09:17:42            1
2 2016-08-11 09:16:40 2016-08-11 09:17:49            2
3 2016-08-11 09:17:05 2016-08-11 09:18:03            3
4 2016-08-11 09:18:22 2016-08-11 09:18:30            1

How to count concurrent events in a dataframe in one line?

2 Answers2

Linked

Related