3

I have the following dataframe in python pandas which is over 4 million rows,

ID      BOOKING_TIME    ENTRY_TIME              
23239   1/1/2020 0:00   1/1/2020 0:40                 
51042   1/1/2020 0:11   1/1/2020 0:42                 
73373   1/1/2020 0:15   1/1/2020 0:56                   
14222   1/1/2020 0:22   1/1/2020 1:00                   
27116   1/1/2020 0:55   1/1/2020 1:15                    
....

The two columns have been merged from 2 different sql tables.

The same user could make a booking further down the table again.

Each row shows a user creating a booking (entering the queue to enter the venue) and the time the user enters the venue.

I am trying to create a column that shows the number of people that are currently queuing based on the time a user makes a new booking.

When a booking is made, an entry under BOOKING_TIME is recorded and they join a queue.

When they enter the venue, the ENTRY_TIME is recorded and they leave the queue.

I want to do a running count of the people that are in the queue. Meaning that in each row, if a booking time is hit, it does +1 in the IN_QUEUE column and once the entry time (for any of the cases in the previous timestamps) is hit, I want to -1 from the IN_QUEUE.

It will create a new column like this. (Ignore the working column, that is just to show the process)


ID      BOOKING_TIME    ENTRY_TIME      IN_QUEUE           (working)
23239   1/1/2020 0:00   1/1/2020 0:40   1                   +1
51042   1/1/2020 0:11   1/1/2020 0:42   2                   +1
73373   1/1/2020 0:15   1/1/2020 0:56   3                   +1
14222   1/1/2020 0:22   1/1/2020 1:00   4                   +1
27116   1/1/2020 0:55   1/1/2020 1:15   3                   -1   +1
....

Row 5 will -1 first because before 0:55am on 1/1/2020, the person from the first row entered the venue at 0:40am.

I think I have to use an apply function with a conditional function on every row but I am not sure how it would work.

I was thinking maybe for each row, count all rows where BOOKING_TIME < CURRENT BOOKING_TIME and ENTRY_TIME > CURRENT BOOKING_TIME, but I'm not sure how to code it or if my logic is even correct.

I saw this post Pandas: conditional rolling count but I cant figure out how to make it work with my case.

anarchy
  • 3,709
  • 2
  • 16
  • 48

4 Answers4

3
import pandas as pd
import time

df = pd.DataFrame([[23239,'1/1/2020 0:00','1/1/2020 0:40'],[51042,'1/1/2020 0:11','1/1/2020 0:42'],
                   [73373,'1/1/2020 0:15','1/1/2020 0:56'],[14222,'1/1/2020 0:22','1/1/2020 1:00'],
                   [27116,'1/1/2020 0:55','1/1/2020 1:15']],columns = ['ID','BOOKING_TIME','ENTRY_TIME'])
df = df.sort_values(by='ENTRY_TIME')
# Copy 1000 times
df = pd.concat([df for i in range(1000)])
df['BOOKING_TIME'] = pd.to_datetime(df['BOOKING_TIME'], format='%d/%m/%Y %H:%M')
df['ENTRY_TIME'] = pd.to_datetime(df['ENTRY_TIME'], format='%d/%m/%Y %H:%M')

# improvement code   
start =time.time() 
df['IN_QUEUE'] = df.index.map(lambda index_value: (df['ENTRY_TIME'].values[:index_value+1] > df['BOOKING_TIME'].values[index_value]).sum())
end = time.time()
print('Running time: %s Seconds'%(end-start))
# Running time: 0.04886770248413086 Seconds

Numpy is the foundation of pandas, Using numpy will be much faster. Does this meet your requirements.

lazy
  • 744
  • 3
  • 13
1

I think the strategy here should be to treat bookings and entries as dissimilar events. Suppose data is the name of your dataframe with your bookings and entries:

bookings = pd.Series(1, index=data.BOOKING_TIME)
entries = pd.Series(-1, index=data.ENTRY_TIME)
events = pd.concat([bookings, entries]).sort_index()

in_queue = events.cumsum()
in_queue = in_queue.groupby(in_queue.index).min()
in_queue.name = 'IN_QUEUE'

data = data.merge(in_queue, left_on='BOOKING_TIME', right_index=True)

Note the special care we're taking when we group in_queue by its index. We need to do this in case any timestamps are repeated between BOOKING_TIME and ENTRY_TIME to prevent duplicated rows when we merge back in. It's possible min is not the treatment you want, but that's usage specific.

Kyle Parsons
  • 1,475
  • 6
  • 14
1

I have done it with a function and a loop:

def entered_count_till_booking(booking_time, entry_time_series=df['ENTRY_TIME']):
    return sum(booking_time>entry_time_series)
in_queue = []
book_cnt = 1
for tm in df['BOOKING_TIME']:
    in_queue.append(book_cnt - entered_count_till_booking(tm))
    book_cnt +=1
df['in_queue'] = in_queue
Muhammad Rasel
  • 704
  • 4
  • 9
1

An apply() gives nice clean code but there is almost certainly a higher performance approach.

# Minimal example dataframe
df = pandas.DataFrame({'ID':[23239,51042,73373,14222,27116],'BOOKING_TIME': ['1/1/2020 0:00','1/1/2020 0:11','1/1/2020 0:15','1/1/2020 0:22','1/1/2020 0:55'], 'ENTRY_TIME': ['1/1/2020 0:40','1/1/2020 0:42','1/1/2020 0:56','1/1/2020 1:00','1/1/2020 1:15']})
df['BOOKING_TIME'] = pandas.to_datetime(df['BOOKING_TIME'], format='%d/%m/%Y %H:%M')
df['ENTRY_TIME'] = pandas.to_datetime(df['ENTRY_TIME'], format='%d/%m/%Y %H:%M')

# Function to count how many in the queue for each booking time
#  NOTE: I am unsure whether OP wants the second comparison to be < or <=
def count_queue(x, bookings, entries):
    return (bookings <= x).sum() - (entries <= x).sum()

# Apply the function
df['IN_QUEUE'] = df['BOOKING_TIME'].apply(count_queue, bookings=df['BOOKING_TIME'], entries=df['ENTRY_TIME'])

Result:

IN_QUEUE
0 1
1 2
2 3
3 4
4 3
pbarber
  • 139
  • 2
  • 8