What is the most efficient way to count the number of instances occurring within a time frame in python?

Question

I am trying to run a simple count function which runs a dataframe of event times (specifically surgeries) against another dataframe of shift time frames, and returns a list of how many events occur during each shift. These csvs are thousands of rows, though, so while the way I have it set up currently works, it takes forever. This is what I have:

numSurgeries = [0 for shift in range(len(df.Date))]

for i in range(len(OR['PATIENT_IN_ROOM_DTTM'])):
    for shift in range(len(df.DateTime)):
        if OR['PATIENT_IN_ROOM_DTTM'][i] >= df.DateTime[shift] and OR['PATIENT_IN_ROOM_DTTM'][i] < df.DateTime[shift+1]:
            numSurgeries[shift] += 1

So it loops through each event and checks to see which shift time frame it is in, then increments the count for that time frame. Logical, works, but definitely not efficient.

EDIT:

Example of OR data file

Example of df data file

Welcome to StackOverflow! Please read about [how to ask a question](https://stackoverflow.com/help/how-to-ask) (particularly [how to create a good example](https://stackoverflow.com/help/mcve)) in order to get good responses. — Alex, Jan 25 '18 at 17:01
In your case please provide an example DataFrame (`OR.head()`) and the expected output. — Alex, Jan 25 '18 at 17:01

score 0 · Accepted Answer · answered Jan 25 '18 at 17:11

0

Without example data, it's not absolutely clear what you want. But this should help you vectorise:

numSurgeries = {shift: np.sum((OR['PATIENT_IN_ROOM_DTTM'] >= df.DateTime[shift]) & \
                       (OR['PATIENT_IN_ROOM_DTTM'] < df.DateTime[shift+1])) \
                       for shift in range(len(df.Date))}

The output is a dictionary mapping integer shift to numSurgeries.

answered Jan 25 '18 at 17:11

jpp

159,742
34
281
339

Thank you! I have also attached example data as pictures in the edit – Charles Bushrow Jan 25 '18 at 18:00
@CharlesBushrow, OK, the main thing you need to make sure is columns are appropriately type-casted in pandas, otherwise the comparison operations won't work. – jpp Jan 25 '18 at 18:03
Great! Yeah with this setup it's flying now – Charles Bushrow Jan 25 '18 at 18:59
@CharlesBushrow, excellent - do accept the answer if it helped. – jpp Jan 25 '18 at 19:30

Adam J. Gramling · Answer 2 · 2018-01-25T19:01:11.100

As mentioned above, it is hard to answer without example data.

However, a boolean mask sounds fitting. See Select dataframe rows between two dates.

Create a date mask from shift, we'll call the start and end dates start_shift and end_shift respectively. These should be in datetime format.

date_mask = (df['datetime'] >= start_shift) & (df['datetime'] <= end_shift)

Locate all values in df that fit this range.

df_shift = df.loc[date_mask]

Count the number of instances in the new df_shift.

num_surgeries = len(df_shift.index())

Cycle through all shifts.

def count_shifts(df, shift, results_df, start_shift, end_shift):

    date_mask = (df['datetime'] >= start_shift) & (df['datetime'] <= end_shift)
    df_shift = df.loc[date_mask]
    num_surgeries = len(df_shift.index())

    return(num_surgeries)

# iterates through df and applies the above function to every row
results_df['num_surgeries'] = results_df.apply(calculate_num_surgeries,axis=1)

Also remember to name variables according to PEP8 Style Guide! Camelcase is not recommended in Python.

That makes sense! I attached example data as pictures in the edit — Charles Bushrow, Jan 25 '18 at 18:02

What is the most efficient way to count the number of instances occurring within a time frame in python?

2 Answers2