Fastest way to count rows that match based on values of each row in a pandas dataframe?

Question

I am using pandas to compare the values of each row to the rest of the dataframe and count the number of matches. I have a working solution, but it is quite slow, because it iterates through each row, makes the comparison, and counts. I am working with large data sets and would benefit from a quicker calculation. Here is my current solution:

def trip_stop_visits(df, row):
    return len(df.loc[(df['trip_id']  ==  row['trip_id']) &
                      (df['svc_date'] ==  row['svc_date']) &
                      (df['stop_id']  ==  row['stop_id']) &
                      (df['sch_time'] <=  row['sch_time'])])

bus_apc['visit_count'] = [trip_stop_visits(bus_apc, row) for index, row in bus_apc.iterrows()]

I am wondering if there is some way to make solutions like these vectorized. The key problem is that the comparison values change for each row, and most vectorized solutions I've found use preset comparison values. I am having trouble conceiving of any solution that does not start with "get this row and compare its values to the rest of the dataframe".

If there isn't a vectorized solution, I'd like to know how to improve the efficiency of iterating through every row. Thanks in advance.

You can do this by merging the DataFrame with itself `on=['trip_id', 'svc_date', 'stop_id']` and then check how many rows satisfy the condition with `sch_time_x <= sch_time_y`, a vectorized check. You'll need to bring along the index, exclude same row merges, and then join the result back to the original based on the Index. See, for instance, https://stackoverflow.com/questions/53506606/count-how-many-rows-have-date-within-date-range-of-each-row-for-each-id-pandas/53506993#53506993 — ALollz, Jun 03 '21 at 17:52
If you provide sample data (https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and your expected output for that sample, I'm sure someone will provide the necessary code to accomplish the calculation — ALollz, Jun 03 '21 at 17:55
Thanks @ALollz, that's a great idea. I think I can accomplish this based on what you've described and the code from your linked solution. Much appreciated. — dfarias, Jun 03 '21 at 18:55

Fastest way to count rows that match based on values of each row in a pandas dataframe?

0 Answers0