I am using pandas to compare the values of each row to the rest of the dataframe and count the number of matches. I have a working solution, but it is quite slow, because it iterates through each row, makes the comparison, and counts. I am working with large data sets and would benefit from a quicker calculation. Here is my current solution:
def trip_stop_visits(df, row):
return len(df.loc[(df['trip_id'] == row['trip_id']) &
(df['svc_date'] == row['svc_date']) &
(df['stop_id'] == row['stop_id']) &
(df['sch_time'] <= row['sch_time'])])
bus_apc['visit_count'] = [trip_stop_visits(bus_apc, row) for index, row in bus_apc.iterrows()]
I am wondering if there is some way to make solutions like these vectorized. The key problem is that the comparison values change for each row, and most vectorized solutions I've found use preset comparison values. I am having trouble conceiving of any solution that does not start with "get this row and compare its values to the rest of the dataframe".
If there isn't a vectorized solution, I'd like to know how to improve the efficiency of iterating through every row. Thanks in advance.