I have a pandas DataFrame in Python that contains latitude and longitude coordinates in each row. My goal is to add another column called "close_by" that holds the number of other entries in the data set that are within 1 mile, measured by haversine distance.
I have seen guides for similar problems, such as https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6, but they use df.apply() to compute, for each row, the distance between its coordinates and a single static, predefined point. I have not had any luck finding or coming up with a solution for the case where every row is compared against every other row.
Essentially, this is what I'm trying to optimize:
import mpu

for index1, row1 in business_data.iterrows():
    for index2, row2 in business_data.iterrows():
        if index1 == index2:
            continue  # do not count a row as its own neighbor
        # mpu.haversine_distance returns kilometers
        distance = mpu.haversine_distance((business_data.at[index1,'latitude'], business_data.at[index1,'longitude']), (business_data.at[index2,'latitude'], business_data.at[index2,'longitude']))
        distance = distance * 0.621371  # kilometers to miles
        if distance <= 1:
            # row1 is a copy, so increment the DataFrame cell directly
            business_data.at[index1,'close_by'] += 1
I have about 50,000 rows, and on my computer the loop takes about 5 seconds per row.
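For reference, one direction that might help is to replace the inner Python loop with NumPy broadcasting, computing the haversine distance from a block of rows to all rows at once. The following is only a sketch under assumptions, not a tested solution for this data set: the function name `count_close_by`, the `chunk_size` parameter, and the Earth radius constant of 3958.8 miles are all illustrative choices that do not come from mpu or pandas.

```python
import numpy as np

# Assumed mean Earth radius in miles (roughly 6371 km * 0.621371).
EARTH_RADIUS_MILES = 3958.8


def count_close_by(lat_deg, lon_deg, radius_miles=1.0, chunk_size=500):
    """For each point, count the other points within radius_miles (haversine)."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    n = lat.shape[0]
    cos_lat = np.cos(lat)
    counts = np.zeros(n, dtype=np.int64)

    # Work in blocks of rows so the (chunk, n) distance matrix fits in memory.
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # Broadcast a (chunk, 1) block against all (1, n) points.
        dlat = lat[start:stop, None] - lat[None, :]
        dlon = lon[start:stop, None] - lon[None, :]
        a = (np.sin(dlat / 2) ** 2
             + cos_lat[start:stop, None] * cos_lat[None, :] * np.sin(dlon / 2) ** 2)
        # Clip guards against tiny floating-point excursions outside [0, 1].
        dist = 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))
        # Each point is at distance 0 from itself, so subtract that one match.
        counts[start:stop] = (dist <= radius_miles).sum(axis=1) - 1
    return counts
```

With the DataFrame from the question, this could presumably be used as business_data["close_by"] = count_close_by(business_data["latitude"].to_numpy(), business_data["longitude"].to_numpy()). It still compares every pair, so the work grows quadratically with the number of rows; it only moves that work out of the Python interpreter. A spatial index would be the next thing to look at if this is still too slow.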
Thank you for any suggestions!