
I have a Pandas dataframe in Python where each row contains latitude and longitude coordinates. My goal is to add another column called "close_by" that contains, for each row, a count of how many other entries in the data set are within 1 mile, using the haversine distance.

I have seen guides for similar problems, like https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6, but they involve using df.apply() to add the distance between each row's coordinates and a static pre-defined point. I have not had any luck finding or coming up with a solution for my case.

Essentially, this is what I'm trying to optimize:

import mpu  # provides haversine_distance()

business_data['close_by'] = 0

for index1, row1 in business_data.iterrows():
    for index2, row2 in business_data.iterrows():
        # distance comes back in kilometres; convert to miles
        distance = mpu.haversine_distance(
            (row1['latitude'], row1['longitude']),
            (row2['latitude'], row2['longitude']))
        distance = distance * 0.621371

        if distance <= 1:
            business_data.at[index1, 'close_by'] += 1

I have about 50,000 rows and on my computer it takes about 5 seconds per row.

Thank you for any suggestions!

1 Answer

By the looks of it, mpu.haversine_distance() uses math instead of numpy functions, so it is not vectorizable.
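
The linked vectorized implementation is not reproduced in the answer; the sketch below, assuming the usual NumPy-broadcasting form of the haversine formula returning kilometres, is what haversine_np is taken to be in the code that follows:

import numpy as np

def haversine_np(lon1, lat1, lon2, lat2):
    # Great-circle distance in kilometres; works element-wise on NumPy
    # arrays, so broadcasting a column vector against a row vector
    # yields the full pairwise distance matrix.
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0)**2
    return 6371.0 * 2 * np.arcsin(np.sqrt(a))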

Using this vectorized haversine distance function instead, you can easily vectorize your problem:

import pandas as pd

df = pd.DataFrame([
    {'latitude': 49.001, 'longitude': 11.0},
    {'latitude': 49.0, 'longitude': 11.0},
    {'latitude': 49.001, 'longitude': 11.001},
    {'latitude': -49.0, 'longitude': 11.0},
])


lon = df['longitude'].to_numpy()
lat = df['latitude'].to_numpy()

radius = 1.0  # threshold in kilometres, the unit haversine_np returns (1 mile ≈ 1.609 km)

# Broadcasting builds the full pairwise distance matrix; count the entries
# within the radius and subtract 1 so a point does not count itself.
df['close_by'] = np.count_nonzero(haversine_np(lon, lat, lon[:, None], lat[:, None]) < radius, axis=0) - 1

df
#   latitude    longitude   close_by
# 0 49.001      11.000      2
# 1 49.000      11.000      2
# 2 49.001      11.001      2
# 3 -49.000     11.000      0
Nils Werner
  • Thank you for the solution, it answers my question. However, is there any way to make it a little more memory efficient? It seems to lock up my computer and eventually throws a memory error. – mhhhhhhh Apr 16 '19 at 10:53
  • You can process your dataframe [in batches](https://stackoverflow.com/a/25703030/636626) – Nils Werner Apr 16 '19 at 11:11
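
A minimal sketch of the batching idea from that last comment, reusing lon, lat, radius, haversine_np and df from the answer above (chunk_size is an illustrative value, not part of the original exchange): only one chunk_size × N distance matrix is held in memory at a time instead of the full N × N matrix.

import numpy as np

chunk_size = 1000  # illustrative; tune to the available memory
counts = np.empty(len(df), dtype=int)

for start in range(0, len(df), chunk_size):
    stop = start + chunk_size
    # Distances from this chunk of rows to all rows: a (chunk, N) block
    block = haversine_np(lon[start:stop, None], lat[start:stop, None], lon, lat)
    counts[start:stop] = np.count_nonzero(block < radius, axis=1) - 1

df['close_by'] = counts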