Faster Outer-Product-Like find closest lat long with function pandas

Question

I have a pandas dataframe A, with latitude longitudes.

import pandas as pd

df_a = pd.DataFrame([['b',1.591797,103.857887],
                 ['c',1.589416, 103.865322]],
                columns = ['place','lat','lng'])

I have another dataframe of locations B, also with latitude longitudes.

df_b = pd.DataFrame([['ref1',1.594832, 103.853703],
                 ['ref1',1.589749, 103.864678]],
                columns = ['place','lat','lng'])

For every row in A, i want to find the closest matching row in B (subject to distance limit). --> I already have a function that calculates the distance between two GPS pairs

intended output

# a list where each row is the corresponding closest index in B
In [13]: min_index_arr
Out[13]: [0, 1]

One way of doing this is:

def haversine(pair1, pair2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    lon1, lat1 = pair1
    lon2, lat2 = pair2
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    r = 6371 # Radius of earth in kilometers. Use 3956 for miles
    return c * r

import operator
min_vals = []
for i in df_a.index:
    pair1 = df_a['lat'][i], df_a['lng'][i]
    dist_array = []
    for j in df_b.index:
        pair2 = df_b['lat'][j], df_b['lng'][j]
        dist = haversine(pair1, pair2)
        dist_array.append(dist)
    min_index, min_value = min(enumerate(dist_array), key=operator.itemgetter(1))
    min_vals.append(max_index)

But im sure there's a faster way to do this, it seems very similar to an outer product, except not a product, and using a function. Does anyone know how?

Are you sure you don't want `min(enumerate(dist_array), key=operator.itemgetter(1))`--i.e. closest is smallest distance? — DarrylG, May 16 '20 at 14:34
Assuming large datasets--perhaps this would help [Find the nearest point in distance for all the points in the dataset - Python](https://stackoverflow.com/questions/45127141/find-the-nearest-point-in-distance-for-all-the-points-in-the-dataset-python) or perhaps better [check answer by backbord](https://stackoverflow.com/questions/10549402/kdtree-for-longitude-latitude) — DarrylG, May 16 '20 at 14:41
This answer contains a vectorized implementation of a *minimum* distance calculation: https://stackoverflow.com/a/59010681/1718575 — NicholasM, May 16 '20 at 17:00
@NicholasM yes this was the approach I was looking for! thank you! — Wboy, May 17 '20 at 10:40

DarrylG · Answer 1 · 2020-05-16T16:48:28.160

Using an approach from KDTree for longitude/latitude

Based upon sklearn.balltree

Code

# Setup Balltree using df_b as reference dataset
bt = BallTree(np.deg2rad(df_b[['lat', 'lng']].values), metric='haversine')

# Setup distance queries
query_lats = df_a['lat']
query_lons = df_a['lng']

# Find closest city in reference dataset for each city in a
distances, indices = bt.query(np.deg2rad(np.c_[query_lats, query_lons]))

# Result
r_km = 6371
for p, d, i in zip(df_a['place'][:], distances.flatten(), indices.flatten()):
  print(f"Place {p} closest to {df_b['place'][i]} with distance {d*r_km:.4f} km")

Output

Place b closest to ref1 with distance 0.5746 km
Place c closest to ref2 with distance 0.0806 km

Hello, thanks for the answer! I'm aware of this method, but i was talking more about the abstract method of applying a function in an outer product-esuqe way, or a vector implementation — Wboy, May 17 '20 at 10:40

Faster Outer-Product-Like find closest lat long with function pandas

1 Answers1