1

I have a pandas dataframe A, with latitude longitudes.

import pandas as pd

df_a = pd.DataFrame([['b',1.591797,103.857887],
                 ['c',1.589416, 103.865322]],
                columns = ['place','lat','lng'])

I have another dataframe of locations B, also with latitude longitudes.

df_b = pd.DataFrame([['ref1',1.594832, 103.853703],
                 ['ref1',1.589749, 103.864678]],
                columns = ['place','lat','lng'])

For every row in A, i want to find the closest matching row in B (subject to distance limit). --> I already have a function that calculates the distance between two GPS pairs

intended output

# a list where each row is the corresponding closest index in B
In [13]: min_index_arr
Out[13]: [0, 1]

One way of doing this is:

def haversine(pair1, pair2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    lon1, lat1 = pair1
    lon2, lat2 = pair2
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    r = 6371 # Radius of earth in kilometers. Use 3956 for miles
    return c * r

import operator
min_vals = []
for i in df_a.index:
    pair1 = df_a['lat'][i], df_a['lng'][i]
    dist_array = []
    for j in df_b.index:
        pair2 = df_b['lat'][j], df_b['lng'][j]
        dist = haversine(pair1, pair2)
        dist_array.append(dist)
    min_index, min_value = min(enumerate(dist_array), key=operator.itemgetter(1))
    min_vals.append(max_index)

But im sure there's a faster way to do this, it seems very similar to an outer product, except not a product, and using a function. Does anyone know how?

Wboy
  • 2,452
  • 2
  • 24
  • 45
  • Are you sure you don't want `min(enumerate(dist_array), key=operator.itemgetter(1))`--i.e. closest is smallest distance? – DarrylG May 16 '20 at 14:34
  • @DarrylG whoops sorry good catch! did this in a rush – Wboy May 16 '20 at 14:34
  • 1
    Assuming large datasets--perhaps this would help [Find the nearest point in distance for all the points in the dataset - Python](https://stackoverflow.com/questions/45127141/find-the-nearest-point-in-distance-for-all-the-points-in-the-dataset-python) or perhaps better [check answer by backbord](https://stackoverflow.com/questions/10549402/kdtree-for-longitude-latitude) – DarrylG May 16 '20 at 14:41
  • scipy.spatial.kdtree – Joe May 16 '20 at 15:35
  • 1
    This answer contains a vectorized implementation of a *minimum* distance calculation: https://stackoverflow.com/a/59010681/1718575 – NicholasM May 16 '20 at 17:00
  • @NicholasM yes this was the approach I was looking for! thank you! – Wboy May 17 '20 at 10:40

1 Answers1

1

Using an approach from KDTree for longitude/latitude

Based upon sklearn.balltree

Code

# Setup Balltree using df_b as reference dataset
bt = BallTree(np.deg2rad(df_b[['lat', 'lng']].values), metric='haversine')

# Setup distance queries
query_lats = df_a['lat']
query_lons = df_a['lng']

# Find closest city in reference dataset for each city in a
distances, indices = bt.query(np.deg2rad(np.c_[query_lats, query_lons]))

# Result
r_km = 6371
for p, d, i in zip(df_a['place'][:], distances.flatten(), indices.flatten()):
  print(f"Place {p} closest to {df_b['place'][i]} with distance {d*r_km:.4f} km")

Output

Place b closest to ref1 with distance 0.5746 km
Place c closest to ref2 with distance 0.0806 km
DarrylG
  • 16,732
  • 2
  • 17
  • 23
  • Hello, thanks for the answer! I'm aware of this method, but i was talking more about the abstract method of applying a function in an outer product-esuqe way, or a vector implementation – Wboy May 17 '20 at 10:40