3

I am trying to find the minimum distance between each customer to the store. Currently, there are ~1500 stores and ~670K customers in my data. I have to calculate the geo distance for 670K customers x 1500 stores and find the minimum distance for each customer.

I have created the haversine function below:

import numpy as np
def haversine_np(lon1, lat1, lon2, lat2):

    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

    c = 2 * np.arcsin(np.sqrt(a))
    miles = 6367 * c/1.609
    return miles

and my data set looks like below, 1 data frame for the customer (cst_geo) and 1 data frame for the store (store_geo). The numbers below are made up as I can't share the snippet of the real data:

Customer ID Latitude Longitude
A123 39.342 -40.800
B456 38.978 -41.759
C789 36.237 -77.348
Store ID Latitude Longitude
S1 59.342 -60.800
S2 28.978 -71.759
S3 56.237 -87.348

I wrote a for loop below to attempt this calculation but it took >8 hours to run. I have tried to use deco but wasn't able to optimize it any further.

mindist = []
for i in cst_geo.index:
    dist = []
    for j in store_geo.index:
        dist.append(haversine_np(cst_geo.longitude[i], cst_geo.latitude[i],
                                 store_geo.longitude[j], store_geo.latitude[j]))    
    mindist.append(min(dist))
Tomerikoo
  • 18,379
  • 16
  • 47
  • 61
  • Use [`scipy.pdist()`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html)? – Pranav Hosangadi Oct 13 '20 at 16:53
  • You could use a k-d tree. The cKDTree from scipy.spatial is a fast implementation of this data structure. – Rivers Oct 13 '20 at 17:26
  • 2
    See if these help out - https://stackoverflow.com/a/57696524/, https://stackoverflow.com/a/34557996/, https://stackoverflow.com/a/44682708/, https://stackoverflow.com/a/34517218/. – Divakar Oct 13 '20 at 17:28
  • 1
    Thank you so much for these links! this link works perfectly https://stackoverflow.com/questions/44681828/efficient-computation-of-minimum-of-haversine-distances/44682708#44682708 – Jennifer Wu Oct 13 '20 at 18:29

1 Answers1

2

This can be done with geopy

from geopy.distance import geodesic

customers = [
    (39.342, -40.800),
    (38.978, -41.759),
    (36.237, -77.348),
]
stores = [
    (59.342, -60.800),
    (28.978, -71.759),
    (56.237, -87.348),
]
matrix = [[None] * len(customers)] * len(stores)
for index, i in enumerate(customers):
    for j_index, j in enumerate(stores):
        matrix[j_index][index] = geodesic(i, j).meters

output

[[3861568.3809260903, 3831526.290564832, 2347407.258650098, 2347407.258650098], 
[3861568.3809260903, 3831526.290564832, 2347407.258650098, 2347407.258650098],
 [3861568.3809260903, 3831526.290564832, 2347407.258650098, 2347407.258650098]]

you can also have the distance in others units with kilometers, miles, feet ...

AlexisG
  • 2,476
  • 3
  • 11
  • 25
  • The results are quite incorrect; you applied an invalid distance function. You projected the global coordinates to a Cartesian plane. Your results are in incorrectly "stretched" degrees of arc, rather than miles. – Prune Oct 13 '20 at 17:09
  • You're right, I've updated my answer – AlexisG Oct 13 '20 at 18:22
  • Much better! I reversed my vote. – Prune Oct 13 '20 at 18:53
  • Hi @AlexisG and Prune, thank you for your inputs, I tried the code above but it takes really long to run still (>1 hr with less than half completed), so instead i tried nested loop with vectorizer. It completed in less than 1 minute.. It's based on this link (https://stackoverflow.com/questions/44681828/efficient-computation-of-minimum-of-haversine-distances). Thanks again :) – Jennifer Wu Oct 13 '20 at 20:12