
Thank you for reading this. I have latitude and longitude pairs for many locations, and I need to create a matrix of distances between locations that are within 10 km of each other. (It's okay to fill the matrix with 0 for pairs of locations farther apart than 10 km.)

Data looks like:

place_coordinates = [[lat1, lon1], [lat2, lon2], ...]

I'm currently using the code below to calculate it, but it takes a very long time.

place_correlation = pd.DataFrame(
    squareform(pdist(place_coordinates, metric=haversine)),
    index=place_coordinates,
    columns=place_coordinates
)

When using squareform, I do not know how to skip calculating or storing distances that are beyond 10 km.

What is the fastest way?

Thank you in advance!

ddd
    There are many ways to calculate distance between lat, lon coordinates. What metric and technique do you want to use? – artemis Jan 18 '22 at 01:39
  • Check my answer; if you need a better answer, please provide a minimal working example and clarify your requirements (e.g. how long is the input array? Do you need to use the haversine metric, or are there other possibilities suitable for your case?). – Nerxis Jan 18 '22 at 14:48

1 Answer


First of all, do you need to use the haversine metric for the distance calculation, and which implementation do you use? Using e.g. the Euclidean metric would make the calculation faster, but I guess you have good reasons for choosing haversine.
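For intuition on that trade-off: over distances of a few kilometres, a cheap planar-style approximation stays very close to haversine, which is why a simpler metric can work as a pre-filter. A plain-NumPy sketch (function names are hypothetical, not from the answer):

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, r=6371.0):
    """Great-circle distance in km (plain NumPy, no external package)."""
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    a = np.sin((lat2 - lat1) / 2) ** 2 \
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * r * np.arcsin(np.sqrt(a))

def equirect_km(lat1, lon1, lat2, lon2, r=6371.0):
    """Cheap equirectangular (Euclidean-style) approximation in km."""
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    x = (lon2 - lon1) * np.cos((lat1 + lat2) / 2)
    y = lat2 - lat1
    return r * np.hypot(x, y)

# two points roughly 7 km apart: the approximation is within metres
d_hav = haversine_km(35.0, 139.0, 35.05, 139.05)
d_eqr = equirect_km(35.0, 139.0, 35.05, 139.05)
```

For a 10 km cutoff the approximation error is negligible, so anything the cheap metric rules out can safely be skipped.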

In that case it may be better to use a more optimized implementation of haversine (but I do not know which implementation you use). Check e.g. this SO question.

I guess you are using pdist and squareform from scipy.spatial.distance. When you look at the implementation behind them (here), you will find they use a Python for loop. In that case you could use a vectorized implementation instead (e.g. the one from the linked question above).

import itertools

import numpy as np
from scipy.spatial.distance import pdist, squareform
from haversine import haversine  # pip install haversine


# vectorized haversine from the linked SO question; it takes four separate
# arrays of coordinates (in degrees) and returns distances in km
def haversine_np(lon1, lat1, lon2, lat2):
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 6367 * 2 * np.arcsin(np.sqrt(a))


# original approach
place_coordinates = [(x, y) for x in range(10) for y in range(10)]
d = pdist(place_coordinates, metric=haversine)

# approach using combinations
place_coordinates_comb = itertools.combinations(place_coordinates, 2)
d2 = [haversine(x, y) for (x, y) in place_coordinates_comb]

# ensure that using combinations gives the same results as pdist
np.testing.assert_array_equal(d, d2)

# vectorized version
# 1) build the pairs and flatten them, because haversine_np takes the
#    coordinates as four separate arrays rather than (lat, lon) pairs
place_coordinates_comb = itertools.combinations(place_coordinates, 2)
place_coordinates_comb_flatten = [(*x, *y) for (x, y) in place_coordinates_comb]
# 2) split into the four arrays required by haversine_np (the pairs are
#    (lat, lon), so unpack in that order)
lat1, lon1, lat2, lon2 = np.array(place_coordinates_comb_flatten).T
# 3) vectorized computation
d_vect = haversine_np(lon1, lat1, lon2, lat2)

# the result differs slightly from the haversine package (it uses a slightly
# different Earth radius), so compare with a tolerance instead of exact equality
np.testing.assert_allclose(d, d_vect, rtol=1e-2)

When you compare the times (absolute numbers will differ depending on the machine used):

%timeit pdist(place_coordinates, metric=haversine)
# 15.7 ms ± 364 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit haversine_np(lon1, lat1, lon2, lat2)
# 241 µs ± 7.07 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

That's quite a lot (~60x faster). When you have a really long array (how many coordinates are you using?), this can help you a lot.

Finally, you can plug it into your original code:

place_correlation = pd.DataFrame(squareform(d_vect), index=place_coordinates, columns=place_coordinates)
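As for the 10 km requirement from the question: since pairs farther apart may simply be 0, one option (a sketch, not part of the original answer) is to threshold the dense matrix after building it:

```python
import numpy as np
from scipy.spatial.distance import squareform

# hypothetical pairwise distances in km, in the condensed form returned
# by pdist or by the vectorized haversine above (3 points -> 3 pairs)
d_vect = np.array([5.0, 12.0, 3.0])

dist_matrix = squareform(d_vect)     # full symmetric matrix
dist_matrix[dist_matrix > 10] = 0.0  # zero out pairs farther than 10 km
```

This still computes every pairwise distance; it only avoids keeping the far ones, so it helps memory interpretation but not runtime.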

An additional improvement could be to use another metric (e.g. Euclidean, which is faster to compute) to quickly determine which distances are beyond 10 km, and then calculate haversine only for the rest.
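If you want to avoid computing the far pairs at all, a spatial index is another route (an alternative sketch, not from the original answer): scikit-learn's BallTree supports the haversine metric and radius queries, assuming scikit-learn is available and coordinates are (lat, lon) in degrees:

```python
import numpy as np
from sklearn.neighbors import BallTree

# hypothetical (lat, lon) coordinates in degrees; the first two points
# are ~5.6 km apart, the third is far away
place_coordinates = np.array([[35.0, 139.0], [35.05, 139.0], [36.0, 140.0]])

# BallTree's haversine metric works in radians and returns distances on
# the unit sphere, so convert degrees -> radians and scale the 10 km
# search radius by the Earth's radius (~6371 km)
earth_radius_km = 6371.0
tree = BallTree(np.radians(place_coordinates), metric="haversine")
idx, dist = tree.query_radius(
    np.radians(place_coordinates),
    r=10.0 / earth_radius_km,
    return_distance=True,
)

# idx[i] holds the indices of all points within 10 km of point i,
# dist[i] the matching distances (multiply by earth_radius_km for km)
```

This returns only the nearby pairs, which is a natural fit for a sparse representation instead of a dense matrix full of zeros.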

Nerxis
  • Thank you so much! Even in a situation where I need the distance in km, if it is 60 times faster, it is worth converting from Euclidean back to km. Thank you so much. – ddd Jan 19 '22 at 04:10
  • The difference is between the original implementation and the vectorized implementation, sorry if I was not clear enough. But even when switching the metric to "euclidean" in the original implementation (`pdist` with a for loop), it's even faster (it gives me something like 150x). I would need more measurements, but switching to Euclidean will definitely help you. – Nerxis Jan 19 '22 at 07:22
  • The 150x difference was huge for me. It really helped a lot! Thank you so much! – ddd Feb 06 '22 at 02:47