12

I have latitude and longitude data, and I need to calculate the distance matrix between two arrays of locations. I used the haversine formula to get the distance between two locations given their latitude and longitude.

Here is an example of my code:

import numpy as np
import math

def get_distances(locs_1, locs_2):
    n_rows_1 = locs_1.shape[0]
    n_rows_2 = locs_2.shape[0]
    dists = np.empty((n_rows_1, n_rows_2))
    # The loops here are inefficient
    for i in range(n_rows_1):
        for j in range(n_rows_2):
            dists[i, j] = get_distance_from_lat_long(locs_1[i], locs_2[j])
    return dists


def get_distance_from_lat_long(loc_1, loc_2):

    earth_radius = 3958.75

    lat_dif = math.radians(loc_1[0] - loc_2[0])
    long_dif = math.radians(loc_1[1] - loc_2[1])
    sin_d_lat = math.sin(lat_dif / 2)
    sin_d_long = math.sin(long_dif / 2)
    step_1 = (sin_d_lat ** 2) + (sin_d_long ** 2) * math.cos(math.radians(loc_1[0])) * math.cos(math.radians(loc_2[0])) 
    step_2 = 2 * math.atan2(math.sqrt(step_1), math.sqrt(1-step_1))
    dist = step_2 * earth_radius

    return dist

My expected output is this:

>>> locations_1 = np.array([[34, -81], [32, -87], [35, -83]])
>>> locations_2 = np.array([[33, -84], [39, -81], [40, -88], [30, -80]])
>>> get_distances(locations_1, locations_2)
array([[ 186.13522573,  345.46610882,  566.23466349,  282.51056676],
       [ 187.96657622,  589.43369894,  555.55312473,  436.88855214],
       [ 149.5853537 ,  297.56950329,  440.81203371,  387.12153747]])

Performance is important to me; one thing I could do is use Cython to speed up the loops, but it would be nice if I didn't have to go there.

Is there a module that can do something like this? Or any other solution?

Akavall
  • For starters, there are many things you compute multiple times - like the conversion from degrees to radians, etc. Usually, `*0.5` is faster than `/2` but I don't know how much that matters. But I guess the real question is - is the looping the thing that is costing time, or the time spent in the function? Did you try to benchmark this at all? Benchmarking is always the first step... – Floris Oct 16 '13 at 20:37
  • Do you have numba by chance? How large is your actual system? – Daniel Oct 16 '13 at 21:04

4 Answers

13

There are a lot of suboptimal things in the Haversine equations you are using. You can trim some of them and minimize the number of sines, cosines, and square roots you need to calculate. The following is the best I have been able to come up with; on my system it runs about 5x faster than Ophion's code (which does mostly the same thing as far as vectorization goes) on two random arrays of 1000 and 2000 elements:

def spherical_dist(pos1, pos2, r=3958.75):
    pos1 = pos1 * np.pi / 180
    pos2 = pos2 * np.pi / 180
    cos_lat1 = np.cos(pos1[..., 0])
    cos_lat2 = np.cos(pos2[..., 0])
    cos_lat_d = np.cos(pos1[..., 0] - pos2[..., 0])
    cos_lon_d = np.cos(pos1[..., 1] - pos2[..., 1])
    return r * np.arccos(cos_lat_d - cos_lat1 * cos_lat2 * (1 - cos_lon_d))

If you feed it your two arrays "as is", it will complain, but that's not a bug, it's a feature. Basically, this function computes the great-circle distance over the last dimension and broadcasts over the rest. So you can get what you are after as:

>>> spherical_dist(locations_1[:, None], locations_2)
array([[ 186.13522573,  345.46610882,  566.23466349,  282.51056676],
       [ 187.96657622,  589.43369894,  555.55312473,  436.88855214],
       [ 149.5853537 ,  297.56950329,  440.81203371,  387.12153747]])

But it could also be used to calculate the distances between two lists of points, i.e.:

>>> spherical_dist(locations_1, locations_2[:-1])
array([ 186.13522573,  589.43369894,  440.81203371])

Or between two single points:

>>> spherical_dist(locations_1[0], locations_2[0])
186.1352257300577

This is inspired by how gufuncs work; once you get used to it, it is a wonderful "Swiss Army knife" coding style that lets you reuse a single function in lots of different settings.
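
For example (this illustration is mine, not part of the original answer), the same function gives a full self-distance matrix for a single set of points, and the `r` argument can be overridden to get kilometres instead of miles:

# full self-distance matrix of locations_1 (zeros on the diagonal)
self_dists = spherical_dist(locations_1[:, None], locations_1)

# the same pairwise matrix in kilometres, passing the commonly quoted
# mean Earth radius of 6371 km instead of the default radius in miles
dists_km = spherical_dist(locations_1[:, None], locations_2, r=6371.0)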

Jaime
  • In general is it faster to use `pos1=np.deg2rad(pos1)` or `pos1 = pos1 * np.pi / 180`? – rtrwalker Oct 16 '13 at 22:02
  • `np.deg2rad` is faster, but you can get both approaches to run equally fast if you precalculate the conversion factor, i.e. `k = np.pi / 180; pos1 = pos1 * k`. In any case, the time spent there is negligible compared to what goes on elsewhere in the function call. – Jaime Oct 16 '13 at 22:21
  • I have not seen this trick `pos1[..., 0]` before; what is this kind of indexing called? – Akavall Oct 16 '13 at 22:56
  • `...` is the `Ellipsis` object; its use is explained in [the docs](http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#basic-slicing). Basically, it is equivalent to as many `:` as needed to provide an index for every dimension of the array, and it is the simplest way of indexing the last dimensions. – Jaime Oct 16 '13 at 23:11
  • A bit late to the party, but for anybody getting confused about the results: the result is in miles. To get kilometres, simply replace r with 6357 – C Hecht Sep 08 '22 at 09:01
  • For the mean radius in km, 6371 is usually used. https://en.wikipedia.org/wiki/Earth_radius – C. Yduqoli Feb 14 '23 at 14:13
6

It is more efficient to use meshgrid to replace the double for loop:

import numpy as np

earth_radius = 3958.75

def get_distances(locs_1, locs_2):
    # indexing='ij' keeps the output shape (len(locs_1), len(locs_2))
    lats1, lats2 = np.meshgrid(locs_1[:, 0], locs_2[:, 0], indexing='ij')
    lons1, lons2 = np.meshgrid(locs_1[:, 1], locs_2[:, 1], indexing='ij')

    lat_dif = np.radians(lats1 - lats2)
    long_dif = np.radians(lons1 - lons2)

    sin_d_lat = np.sin(lat_dif / 2.)
    sin_d_long = np.sin(long_dif / 2.)

    # use the full latitude grids here, not a single row of them
    step_1 = (sin_d_lat ** 2) + (sin_d_long ** 2) * np.cos(np.radians(lats1)) * np.cos(np.radians(lats2))
    step_2 = 2 * np.arctan2(np.sqrt(step_1), np.sqrt(1 - step_1))

    dist = step_2 * earth_radius

    return dist
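
As a possible refinement (my suggestion, not part of the original answer, and the function name `get_distances_sparse` is made up), `np.meshgrid` accepts `sparse=True`, which returns broadcastable (n1, 1) and (1, n2) grids instead of four fully materialized n1 x n2 arrays; the arithmetic then only expands to the full matrix where needed:

import numpy as np

earth_radius = 3958.75  # Earth radius in miles

def get_distances_sparse(locs_1, locs_2):
    # sparse grids broadcast against each other instead of being materialized
    lats1, lats2 = np.meshgrid(locs_1[:, 0], locs_2[:, 0], indexing='ij', sparse=True)
    lons1, lons2 = np.meshgrid(locs_1[:, 1], locs_2[:, 1], indexing='ij', sparse=True)

    sin_d_lat = np.sin(np.radians(lats1 - lats2) / 2.)
    sin_d_long = np.sin(np.radians(lons1 - lons2) / 2.)

    step_1 = sin_d_lat ** 2 + sin_d_long ** 2 * np.cos(np.radians(lats1)) * np.cos(np.radians(lats2))
    return 2 * earth_radius * np.arctan2(np.sqrt(step_1), np.sqrt(1 - step_1))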
ManuGoacolou
5

This is simply vectorizing your code:

def new_get_distances(loc1, loc2):
    earth_radius = 3958.75

    # work on radian copies so the original arrays are untouched
    locs_1 = np.deg2rad(loc1)
    locs_2 = np.deg2rad(loc2)

    # half-angle differences, broadcast to an (n1, n2) matrix
    lat_dif = (locs_1[:, 0][:, None] / 2 - locs_2[:, 0] / 2)
    lon_dif = (locs_1[:, 1][:, None] / 2 - locs_2[:, 1] / 2)

    # in-place operations avoid allocating extra temporary arrays
    np.sin(lat_dif, out=lat_dif)
    np.sin(lon_dif, out=lon_dif)

    np.power(lat_dif, 2, out=lat_dif)
    np.power(lon_dif, 2, out=lon_dif)

    lon_dif *= (np.cos(locs_1[:, 0])[:, None] * np.cos(locs_2[:, 0]))
    lon_dif += lat_dif

    np.arctan2(np.power(lon_dif, .5), np.power(1 - lon_dif, .5), out=lon_dif)
    lon_dif *= (2 * earth_radius)

    return lon_dif

locations_1 = np.array([[34, -81], [32, -87], [35, -83]])
locations_2 = np.array([[33, -84], [39, -81], [40, -88], [30, -80]])
old = get_distances(locations_1, locations_2)

new = new_get_distances(locations_1,locations_2)

np.allclose(old,new)
True

If we look at timings:

%timeit new_get_distances(locations_1,locations_2)
10000 loops, best of 3: 80.6 µs per loop

%timeit get_distances(locations_1,locations_2)
10000 loops, best of 3: 74.9 µs per loop

It is actually slower for a small example; however, let's look at a larger example:

locations_1 = np.random.rand(1000,2)

locations_2 = np.random.rand(1000,2)

%timeit get_distances(locations_1,locations_2)
1 loops, best of 3: 5.84 s per loop

%timeit new_get_distances(locations_1,locations_2)
10 loops, best of 3: 149 ms per loop

We now have a speedup of about 40x. We can probably squeeze out some more speed in a few places.

Edit: Made a few updates to cut out redundant calculations and to make it clear that we are not altering the original location arrays.

Daniel
4

Does the Haversine formula provide good enough accuracy for your use? It can be off by quite a bit. I think you'd be able to get both accuracy and speed if you use proj.4, in particular the python bindings, pyproj. Note that pyproj can work directly on numpy arrays of coordinates.
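
A minimal sketch of what that might look like (my illustration, not from the answer; it assumes a reasonably recent pyproj, and names like `geod` and `dist_miles` are mine). `Geod.inv` takes matched arrays of longitudes and latitudes and returns distances in metres on the chosen ellipsoid:

import numpy as np
from pyproj import Geod

geod = Geod(ellps='WGS84')  # geodesic calculations on the WGS84 ellipsoid

locations_1 = np.array([[34., -81.], [32., -87.], [35., -83.]])
locations_2 = np.array([[33., -84.], [39., -81.], [40., -88.], [30., -80.]])

# Geod.inv expects matched 1-D arrays, so expand to the full pairwise grid
lat1, lat2 = np.meshgrid(locations_1[:, 0], locations_2[:, 0], indexing='ij')
lon1, lon2 = np.meshgrid(locations_1[:, 1], locations_2[:, 1], indexing='ij')

# inv returns forward azimuth, back azimuth, and distance in metres
_, _, dist_m = geod.inv(lon1.ravel(), lat1.ravel(), lon2.ravel(), lat2.ravel())
dist_miles = np.asarray(dist_m).reshape(lat1.shape) / 1609.344  # metres -> miles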

Alex I
  • Aren't the Haversine formulas just spherical trigonometry? What's the source of the errors? The lack of sphericity of the Earth? Do you know what corrections `pyproj` uses? – Jaime Oct 16 '13 at 22:39
  • @Jaime: Indeed, lack of sphericity, and also the [datum](http://en.wikipedia.org/wiki/Datum_%28geodesy%29) corrections, which refer to deviations from a perfect ellipsoid shape. This may seem like not a big deal, and it probably isn't if you're just doing (e.g.) a sort by distance, but it is a huge deal if you're navigating. – Alex I Oct 16 '13 at 23:13
  • @Jaime - Just for comparison, in this particular case, the differences between the spherical results and the results using the WGS84 datum are ~0.5 miles (so less than 1%). That's insignificant for some things but important for others. A classic example of this is people thinking that a lat, long uniquely specifies a location. Without knowing the datum and ellipsoid that the lat, long was referenced to, you can only get to within ~1 km. At any rate, I'm rambling, but proj.4 is the de facto standard library for this sort of thing, and `pyproj` is a particularly nice Python binding for it. – Joe Kington Oct 17 '13 at 02:43
  • This is interesting and useful information (+1), but for my purposes the Haversine formula should be fine. – Akavall Oct 17 '13 at 02:51