Python: speeding up geographic comparison

Question

I've written some code that includes a nested loop where the inner loop is executed about 1.5 million times. I have a function in this loop that I'm trying to optimize. I've done some work, and got some results, but I need a little input to check if what I'm doing is sensible.

Some background:

I have two collections of geographic points (latitude, longitude), one relatively small collection and one relatively huge collection. For every point in the small collection, I need to find the closest point in the large collection.

The obvious way to do this would be to use the haversine formula. The benefit here is that the distances are definitely accurate.

from math import radians, sin, cos, asin, sqrt

def haversine(point1, point2):
    """Gives the distance between two points on earth.
    """
    earth_radius_miles = 3956
    lat1, lon1 = (radians(coord) for coord in point1)
    lat2, lon2 = (radians(coord) for coord in point2)
    dlat, dlon = (lat2 - lat1, lon2 - lon1)
    a = sin(dlat/2.0)**2 + cos(lat1) * cos(lat2) * sin(dlon/2.0)**2
    great_circle_distance = 2 * asin(min(1,sqrt(a)))
    d = earth_radius_miles * great_circle_distance
    return d

However, running this 1.5 million times takes about 9 seconds on my machine (according to timeit). Since I don't need an accurate distance, only the closest point, I decided to try some other functions.

A simple implementation of the Pythagorean theorem gives me a speedup of about 30%. Thinking that I could do better, I wrote the following:

def dumb(point1, point2):
    lat1, lon1 = point1
    lat2, lon2 = point2
    d = abs((lat2 - lat1) + (lon2 - lon1))
    return d

which gives me a factor of 10 improvement. However, now I'm worried that this will not preserve the triangle inequality.

So, my final question is twofold: I'd like a function that runs as fast as dumb but is still correct. Will dumb work? If not, any suggestions on how to improve my haversine function?

score 22 · Answer 1 · edited Feb 15 '23 at 19:00

This is the kind of calculation that numpy is really good at. Rather than looping over the entire large set of coordinates, you can compute the distance between a single point and the entire dataset in a single calculation. With my tests below, you can get an order of magnitude speed increase.

Here are some timing tests with your haversine method, your dumb method (not really sure what that does) and my numpy haversine method. Each computes the distance between two points, one in Virginia and one in California, that are 2293 miles apart.

from math import radians, sin, cos, asin, sqrt, pi, atan2
import numpy as np
import itertools

earth_radius_miles = 3956.0

def haversine(point1, point2):
    """Gives the distance between two points on earth.
    """
    lat1, lon1 = (radians(coord) for coord in point1)
    lat2, lon2 = (radians(coord) for coord in point2)
    dlat, dlon = (lat2 - lat1, lon2 - lon1)
    a = sin(dlat/2.0)**2 + cos(lat1) * cos(lat2) * sin(dlon/2.0)**2
    great_circle_distance = 2 * asin(min(1,sqrt(a)))
    d = earth_radius_miles * great_circle_distance
    return d

def dumb(point1, point2):
    lat1, lon1 = point1
    lat2, lon2 = point2
    d = abs((lat2 - lat1) + (lon2 - lon1))
    return d
    
def get_shortest_in(needle, haystack):
    """needle is a single (lat,long) tuple.
        haystack is a numpy array of (lat,long) rows.
        Returns the shortest distance (in miles) from needle
        to any point in haystack.
    """
    dlat = np.radians(haystack[:,0]) - radians(needle[0])
    dlon = np.radians(haystack[:,1]) - radians(needle[1])
    a = np.square(np.sin(dlat/2.0)) + cos(radians(needle[0])) * np.cos(np.radians(haystack[:,0])) * np.square(np.sin(dlon/2.0))
    great_circle_distance = 2 * np.arcsin(np.minimum(np.sqrt(a), np.repeat(1, len(a))))
    d = earth_radius_miles * great_circle_distance
    return np.min(d)
    
    
x = (37.160316546736745, -78.75)
y = (39.095962936305476, -121.2890625)

def dohaversine():
    for i in range(100000):
        haversine(x,y)
        
def dodumb():
    for i in range(100000):
        dumb(x,y)
        
lots = np.array(list(itertools.repeat(y, 100000)))
def donumpy():
    get_shortest_in(x, lots)

from timeit import Timer
print('haversine distance =', haversine(x,y), 'time =', end=' ')
print(Timer("dohaversine()", "from __main__ import dohaversine").timeit(100))
print('dumb distance =', dumb(x,y), 'time =', end=' ')
print(Timer("dodumb()", "from __main__ import dodumb").timeit(100))
print('numpy distance =', get_shortest_in(x, lots), 'time =', end=' ')
print(Timer("donumpy()", "from __main__ import donumpy").timeit(100))

And here's what it prints:

haversine distance = 2293.13242188 time = 44.2363960743
dumb distance = 40.6034161104 time = 5.58199882507
numpy distance = 2293.13242188 time = 1.54996609688

The numpy method takes 1.55 seconds for the same number of distance calculations that take 44.24 seconds with your haversine function. You could probably get more of a speedup by combining some of the numpy functions into a single statement, but it would become a long, hard-to-read line.
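If you need to know which point is closest rather than only how far away it is, np.argmin gives the row index of the minimum. The following is a minimal sketch of that variation, assuming haystack is an (N, 2) array of (lat, lon) rows in degrees. The name closest_index and the sample coordinates are made up for illustration and are not part of the benchmark above.

```python
import numpy as np

EARTH_RADIUS_MILES = 3956.0

def closest_index(needle, haystack):
    """Return (row index, distance in miles) of the haystack row nearest to needle."""
    lat1, lon1 = np.radians(needle)
    lat2 = np.radians(haystack[:, 0])
    lon2 = np.radians(haystack[:, 1])
    # Vectorised haversine: one distance per haystack row
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    d = 2.0 * EARTH_RADIUS_MILES * np.arcsin(np.minimum(np.sqrt(a), 1.0))
    i = int(np.argmin(d))
    return i, float(d[i])

# Hypothetical sample data
haystack = np.array([
    (39.095962936305476, -121.2890625),  # California
    (38.0, -79.0),                       # close to the needle
    (51.5, -0.1),                        # London
])
needle = (37.160316546736745, -78.75)
idx, miles = closest_index(needle, haystack)
```

Calling this once per point in the small collection keeps the outer loop in Python but moves the 1.5 million inner iterations into numpy.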

This is great advice, and I appreciate it. Unfortunately, I forgot to mention I'm working with IronPython right now (no numpy), but I will file this away for future reference. — Wilduck, Jul 12 '11 at 13:32

score 7 · Accepted Answer · answered Jul 11 '11 at 21:01

You could consider some kind of spatial hashing, i.e. find the candidate closest points quickly and then do the distance calculation only on those. For example, you can create a uniform grid and distribute the points of the large collection into the bins created by the grid.

Now, given a point from the small collection, you only need to process a much smaller number of points (i.e. those in the relevant bins).
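A rough sketch of the grid idea in plain Python (no numpy, so it should also suit IronPython). The cell size of one degree and all function names are assumptions for illustration. The sketch ignores longitude wrap-around at ±180° and the fact that cells get narrower towards the poles, and if the 3x3 block of cells is empty you would have to widen the search.

```python
from collections import defaultdict
from math import floor

CELL_DEG = 1.0  # assumed cell size in degrees; tune to your point density

def cell_of(point, cell=CELL_DEG):
    """Map a (lat, lon) point to the integer coordinates of its grid cell."""
    lat, lon = point
    return (int(floor(lat / cell)), int(floor(lon / cell)))

def build_grid(points, cell=CELL_DEG):
    """Distribute the large collection into bins keyed by grid cell."""
    grid = defaultdict(list)
    for p in points:
        grid[cell_of(p, cell)].append(p)
    return grid

def candidates(grid, point, cell=CELL_DEG):
    """Collect the points in the 3x3 block of cells around point."""
    ci, cj = cell_of(point, cell)
    found = []
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            found.extend(grid.get((ci + di, cj + dj), []))
    return found

# Hypothetical sample data
large = [(37.5, -78.2), (37.1, -78.8), (39.0, -121.3), (10.0, 10.0)]
grid = build_grid(large)
near = candidates(grid, (37.160316546736745, -78.75))
```

The exact distance function then only has to run over near instead of the whole large collection.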

Drakosha


While this completely sidesteps the question I asked, I have to accept it, since it is what I ended up doing. Thanks for providing this perspective. – Wilduck Jul 12 '11 at 13:30

algorithmic optimisation is always the best answer - the grid suggestion is very good, a special case of the quadtree (or octree) spatial partitioning scheme, and one which is relatively easy to implement. – jheriko Jul 13 '11 at 01:23

score 2 · Answer 3 · edited Oct 10 '12 at 19:37

I had a similar problem and decided to knock up a Cython function. On my 2008 MBP it can do about 1.2M iterations per second. Taking the type checking out speeds it up by a further 25%. No doubt further optimisations are possible (at the expense of clarity).

You may also want to check out the scipy.spatial.distance.cdist function.

from libc.math cimport sin, cos, acos

def distance(float lat1, float lng1, float lat2, float lng2):
    if lat1 is None or lat2 is None or lng1 is None or lng2 is None: return None
    cdef float phi1
    cdef float phi2
    cdef float theta1
    cdef float theta2
    cdef float c
    cdef float arc

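    # 0.0174532925 is pi/180, i.e. degrees to radians.
    # phi is the polar angle (90 degrees minus latitude), theta the longitude.
    phi1 = (90.0 - lat1)*0.0174532925
    phi2 = (90.0 - lat2)*0.0174532925
    theta1 = lng1*0.0174532925
    theta2 = lng2*0.0174532925

    # Spherical law of cosines gives the cosine of the central angle
    c = (sin(phi1)*sin(phi2)*cos(theta1 - theta2) + cos(phi1)*cos(phi2))
    arc = acos( c )
    # 6371 is the mean Earth radius in km, so the result is in kilometres
    return arc*6371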
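For readers without Cython, the same spherical law of cosines formula can be sketched in pure Python. The function name distance_km is hypothetical, and the clamp on c is an addition of mine: rounding can push c slightly outside [-1, 1], where math.acos raises ValueError.

```python
from math import radians, sin, cos, acos

EARTH_RADIUS_KM = 6371.0

def distance_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km via the spherical law of cosines."""
    phi1 = radians(90.0 - lat1)
    phi2 = radians(90.0 - lat2)
    theta1 = radians(lng1)
    theta2 = radians(lng2)
    c = sin(phi1) * sin(phi2) * cos(theta1 - theta2) + cos(phi1) * cos(phi2)
    # Guard against rounding error before calling acos
    c = max(-1.0, min(1.0, c))
    return acos(c) * EARTH_RADIUS_KM

# The Virginia and California points used elsewhere in this thread
d = distance_km(37.160316546736745, -78.75, 39.095962936305476, -121.2890625)
```

Note that this returns kilometres, while the haversine functions in the question and the numpy answer use a radius in miles.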

score 2 · Answer 4 · answered Jul 11 '11 at 21:28

The formula you wrote (d=abs(lat2-lat1)+(lon2-lon1)) does NOT preserve the triangle inequality: if you find the lat, lon for which d is minimal, you don't find the closest point, but the point closest to two diagonal straight lines crossing at the point you are checking!

I think you should sort the large collection of points by lat and lon, i.e. (1,1), (1,2), (1,3), ..., (2,1), (2,2), etc. Then use the gunner method (i.e. bisection) to find some of the closest points in terms of latitude and longitude. This should be really fast: it takes CPU time proportional to log2(n), where n is the number of points. You can do this easily. For example, choose all the points in a 10x10 square around the point you are going to check: find all the points that are from -10 to +10 in lat (gunner method) and then those that are from -10 to +10 in lon (gunner method again). Now you have a really small amount of data to process, and it should be very fast!
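One way to sketch this with the standard library is the bisect module, which performs the binary search on a sorted list. This assumes the points are (lat, lon) tuples sorted in ascending order. The name points_in_box is hypothetical. Note one difference from the description above: with a single sort order only the latitude band can be found by bisection, so the longitude test here is a linear filter over that band.

```python
from bisect import bisect_left, bisect_right

def points_in_box(sorted_points, center, half_width):
    """Return the points within half_width degrees of center in both lat and lon.

    sorted_points must be a list of (lat, lon) tuples sorted ascending.
    """
    lat, lon = center
    # Binary search for the slice whose latitude lies in [lat - hw, lat + hw]
    lo = bisect_left(sorted_points, (lat - half_width, float("-inf")))
    hi = bisect_right(sorted_points, (lat + half_width, float("inf")))
    # Linear filter on longitude within that latitude band
    return [p for p in sorted_points[lo:hi] if abs(p[1] - lon) <= half_width]

# Hypothetical sample data
large = sorted([(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (40, 40)])
box = points_in_box(large, (1.5, 1.5), 1.0)
```

The exact distance only needs to be computed for the points in box. If box comes back empty, the window has to be widened and the search repeated.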

That is not the formula the OP wrote, and I think the triangle inequality actually _does_ hold under his version. — , Jul 12 '11 at 09:26

score 2 · Answer 5 · answered Jul 12 '11 at 02:27

abs(lat2 - lat1) + abs(lon2 - lon1) is the 1-norm or Manhattan-metric and thus the triangle inequality holds.

This is good to know. I actually didn't write this in my code, but it's clearly what I was going for. Thanks for the help. – Wilduck Jul 12 '11 at 13:23

@Wilduck: I actually had something about your formula not being quite a reinvention of the 1-norm in my first draft of the answer, but I edited it down for sarcasm… anyway, I /think/ your formula actually is a pseudo-metric: triangle inequality and symmetry hold but two distinct points can have a distance of 0. Too lazy to check, though. – Jul 12 '11 at 18:14
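A quick check of the "distance of 0" point, comparing the formula from the question with the Manhattan metric. The sample points are made up for illustration.

```python
def dumb(point1, point2):
    """The formula from the question: one abs() around the summed differences."""
    lat1, lon1 = point1
    lat2, lon2 = point2
    return abs((lat2 - lat1) + (lon2 - lon1))

def manhattan(point1, point2):
    """The 1-norm: abs() applied to each difference separately."""
    lat1, lon1 = point1
    lat2, lon2 = point2
    return abs(lat2 - lat1) + abs(lon2 - lon1)

origin = (0.0, 0.0)
near = (1.0, 1.0)
far = (50.0, -50.0)

dumb_near = dumb(origin, near)            # 2.0
dumb_far = dumb(origin, far)              # 0.0: the differences cancel
manhattan_near = manhattan(origin, near)  # 2.0
manhattan_far = manhattan(origin, far)    # 100.0
```

So with dumb, the far point looks closer to the origin than the near one, which makes it unusable for a nearest-point search even though it is fast.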

score 1 · Answer 6 · edited Feb 15 '23 at 19:01

The fastest way to do this is to avoid computing a function for each pair of points, assuming your relatively small collection isn't very tiny.

There are some databases with geo-indexes you could use (MySQL, Oracle, MongoDB, ...), or you could implement something yourself.

You could use python-geohash. For each doc in the smaller collection you need to quickly find the set of documents in the larger collection that share a hash from geohash.neighbors for the longest hash size that has matches. You'll need to use an appropriate data structure for the lookup or this will be slow.

For finding the distance between points, the error of the simple approach increases as the distance between the points increases and also depends on the latitude. See http://www.movable-type.co.uk/scripts/gis-faq-5.1.html for example.
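One common form of the simple approach is the equirectangular approximation: scale the longitude difference by the cosine of the mean latitude, then apply the Pythagorean theorem. Below is a minimal sketch, assuming you only need to rank candidates, so the square root and the Earth radius are left out. The name approx_dist_sq and the sample points are hypothetical, and crossing the ±180° longitude line is not handled.

```python
from math import radians, cos

def approx_dist_sq(point1, point2):
    """Squared equirectangular distance in radians squared, for ranking only."""
    lat1, lon1 = (radians(c) for c in point1)
    lat2, lon2 = (radians(c) for c in point2)
    # A degree of longitude shrinks by cos(latitude) away from the equator
    x = (lon2 - lon1) * cos((lat1 + lat2) / 2.0)
    y = lat2 - lat1
    return x * x + y * y

needle = (60.0, 10.0)
east = (60.0, 11.0)    # one degree of longitude away at 60 degrees north
north = (60.7, 10.0)   # 0.7 degrees of latitude away
closest = min([east, north], key=lambda p: approx_dist_sq(needle, p))
```

Without the cosine factor, plain Pythagoras on raw degrees would rank north (0.7) ahead of east (1.0), even though at this latitude the point to the east is the nearer one on the ground.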
