0

I have a dataframe with tuples of latitudes and longitudes as below (sample of actual coordinates):

    id    latlon             
67  79    (39.1791764701497, -96.5772313693982)
68  17    (39.1765194942359, -96.5677757455844)
69  76    (39.1751440428827, -96.5772939901891)
70  58    (39.175359525189, -96.5691986655256)
71  50    (39.1770962912298, -96.5668107589661)

I want to find the id and the distance of the nearest latlon in the same dataframe (for illustration, I'm just making up numbers below in nearest_id and nearest_dist columns):

    id    latlon                                  nearest_id  nearest_dist
67  79    (39.1791764701497, -96.5772313693982)   17          37          
68  17    (39.1765194942359, -96.5677757455844)   58          150           
69  76    (39.1751440428827, -96.5772939901891)   50          900          
70  58    (39.175359525189, -96.5691986655256)    17          12          
71  50    (39.1770962912298, -96.5668107589661)   79          4      

I have a large number (45K+) of coordinates on which I want to perform this operation.

Here is my attempted solution below, using great_circle from geopy.distances:

def great_circle_dist(latlon1, latlon2):
    """Uses geopy to calculate distance between coordinates"""
    return great_circle(latlon1, latlon2).meters

def find_nearest(x):
        """Finds nearest neighbor """
        df['distances'] = df.latlon.apply(great_circle_dist, args=(x,))
        df_sort = df.sort_values(by='distances')
        return (df_sort.values[1][0], df_sort.values[1][2])

df['nearest'] = df['latlon'].apply(find_nearest)
df['nearest_id'] = df.nearest.apply(lambda x: x[0])
df['nearest_dist'] = df.nearest.apply(lambda x: x[1])
del df['nearest']
del df['distances']

What can be done to make this calculation efficiently?

quantif
  • 166
  • 3
  • 12

3 Answers3

2

'scipy.spatial' has many useful (and extremely fast) algorithm for spatial search. The one that seems to be the right tool for your problem is 'cKDTree'.

tree = cKDTree(data)

Data should be a numpy array of shape n*2 (it can calculate distance in n dimentional space, but in this case we have two dimensions)

Then you can query the tree for k nearest neighbors:

dist, idx = tree.query(x, k=1)

Using the index, it should be trivial to get the id. I answered a similar question here. Also check out comments for information about projection.

Alz
  • 755
  • 6
  • 11
  • Would using `cKDTree` here assume that the input is cartesian coordinates? – quantif Aug 30 '17 at 01:26
  • @JosephDasenbrock yes. You can either project the coordinates from lon/lat to UTM (or any other projection suitable for measurement) using 'pyproj', or use great-circle or even better, haversine formula as a custom distance metric with scipy.spatial.distance.. The second approach is explained in another solution to the [same question](https://stackoverflow.com/a/45807448/6517541) – Alz Aug 30 '17 at 09:12
  • Is the `cKDTree` 100% accurate or is it a search algo which prioritizes speed over complete accuracy? – quantif Aug 31 '17 at 16:16
  • @JosephDasenbrock It is exact, it's not a heuristic. If you wish, you can specify 'eps' argument in 'query()' to be positive, which speed up the search but return approximate neighbors. Default is zero, which means exact neighbors. – Alz Aug 31 '17 at 17:07
1

You can do this with PostGIS/PostgreSQL efficiently but then you would have to get your data into an sql table which might be difficult. You can issue postgresql commands from python but you still have to set up the backend. Hopefully someone will be able to give you tips on how to use this just using python.

Bootstrap
  • 103
  • 7
1

Spatial indexing should help.

You can achieve spatial indexing using a database (e.g. Postgres with PosGIS extension), but you can also have an in-memory solution.

Have a look at the Rtree library. You will need to create an index, add all your points to the index, and then query the index using the nearest method.

daphshez
  • 9,272
  • 11
  • 47
  • 65