
I am currently trying some geocoding in Python. The setup is the following: I have two data frames (df1 and df2, houses and schools) with latitude and longitude values, and I want to find the nearest neighbour in df2 for every observation in df1. I use the following code:

from tqdm import tqdm
import numpy as np
import pandas as pd
import math 

def distance(lat1, long1, lat2, long2):
    R = 6371  # Earth radius in km
    dLat = math.radians(lat2 - lat1)  # convert degrees to radians
    dLong = math.radians(long2 - long1)
    lat1 = math.radians(lat1)
    lat2 = math.radians(lat2)
    a = math.sin(dLat/2) * math.sin(dLat/2) + math.sin(dLong/2) * math.sin(dLong/2) * math.cos(lat1) * math.cos(lat2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1-a))
    d = R * c
    return d

dists = []
schools = []
for index, row1 in tqdm(df1.iterrows()):
    for index, row2 in df2.iterrows():
        dists.append(distance(row1.lat, row1.lng, row2.Latitude, row2.Longitude))
    schools.append(min(dists))
    del dists[:]

df1["school"] = pd.Series(schools)

The code works, but it takes ages. With tqdm I get an average speed of 2 iterations of df1 per second. For comparison, I did the whole task in STATA with geonear and it takes 1 second for all observations in df1 (950). I read in the help file of geonear that it uses clustering so that it does not calculate all distances, only the closest ones. However, before I add a clustering function (which might also take CPU power), I wonder if someone sees a way to speed up the process as it is (I am new to Python and might have some inefficient code that slows things down). Or is there maybe a package that does this faster?

I would be ok if it takes longer than in STATA, but not nearly 7 minutes...

Thank you in advance

user27074

2 Answers


The way you're doing this is slow because you are using an O(n²) algorithm: each row looks at every other row. Georgy's answer, while introducing vectorization, does not solve this fundamental inefficiency.

I'd recommend loading your data points into a kd-tree: this data structure provides a fast way of finding nearest neighbours in multiple dimensions. Building such a tree takes O(n log n) time and each query takes O(log n), so querying a nearest neighbour for all n points, plus building the tree, is O(n log n) in total.

If your data is localized to a geographic region that can be well-approximated by a plane, project your data and then perform the lookup in two dimensions. Otherwise, if your data is globally dispersed, project it into 3D Cartesian coordinates on the sphere and perform the look-up there (a sketch of the planar variant follows the spherical example below).

An example of how you might do this appears as follows:

#!/usr/bin/env python3

import numpy as np
import scipy as sp
import scipy.spatial

Rearth = 6371

#Generate uniformly-distributed lon-lat points on a sphere
#See: http://mathworld.wolfram.com/SpherePointPicking.html
def GenerateUniformSpherical(num):
  #Generate random variates
  pts      = np.random.uniform(low=0, high=1, size=(num,2))
  #Convert to sphere space
  pts[:,0] = 2*np.pi*pts[:,0]          #0-360 degrees
  pts[:,1] = np.arccos(2*pts[:,1]-1)   #0-180 degrees
  #Convert to degrees
  pts = np.degrees(pts)
  #Shift ranges to lon-lat
  pts[:,0] -= 180
  pts[:,1] -= 90
  return pts

def ConvertToXYZ(lonlat):
  theta  = np.radians(lonlat[:,0])+np.pi
  phi    = np.radians(lonlat[:,1])+np.pi/2
  x      = Rearth*np.cos(theta)*np.sin(phi)
  y      = Rearth*np.sin(theta)*np.sin(phi)
  z      = Rearth*np.cos(phi)
  return np.transpose(np.vstack((x,y,z)))

#For each entry in qpts, find the nearest point in the kdtree
def GetNearestNeighbours(qpts,kdtree):
  pts3d        = ConvertToXYZ(qpts)
  #See: https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.KDTree.query.html#scipy.spatial.KDTree.query
  #p=2 implies Euclidean distance, eps=0 implies no approximation (slower)
  return kdtree.query(pts3d,p=2,eps=0) 

#Generate uniformly-distributed test points on a sphere. Note that you'll want
#to find a way to extract your pandas columns into an array of width=2, height=N
#to match this format.
df1 = GenerateUniformSpherical(10000)
df2 = GenerateUniformSpherical(10000)

#Convert df2 into XYZ coordinates. WARNING! Do not alter df2_3d or kdtree will
#malfunction!
df2_3d = ConvertToXYZ(df2)
#Build a kd-tree from df2_3D
kdtree = sp.spatial.KDTree(df2_3d, leafsize=10) #Stick points in kd-tree for fast look-up

#Return the distance to, and index of, each of df1's nearest neighbour points
distance, indices = GetNearestNeighbours(df1,kdtree)
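
If instead your data covers only a small region, as mentioned above, you can skip the spherical conversion and query in two dimensions. The following is only a rough sketch of that variant, not part of the code above: the equirectangular projection and the use of scipy.spatial.cKDTree (the faster C implementation with the same query interface) are assumptions of mine, and np, sp, Rearth, df1 and df2 are reused from the script above.

#Rough sketch of the planar variant. The equirectangular projection is only
#reasonable for small regions away from the poles.
def ProjectToPlane(lonlat, lat0):
  lon = np.radians(lonlat[:,0])
  lat = np.radians(lonlat[:,1])
  x   = Rearth*lon*np.cos(np.radians(lat0)) #Scale longitude by cos(reference latitude)
  y   = Rearth*lat
  return np.column_stack((x,y))

lat0     = df2[:,1].mean()                  #Reference latitude for the projection
kdtree2d = sp.spatial.cKDTree(ProjectToPlane(df2, lat0))
dist2d, idx2d = kdtree2d.query(ProjectToPlane(df1, lat0))
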
Richard
  • @Georgy: Vectorization in Python is often the key component to good performance. Your code taps into this source of performance in a simple way for the OP. It'll probably work very well for smaller datasets, especially since it has a smaller construction cost than the kd-tree. At some point, though, the relative time complexities will dominate and the gorilla will strike :-) – Richard Feb 05 '18 at 19:00
  • @Georgy: I've uploaded some code showing how the kd-tree might be used. – Richard Feb 05 '18 at 19:53
  • @Richard I tried Georgy's solution as it looked a bit shorter. However, as mentioned above, I will probably have to come back to your solution, as my actual dataset contains more than 3 million observations. Thank you, also for the link. Do you maybe have more to read on efficient programming like this? – user27074 Feb 06 '18 at 08:36
  • @user27074: I recommend reading the first half of Skiena's "Algorithm Design Manual" several times. It's a good introduction to algorithms. After that, just learn all the algorithms you can. – Richard Feb 07 '18 at 21:48

The key to efficiency with pandas is performing operations on a whole dataframe/series, not going row by row. So, let's do that.

for index, row1 in tqdm(df1.iterrows()):
    for index, row2 in df2.iterrows():

Here you calculate the Cartesian product of the two dataframes. This can be done much faster like this:

df_product = pd.merge(df1.assign(key=0, index=df1.index), 
                      df2.assign(key=0), 
                      on='key').drop('key', axis=1)

(Code was taken from here.) I also added a column with the indices of df1; we will need it later to calculate the minimum of the distances for every entity from df1.


Now we calculate all the deltas, the latitudes in radians, a, c, and the distances in a vectorized manner using NumPy:

R = 6371  # Earth radius in km, as in the question
dLat = np.radians(df_product['Latitude'] - df_product['lat'])
dLong = np.radians(df_product['Longitude'] - df_product['lng'])
lat1 = np.radians(df_product['lat'])
lat2 = np.radians(df_product['Latitude'])
a = (np.sin(dLat / 2) ** 2 
     + (np.sin(dLong / 2) ** 2) * np.cos(lat1) * np.cos(lat2))
c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
df_product['d'] = R * c
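
As a quick sanity check (just a sketch, assuming the distance function from the question is still defined), you can compare one vectorized value against the scalar version:

row = df_product.iloc[0]
print(np.isclose(row['d'], distance(row['lat'], row['lng'], row['Latitude'], row['Longitude'])))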

Now, from df_product we keep only the index column that we added before and the column with the distances. We group the distances by index, calculate the corresponding minimum values and assign them to df1['schools'], as you did in your code.

df1['schools'] = df_product.loc[:, ['index', 'd']].groupby('index', axis=0).min()

This is it. For 1000 rows in each dataframe everything takes less than a second for me.
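
If the cross join becomes too big to fit in memory on larger datasets, one possible workaround (just a sketch; the chunk size and the helper name are arbitrary choices of mine, and the column names are the ones from your dataframes) is to process df1 in chunks and keep only the per-chunk minima:

def nearest_school_distances(df1, df2, chunk_size=1000):
    # Distance in km from each row of df1 to its nearest row of df2, computed
    # chunk by chunk so that only chunk_size * len(df2) distances are in memory at once.
    lat2 = np.radians(df2['Latitude'].values)
    lng2 = np.radians(df2['Longitude'].values)
    result = np.empty(len(df1))
    for start in range(0, len(df1), chunk_size):
        chunk = df1.iloc[start:start + chunk_size]
        lat1 = np.radians(chunk['lat'].values)[:, None]  # shape (m, 1)
        lng1 = np.radians(chunk['lng'].values)[:, None]
        dLat = lat2 - lat1                               # broadcasts to shape (m, n)
        dLong = lng2 - lng1
        a = (np.sin(dLat / 2) ** 2
             + (np.sin(dLong / 2) ** 2) * np.cos(lat1) * np.cos(lat2))
        d = 2 * R * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
        result[start:start + chunk_size] = d.min(axis=1)
    return result

df1['schools'] = nearest_school_distances(df1, df2)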

Georgy
  • @Richard Yeah, well.. For so many rows there will be a memory error. – Georgy Feb 05 '18 at 18:31
  • @Richard I am not aware of the best approach to avoid the `MemoryError` in such cases. Either process the data in chunks or avoid pandas altogether and use generators. I have never encountered problems like this myself before. So, I hope OP will be satisfied with my solution. – Georgy Feb 05 '18 at 18:36
  • pandas should be able to handle a million rows easily, though it may choke on the merge, since that could give an _O(N²)_ expansion in memory or time if it's implemented poorly in pandas. What I wanted to point out, indirectly, is that your solution is still operating in _O(N²)_ time, just more efficiently than OP's. This means that it will scale poorly regardless of memory limitations. – Richard Feb 05 '18 at 18:44
  • @Georgy Thank you very much! I used this code and it is indeed way faster. However, my idea was to program the code on a subsample (950 houses x 5300 schools) and later use it on my actual data (3 million houses x 10000 schools). So I will also try Richard's solution later (I still have to clean the other dataset). Since I am learning Python mostly through online sources, which are mostly oriented towards applications and not performance, I am curious whether you have any sources to read on this topic (e.g. why this is faster than my original approach). Thank you! – user27074 Feb 06 '18 at 08:27
  • @user27074 Whatever I learnt about NumPy was from Stack Overflow and the NumPy docs, so I can't really give any advice here. But I know that there is a book called 'High Performance Python'. Maybe it can be of help to you. – Georgy Feb 06 '18 at 10:50