I am currently trying some geocoding in Python. The setup is as follows: I have two data frames (df1 and df2, houses and schools) with latitude and longitude values, and I want to find the nearest neighbour in df2 for every observation in df1. I use the following code:
from tqdm import tqdm
import numpy as np
import pandas as pd
import math
def distance(lat1, long1, lat2, long2):
    R = 6371  # Earth radius in km
    dLat = math.radians(lat2 - lat1)  # convert degrees to radians
    dLong = math.radians(long2 - long1)
    lat1 = math.radians(lat1)
    lat2 = math.radians(lat2)
    a = math.sin(dLat/2) * math.sin(dLat/2) + math.sin(dLong/2) * math.sin(dLong/2) * math.cos(lat1) * math.cos(lat2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    d = R * c
    return d
dists = []
schools = []
for index1, row1 in tqdm(df1.iterrows()):
    for index2, row2 in df2.iterrows():
        dists.append(distance(row1.lat, row1.lng, row2.Latitude, row2.Longitude))
    schools.append(min(dists))
    del dists[:]
df1["school"] = pd.Series(schools)
The code works, but it takes ages: with tqdm I get an average speed of about 2 iterations of df1 per second. For comparison, I did the whole task in STATA with geonear, and it takes 1 second for all 950 observations in df1. I read in the help file of geonear that it uses clustering, so that it does not calculate all distances, only the closest ones. However, before I add a clustering step of my own (which would also cost CPU time), I wonder whether someone sees a way to speed up the process as it stands (I am new to Python and may have some inefficient code that slows things down). Or is there perhaps a package that does this faster?
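One idea along the lines of what geonear apparently does, sketched here under the assumption that SciPy is acceptable: build a KD-tree over the schools so each house only queries nearby candidates instead of all of them. Converting lat/lng to 3-D unit-sphere coordinates makes the Euclidean nearest neighbour coincide with the great-circle nearest neighbour (the function names are my own):

```python
import numpy as np
from scipy.spatial import cKDTree

def to_xyz(lat_deg, lng_deg):
    """Convert latitude/longitude in degrees to 3-D coordinates on the unit sphere."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lng = np.radians(np.asarray(lng_deg, dtype=float))
    return np.column_stack((np.cos(lat) * np.cos(lng),
                            np.cos(lat) * np.sin(lng),
                            np.sin(lat)))

def nearest_school_kdtree(df1, df2, R=6371.0):
    """Return (distance_km, index_into_df2) of the nearest school for each house."""
    tree = cKDTree(to_xyz(df2["Latitude"], df2["Longitude"]))
    # query() returns the straight-line chord length on the unit sphere
    # plus the index of the nearest point in the tree.
    chord, idx = tree.query(to_xyz(df1["lat"], df1["lng"]))
    # Chord length 2*sin(theta/2) -> central angle theta -> great-circle distance.
    return R * 2 * np.arcsin(chord / 2), idx
```

Tree construction is O(n log n) and each query is roughly O(log n), so this avoids the all-pairs distance matrix entirely.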
I would be OK with it taking longer than in STATA, but not nearly 7 minutes...
Thank you in advance