Context:
- I have one city dataset with coordinates (lat, long)
- I have several other datasets (hospitals, shops, ...), also with coordinates (lat, long)
My objective is to find, for each city, the closest (or the N closest) entries in each of the other datasets.
Code:
I defined a function to compute the Haversine distance:
from math import radians, sin, cos, asin, sqrt

def dist(lat1, long1, lat2, long2):
    # convert decimal degrees to radians
    lat1, long1, lat2, long2 = map(radians, [lat1, long1, lat2, long2])
    # haversine formula
    dlon = long2 - long1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    # mean radius of the Earth in kilometers
    km = 6371 * c
    return km
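A quick sanity check of the function, assuming coordinates are decimal degrees (the Paris and London coordinates below are just illustrative values):

```python
from math import radians, sin, cos, asin, sqrt

def dist(lat1, long1, lat2, long2):
    # Haversine distance in kilometers between two (lat, long) points in degrees
    lat1, long1, lat2, long2 = map(radians, [lat1, long1, lat2, long2])
    dlon = long2 - long1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    return 6371 * 2 * asin(sqrt(a))

# Paris to London is roughly 344 km
dist(48.8566, 2.3522, 51.5074, -0.1278)
```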
Now I use this function to find the nearest point:
def find_nearest(lat, long, recherche):
    distances = recherche.apply(
        lambda x: dist(lat, long, x['rech_lat'], x['rech_lon']),
        axis=1)
    return recherche.loc[distances.idxmin(), 'rech_id']
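A minimal usage sketch of this helper, assuming the search dataframe has the rech_id / rech_lat / rech_lon columns used above (the toy hospital data is made up):

```python
import pandas as pd
from math import radians, sin, cos, asin, sqrt

def dist(lat1, long1, lat2, long2):
    # Haversine distance in kilometers
    lat1, long1, lat2, long2 = map(radians, [lat1, long1, lat2, long2])
    dlon = long2 - long1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    return 6371 * 2 * asin(sqrt(a))

def find_nearest(lat, long, recherche):
    distances = recherche.apply(
        lambda x: dist(lat, long, x['rech_lat'], x['rech_lon']),
        axis=1)
    return recherche.loc[distances.idxmin(), 'rech_id']

# made-up sample data: one hospital near Paris, one near Lyon
hospital = pd.DataFrame({
    'rech_id': ['H1', 'H2'],
    'rech_lat': [48.85, 45.76],
    'rech_lon': [2.35, 4.84],
})

find_nearest(48.9, 2.4, hospital)  # a point near Paris -> 'H1'
```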
Which I call like this:
CITY['hospital_id'] = CITY.apply(lambda x: find_nearest(x['COM_LAT'], x['COM_LONG'], hospital), axis=1)
Problem:
Doing it this way, I have to pass the hospital dataframe on every call, and I am not sure it is very performant. I thought of referring to the dataframe by name with the eval function instead:
def find_nearest(lat, long, recherch):
    recherche = eval(recherch)
    distances = recherche.apply(
        lambda x: dist(lat, long, x['rech_lat'], x['rech_lon']),
        axis=1)
    return recherche.loc[distances.idxmin(), 'rech_id']

CITY['hospital_id'] = CITY.apply(lambda x: find_nearest(x['COM_LAT'], x['COM_LONG'], 'hospital'), axis=1)
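For comparison, the same nearest-neighbour search can also be written without any per-row apply at all, using NumPy broadcasting. This is only a sketch with made-up sample arrays; haversine_matrix is a hypothetical helper name, not something from the code above:

```python
import numpy as np

def haversine_matrix(lat1, lon1, lat2, lon2):
    # Pairwise Haversine distances in km: rows = points of set 1, columns = points of set 2.
    lat1, lon1, lat2, lon2 = map(np.radians, (np.asarray(lat1), np.asarray(lon1),
                                              np.asarray(lat2), np.asarray(lon2)))
    dlat = lat2[None, :] - lat1[:, None]
    dlon = lon2[None, :] - lon1[:, None]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat1)[:, None] * np.cos(lat2)[None, :] * np.sin(dlon / 2) ** 2)
    return 6371 * 2 * np.arcsin(np.sqrt(a))

# made-up sample data: two cities, two hospitals
city_lat, city_lon = [48.9, 45.8], [2.4, 4.8]
hosp_lat, hosp_lon = [48.85, 45.76], [2.35, 4.84]

d = haversine_matrix(city_lat, city_lon, hosp_lat, hosp_lon)  # shape (2, 2)
nearest = d.argmin(axis=1)  # column index of the closest hospital for each city
```

On a few thousand rows per frame this kind of broadcasting is typically far faster than a nested apply; for very large datasets, a spatial index such as sklearn.neighbors.BallTree with metric='haversine' avoids building the full n-by-m matrix.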
Is this better? The computation is still slow. Do you know how I can improve it further?
Thanks for your answers