For large datasets in Python, how do I find the nearest location using longitude and latitude?

Question

I have a pandas DataFrame containing 500,000 (!) rows (locations) and two columns:

Longitude
Latitude

Now I want a third column:

Nearest location

This column should tell me which row/location is nearest to the 'current' row/location.

I know you can find the distance between two lon/lat pairs using, for example, cdist from scipy.spatial.distance. However, this takes too much time, since it has to compute 500,000 * 500,000 distances (the distance to every location, for every location).

Does anyone know an appropriate way to deal with this?
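
One possible way to avoid the all-pairs comparison described above is a spatial index. The sketch below is only an illustration, not a confirmed solution from this thread: it assumes the columns are named Longitude and Latitude and hold degrees, that the frame has at least two rows, and that a spherical Earth with a 6371 km radius is accurate enough. The function name add_nearest_location and the extra "Distance (km)" column are hypothetical. It projects each point onto the unit sphere and queries scipy.spatial.cKDTree, which works because straight-line distance between points on a sphere grows with great-circle distance.

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree


def add_nearest_location(df, lon_col="Longitude", lat_col="Latitude"):
    """Return a copy of df with the index label of each row's nearest other row."""
    lon = np.radians(df[lon_col].to_numpy(dtype=float))
    lat = np.radians(df[lat_col].to_numpy(dtype=float))

    # Project every point onto the unit sphere. The chord (straight-line)
    # distance between two such points increases with their great-circle
    # distance, so both give the same nearest neighbour.
    xyz = np.column_stack((
        np.cos(lat) * np.cos(lon),
        np.cos(lat) * np.sin(lon),
        np.sin(lat),
    ))

    tree = cKDTree(xyz)
    # k=2 because the closest hit for a point is normally the point itself.
    chord, idx = tree.query(xyz, k=2)

    # If the first hit is the row itself, take the second hit; otherwise the
    # first hit is a different row with identical coordinates, so keep it.
    own = np.arange(len(df))
    first_is_self = idx[:, 0] == own
    nearest = np.where(first_is_self, idx[:, 1], idx[:, 0])
    nearest_chord = np.where(first_is_self, chord[:, 1], chord[:, 0])

    out = df.copy()
    out["Nearest location"] = df.index.to_numpy()[nearest]
    # Convert chord length on the unit sphere to great-circle distance in km
    # (6371 km is an assumed mean Earth radius).
    half_chord = np.clip(nearest_chord / 2.0, 0.0, 1.0)
    out["Distance (km)"] = 2.0 * np.arcsin(half_chord) * 6371.0
    return out
```

Calling add_nearest_location(df) on the original frame returns a copy with the new columns. Building the tree and querying it scales roughly as n log n rather than n squared, so 500,000 rows should be far more manageable than a full cdist matrix, though actual run time depends on the data and the machine.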

Look up "sorting spatial data" or so on Google. It's out of scope for SO, but certainly a worthwhile and interesting subject. — Mad Physicist, Feb 25 '20 at 15:53
Why does it need to be 500,000 * 500,000? Aren't you trying to find the nearest location from those 500,000 to some reference location? — roganjosh, Feb 25 '20 at 15:56
Yes, so if you try to find the nearest location for one location, you need to loop through the dataset 500.000 times. However, I need to do this for every location in the data set. So: 500.000 * 500.000 ? — LVDW, Feb 25 '20 at 15:59
Does this answer your question? [Fast Haversine Approximation (Python/Pandas)](https://stackoverflow.com/questions/29545704/fast-haversine-approximation-python-pandas) — roganjosh, Feb 25 '20 at 16:01
I understand that link is also going to be slow, but it's the best definitive approach I am aware of. Beyond that, I'm not sure it can be answered in the scope of SO — roganjosh, Feb 25 '20 at 16:11
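
For reference, the haversine distance that the linked question is about can be sketched as below. This is the plain textbook formula for a single pair of points, not the approximation discussed in that thread. The function name haversine_km and the 6371 km mean Earth radius are assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two lon/lat points given in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    # Haversine of the central angle between the two points.
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

Applied pairwise to every row, this still needs one evaluation per pair of locations, which is why it does not remove the 500,000 * 500,000 cost on its own.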


0 Answers