I have a pandas DataFrame containing 500,000 (!) rows (locations) and two columns:

  • Longitude
  • Latitude

Now I want a third column:

  • Nearest location

This column should tell me which row/location is nearest to the 'current' row/location.

I know you can find the distance between two lon/lat pairs using, for example, cdist from scipy.spatial.distance. However, this takes too much time: to find the distance from every location to every other location, it has to evaluate 500,000 * 500,000 pairs (about 2.5 * 10^11 distances).

Does anyone know an appropriate way to deal with this?

LVDW
  • Look up "sorting spatial data" or so on Google. It's out of scope for SO, but certainly a worthwhile and interesting subject. – Mad Physicist Feb 25 '20 at 15:53
  • Why does it need to be 500,000 * 500,000? Aren't you trying to find the nearest location from those 500,000 to some reference location? – roganjosh Feb 25 '20 at 15:56
  • Yes, so to find the nearest location for one location, you need to compare it against all 500,000 rows. However, I need to do this for every location in the data set. So: 500,000 * 500,000? – LVDW Feb 25 '20 at 15:59
  • Does this answer your question? [Fast Haversine Approximation (Python/Pandas)](https://stackoverflow.com/questions/29545704/fast-haversine-approximation-python-pandas) – roganjosh Feb 25 '20 at 16:01
  • I understand that link is also going to be slow, but it's the best definitive approach I am aware of. Beyond that, I'm not sure it can be answered in the scope of SO – roganjosh Feb 25 '20 at 16:11
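
One possible way to avoid the full pairwise comparison is a spatial index, in the spirit of the "sorting spatial data" hint in the comments. The sketch below is only an illustration under stated assumptions, not a confirmed solution from this thread: it assumes the columns are named `Longitude` and `Latitude` and hold degrees, it converts each point to a 3D unit vector so that straight-line (chord) distance ranks neighbours in the same order as great-circle distance, and it uses `scipy.spatial.cKDTree` to look up the two closest points per row (the row itself and its nearest other row). The function name `add_nearest_location` and the `Distance km` column are hypothetical names introduced for this example.

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; the sphere model is an assumption


def add_nearest_location(df, lon_col="Longitude", lat_col="Latitude"):
    """Return a copy of df with the index label of each row's nearest other row.

    Hypothetical helper: column names and output columns are assumptions.
    Requires at least two rows.
    """
    lon = np.radians(df[lon_col].to_numpy(dtype=float))
    lat = np.radians(df[lat_col].to_numpy(dtype=float))

    # Points on the unit sphere: chord length grows monotonically with
    # great-circle distance, so the Euclidean nearest neighbour is also
    # the nearest neighbour on the sphere.
    xyz = np.column_stack((
        np.cos(lat) * np.cos(lon),
        np.cos(lat) * np.sin(lon),
        np.sin(lat),
    ))

    tree = cKDTree(xyz)
    # k=2 because the closest hit for every point is normally the point itself.
    chord, idx = tree.query(xyz, k=2)

    # With duplicate coordinates the first hit may be another row, so pick
    # whichever of the two hits is not the row itself.
    rows = np.arange(len(df))
    first_is_self = idx[:, 0] == rows
    nearest = np.where(first_is_self, idx[:, 1], idx[:, 0])
    nearest_chord = np.where(first_is_self, chord[:, 1], chord[:, 0])

    out = df.copy()
    out["Nearest location"] = df.index.to_numpy()[nearest]
    # Convert chord length on the unit sphere back to arc length in km.
    out["Distance km"] = 2 * EARTH_RADIUS_KM * np.arcsin(
        np.clip(nearest_chord / 2, 0.0, 1.0)
    )
    return out


if __name__ == "__main__":
    demo = pd.DataFrame({
        "Longitude": [4.9041, 4.4777, 2.3522, 2.1301],
        "Latitude": [52.3676, 51.9244, 48.8566, 48.8014],
    })
    print(add_nearest_location(demo))
```

Building the tree and querying it avoids materialising the 500,000 * 500,000 distance matrix that `cdist` would produce; how fast it is on the real data would need to be measured. `sklearn.neighbors.BallTree` with `metric="haversine"` would be an alternative if scikit-learn is available.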
