4

I have tables like these:

import pandas as pd
import numpy as np


df1 = pd.DataFrame([
    ['A', (37.55, 126.97)],
    ['B', (37.56, 126.97)],
    ['C', (37.57, 126.98)]
], columns=['STA_NM', 'COORD'])

df2 = pd.DataFrame([
    ['A-01', (37.57, 126.99)]
], columns=['ID', 'COORD'])

I'm trying to pick each coordinates from df2 and find the two closest stations(STA_NM) and their distances to each coordinates from df1, then add them to a new column of df2. I tried following codes:

from heapq import nsmallest
from math import cos, asin, sqrt


def dist(x, y):
    p = 0.017453292519943295
    a = 0.5 - cos((y[0] - x[0]) * p) / 2 + cos(x[0] * p) * cos(y[0] * p) * (1 - cos((y[1] - x[1]) * p)) / 2
    return 12741 * asin(sqrt(a))

def shortest(df, v):
    l_sta = []
    
    # get a list of coords
    l_coord = df['COORD'].tolist()
    
    # get the two nearest coordinates
    near_coord = nsmallest(2, l_coord, key=lambda p: dist(v, p))

    # find station names
    l_sta.append((df.loc[df['COORD'] == near_coord[0], 'STA_NM'].to_string(index=False), round(dist(near_coord[0], v) * 1000)))
    l_sta.append((df.loc[df['COORD'] == near_coord[1], 'STA_NM'].to_string(index=False), round(dist(near_coord[1], v) * 1000)))
    
    # e.g.: [('A', 700), ('B', 1000)]
    return l_sta

df2['NEAR_STA'] = df2['COORD'].map(lambda x: shortest(df1, x))

In original data, df1 has about 700 rows, and df2 has about 55k rows. When I tried above codes, it took near two minutes. Is there any better way to make it faster?

vuvugelato
  • 65
  • 5
  • 1
    The "fastest way" might be a lot more advanced., but the question has already been dealt with in 3d points at least: https://stackoverflow.com/questions/4350215/what-is-the-fastest-way-to-find-the-closest-point-to-a-given-point – Andrew Allaire Sep 16 '20 at 14:26
  • 1
    I believe SciPy has [`scipy.spatial.KDTree`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.html) for situations like this. – Aaron Keesing Sep 16 '20 at 14:39
  • 1
    Also recalling there is RTree that might help, but I am not certain: https://pypi.org/project/Rtree/ – Andrew Allaire Sep 16 '20 at 14:41

1 Answers1

0

You could convert the lat/lon coordinates to earth-centered, earth-fixed (ECEF) coordinates (lat and lon become x/y/z from the earth's core) before doing the distance calculation. That would make your dist function faster, since it would become a single euclidean distance calculation.

You could also ditch the dataframe/lambda approach and use cython or numba to speed this up significantly.

There is also opportunity to speed things up if you know what the spatial distribution of your stations looks like. For example, if they are on a regular grid, then you only have to look at the four neighboring stations. If you know that there are usually at least 2 stations within some distance from another, then you only need to search within that radius. If you have no such prior information then sorry no tricks.

Graham S
  • 1,642
  • 10
  • 12