I have tables like these:
import pandas as pd
import numpy as np
df1 = pd.DataFrame([
    ['A', (37.55, 126.97)],
    ['B', (37.56, 126.97)],
    ['C', (37.57, 126.98)]
], columns=['STA_NM', 'COORD'])
df2 = pd.DataFrame([
    ['A-01', (37.57, 126.99)]
], columns=['ID', 'COORD'])
I'm trying to take each coordinate from df2, find the two closest stations (STA_NM) in df1 along with their distances to that coordinate, and store the result in a new column of df2. I tried the following code:
from heapq import nsmallest
from math import cos, asin, sqrt

def dist(x, y):
    # haversine distance in km; 0.01745... is pi/180, 12741 is twice the Earth's radius
    p = 0.017453292519943295
    a = 0.5 - cos((y[0] - x[0]) * p) / 2 + cos(x[0] * p) * cos(y[0] * p) * (1 - cos((y[1] - x[1]) * p)) / 2
    return 12741 * asin(sqrt(a))

def shortest(df, v):
    l_sta = []
    # get a list of coords
    l_coord = df['COORD'].tolist()
    # get the two nearest coordinates
    near_coord = nsmallest(2, l_coord, key=lambda p: dist(v, p))
    # look up each station name and its distance in metres
    for c in near_coord:
        l_sta.append((df.loc[df['COORD'] == c, 'STA_NM'].to_string(index=False), round(dist(c, v) * 1000)))
    # e.g.: [('A', 700), ('B', 1000)]
    return l_sta

df2['NEAR_STA'] = df2['COORD'].map(lambda x: shortest(df1, x))
In the original data, df1 has about 700 rows and df2 has about 55k rows. Running the code above took nearly two minutes. Is there a better way to make it faster?
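For reference, one direction would be computing all pairwise haversine distances at once with NumPy broadcasting, so the Python-level loop over stations disappears. This is only a sketch built on the sample frames above (np.argsort stands in for heapq.nsmallest, and the 12741 constant matches the function above):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame([
    ['A', (37.55, 126.97)],
    ['B', (37.56, 126.97)],
    ['C', (37.57, 126.98)]
], columns=['STA_NM', 'COORD'])
df2 = pd.DataFrame([
    ['A-01', (37.57, 126.99)]
], columns=['ID', 'COORD'])

# station and query coordinates as (n, 2) and (m, 2) arrays of radians
sta = np.radians(np.array(df1['COORD'].tolist()))
qry = np.radians(np.array(df2['COORD'].tolist()))

# haversine between every query point and every station:
# (m, 1) arrays broadcast against (n,) arrays give an (m, n) distance matrix in km
dlat = qry[:, 0:1] - sta[:, 0]
dlon = qry[:, 1:2] - sta[:, 1]
a = np.sin(dlat / 2) ** 2 + np.cos(qry[:, 0:1]) * np.cos(sta[:, 0]) * np.sin(dlon / 2) ** 2
d = 12741 * np.arcsin(np.sqrt(a))

# indices of the two nearest stations per query row
idx = np.argsort(d, axis=1)[:, :2]
names = df1['STA_NM'].to_numpy()[idx]
meters = np.round(np.take_along_axis(d, idx, axis=1) * 1000).astype(int)
df2['NEAR_STA'] = [list(zip(n, m)) for n, m in zip(names, meters)]
```

Broadcasting builds a full m-by-n matrix (55k x 700 floats here), which fits in memory but could be avoided for much larger inputs with a spatial index such as scipy.spatial.cKDTree on projected coordinates or scikit-learn's BallTree with its haversine metric.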