This is an old question, but since it was edited this morning, hopefully this answer may still be useful to somebody.
What comes to mind is a nearest neighbor search, to match rows that are closest and within some tolerance. Of course, the distance of latitude, longitude is a bit tricky. A distance of 1 degree longitude at the equator is much larger than a distance of 1 degree close to the poles. Furthermore, crossing from +180 to -180 in longitude can be a very small actual Euclidean distance, despite the large difference of longitudes.
For these reasons, we could do the following:
- Compute 3D coordinates for each
lat, lon
position.
- Use
scipy.spatial.KDTree
to efficiently find neighbors.
But first, let's write a function to generate minimal reproducible examples:
def gen(n, dlatlon=(1, 1)):
# latitude goes from -90 to 90 degrees, longitude from -180 to 180
# here we don't care about exceeding those limits -- 2xyz will take
# care of it
upper = np.array([90, 180])
latlon0 = np.random.uniform(-1, 1, (n, 2)) * upper
latlon1 = latlon0 + np.random.normal(size=(n, 2)) * dlatlon
i = np.arange(n)
df0 = pd.DataFrame(latlon0, columns=['lat', 'lon']).assign(a=i)
df1 = pd.DataFrame(latlon1, columns=['lat', 'lon']).assign(b=i)
return df0, df1
Example:
np.random.seed(0) # reproducible example
df0, df1 = gen(6)
>>> df0
lat lon a
0 8.786431 77.468172 0
1 18.497408 16.157946 1
2 -13.742136 52.521881 2
3 -11.234302 141.038280 3
4 83.459297 -41.961053 4
5 52.510507 10.402171 5
>>> df1
lat lon b
0 9.547468 77.589847 0
1 18.941271 16.491620 1
2 -12.248057 52.316722 2
3 -10.921234 140.184185 3
4 80.906307 -41.307435 4
5 53.374943 9.660006 5
Now, let us write a conversion to 3D (similar to my answer here):
R_earth = 6371 # in km, ignoring flattening
def latlon2xyz(latlon, R=R_earth):
latlon = latlon.to_numpy() if isinstance(latlon, pd.DataFrame) else latlon
lat, lon = np.deg2rad(latlon).T
# conversion (latitude, longitude, altitude) to (x, y, z)
# see https://stackoverflow.com/a/10788250/758174
# we use alt and f = 0 -> F = 1, S = C
coslat = np.cos(lat)
sinlat = np.sin(lat)
C = 1 / np.sqrt(coslat**2 + sinlat**2)
x = C * coslat * np.cos(lon)
y = C * coslat * np.sin(lon)
z = C * sinlat
return R * np.c_[x, y, z]
def append3d(df):
return df.assign(**dict(zip('xyz', latlon2xyz(df[['lat', 'lon']]).T)))
On our example data above:
>>> append3d(df0)
lat lon a x y z
0 8.786431 77.468172 0 1366.168859 6146.229829 973.181661
1 18.497408 16.157946 1 5803.197072 1681.366619 2021.274608
2 -13.742136 52.521881 2 3765.522822 4911.207155 -1513.447441
3 -11.234302 141.038280 3 -4858.951861 3929.329400 -1241.208394
4 83.459297 -41.961053 4 539.640845 -485.230993 6329.532340
5 52.510507 10.402171 5 3813.764098 700.106078 5055.165267
>>> append3d(df1)
lat lon b x y z
0 9.547468 77.589847 0 1350.216201 6135.950784 1056.723796
1 18.941271 16.491620 1 5778.118840 1710.637588 2068.019033
2 -12.248057 52.316722 2 3805.920396 4927.257031 -1351.572821
3 -10.921234 140.184185 3 -4804.978175 4005.604427 -1207.045530
4 80.906307 -41.307435 4 756.386071 -664.675318 6290.924243
5 53.374943 9.660006 5 3746.893184 637.776657 5113.088440
Then, we can use KDTree
to find nearest neighbors, and restrict those neighbors to within a given maximum Euclidean distance r
, corresponding to a given deviation in latitude and longitude (at the equator):
from scipy.spatial import KDTree
def latlon_merge(df0, df1, r=100):
# r: maximum distance, in km, between two points for them
# to be considered close neighbors
kd = KDTree(latlon2xyz(df0[['lat', 'lon']]))
dist, idx = kd.query(latlon2xyz(df1[['lat', 'lon']]), distance_upper_bound=r)
key = '_ix_'
assert key not in df0.columns
assert key not in df1.columns
df = pd.merge(
df0.assign(**{key: np.arange(df0.shape[0])}),
df1.assign(**{key: idx}),
'outer', on=key)
return df.drop(key, axis=1)
Note that this difference r
for 2 degrees in both latitude and longitude, at the equator, is about 314.5 km:
r = np.linalg.norm(np.subtract(*latlon2xyz(np.array(((-1, -1), (1, 1))))))
>>> r
314.4668312040188
On our example data:
>>> latlon_merge(df0, df1, r)
lat_x lon_x a lat_y lon_y b
0 8.786431 77.468172 0 9.547468 77.589847 0
1 18.497408 16.157946 1 18.941271 16.491620 1
2 -13.742136 52.521881 2 -12.248057 52.316722 2
3 -11.234302 141.038280 3 -10.921234 140.184185 3
4 83.459297 -41.961053 4 80.906307 -41.307435 4
5 52.510507 10.402171 5 53.374943 9.660006 5
In this case, we merged all the correct rows together. But what if we made the tolerance r
small enough that some of the rows above won't match?
>>> latlon_merge(df0, df1, r=150)
lat_x lon_x a lat_y lon_y b
0 8.786431 77.468172 0.0 9.547468 77.589847 0.0
1 18.497408 16.157946 1.0 18.941271 16.491620 1.0
2 -13.742136 52.521881 2.0 NaN NaN NaN
3 -11.234302 141.038280 3.0 -10.921234 140.184185 3.0
4 83.459297 -41.961053 4.0 NaN NaN NaN
5 52.510507 10.402171 5.0 53.374943 9.660006 5.0
6 NaN NaN NaN -12.248057 52.316722 2.0
7 NaN NaN NaN 80.906307 -41.307435 4.0
Speed
Let's see how efficient this is at scale.
df0, df1 = gen(100_000)
%timeit latlon_merge(df0, df1)
# 161 ms ± 2.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)