5

I have two dataframes and both dataframes contain Longitudes and Latitudes columns. I want to merge those two dataframes based on Longitudes and Latitudes columns. First, I applied a normal merge function and it produced empty resultant data frame. I investigated found that both dataframes do not have the same Longitudes and Latitudes columns. Then I tried another function called merge_asof and set the direction to nearest, which means merge should be done based on the nearest value of Longitude and Latitude. My code looks like below:

pd.merge_asof(df1,df2,left_on=['x','y'],right_on=['Long','Lat'],direction='nearest')

When I ran the above code, it throws an error message:

raise MergeError("can only asof on a key for left")
pandas.errors.MergeError: can only asof on a key for left

When I Googled the message I found that (https://github.com/pandas-dev/pandas/issues/20369) we couldn't use the combination of columns in merge_asof function. How can I solve this issue?

double-beep
  • 5,031
  • 17
  • 33
  • 41
user2293224
  • 2,128
  • 5
  • 28
  • 52
  • Make sure you have set your index to the keys in both data frames. You can use df.set_index(['Long', 'Long'], inplace=True) – UnbreakableMystic Feb 02 '20 at 22:16
  • Not sure, but can you check by Point column and defining the subract and all other ops in it. Now change your lat, long attr into Point class and keep and single column and pass it to the function. I think this workaround will work. – Harxish Mar 19 '23 at 14:11

1 Answers1

2

This is an old question, but since it was edited this morning, hopefully this answer may still be useful to somebody.

What comes to mind is a nearest neighbor search, to match rows that are closest and within some tolerance. Of course, the distance of latitude, longitude is a bit tricky. A distance of 1 degree longitude at the equator is much larger than a distance of 1 degree close to the poles. Furthermore, crossing from +180 to -180 in longitude can be a very small actual Euclidean distance, despite the large difference of longitudes.

For these reasons, we could do the following:

  1. Compute 3D coordinates for each lat, lon position.
  2. Use scipy.spatial.KDTree to efficiently find neighbors.

But first, let's write a function to generate minimal reproducible examples:

def gen(n, dlatlon=(1, 1)):
    # latitude goes from -90 to 90 degrees, longitude from -180 to 180
    # here we don't care about exceeding those limits -- 2xyz will take
    # care of it
    upper = np.array([90, 180])
    latlon0 = np.random.uniform(-1, 1, (n, 2)) * upper
    latlon1 = latlon0 + np.random.normal(size=(n, 2)) * dlatlon
    i = np.arange(n)
    df0 = pd.DataFrame(latlon0, columns=['lat', 'lon']).assign(a=i)
    df1 = pd.DataFrame(latlon1, columns=['lat', 'lon']).assign(b=i)
    return df0, df1

Example:

np.random.seed(0)  # reproducible example
df0, df1 = gen(6)

>>> df0
         lat         lon  a
0   8.786431   77.468172  0
1  18.497408   16.157946  1
2 -13.742136   52.521881  2
3 -11.234302  141.038280  3
4  83.459297  -41.961053  4
5  52.510507   10.402171  5

>>> df1
         lat         lon  b
0   9.547468   77.589847  0
1  18.941271   16.491620  1
2 -12.248057   52.316722  2
3 -10.921234  140.184185  3
4  80.906307  -41.307435  4
5  53.374943    9.660006  5

Now, let us write a conversion to 3D (similar to my answer here):

R_earth = 6371  # in km, ignoring flattening


def latlon2xyz(latlon, R=R_earth):
    latlon = latlon.to_numpy() if isinstance(latlon, pd.DataFrame) else latlon
    lat, lon = np.deg2rad(latlon).T
    
    # conversion (latitude, longitude, altitude) to (x, y, z)
    # see https://stackoverflow.com/a/10788250/758174
    # we use alt and f = 0 -> F = 1, S = C
    coslat = np.cos(lat)
    sinlat = np.sin(lat)
    C      = 1 / np.sqrt(coslat**2 + sinlat**2)

    x = C * coslat * np.cos(lon)
    y = C * coslat * np.sin(lon)
    z = C * sinlat
    
    return R * np.c_[x, y, z]

def append3d(df):
    return df.assign(**dict(zip('xyz', latlon2xyz(df[['lat', 'lon']]).T)))

On our example data above:

>>> append3d(df0)
         lat         lon  a            x            y            z
0   8.786431   77.468172  0  1366.168859  6146.229829   973.181661
1  18.497408   16.157946  1  5803.197072  1681.366619  2021.274608
2 -13.742136   52.521881  2  3765.522822  4911.207155 -1513.447441
3 -11.234302  141.038280  3 -4858.951861  3929.329400 -1241.208394
4  83.459297  -41.961053  4   539.640845  -485.230993  6329.532340
5  52.510507   10.402171  5  3813.764098   700.106078  5055.165267

>>> append3d(df1)
         lat         lon  b            x            y            z
0   9.547468   77.589847  0  1350.216201  6135.950784  1056.723796
1  18.941271   16.491620  1  5778.118840  1710.637588  2068.019033
2 -12.248057   52.316722  2  3805.920396  4927.257031 -1351.572821
3 -10.921234  140.184185  3 -4804.978175  4005.604427 -1207.045530
4  80.906307  -41.307435  4   756.386071  -664.675318  6290.924243
5  53.374943    9.660006  5  3746.893184   637.776657  5113.088440

Then, we can use KDTree to find nearest neighbors, and restrict those neighbors to within a given maximum Euclidean distance r, corresponding to a given deviation in latitude and longitude (at the equator):

from scipy.spatial import KDTree

def latlon_merge(df0, df1, r=100):
    # r: maximum distance, in km, between two points for them
    # to be considered close neighbors
    kd = KDTree(latlon2xyz(df0[['lat', 'lon']]))
    dist, idx = kd.query(latlon2xyz(df1[['lat', 'lon']]), distance_upper_bound=r)
    key = '_ix_'
    assert key not in df0.columns
    assert key not in df1.columns
    df = pd.merge(
        df0.assign(**{key: np.arange(df0.shape[0])}),
        df1.assign(**{key: idx}),
        'outer', on=key)
    return df.drop(key, axis=1)

Note that this difference r for 2 degrees in both latitude and longitude, at the equator, is about 314.5 km:

r = np.linalg.norm(np.subtract(*latlon2xyz(np.array(((-1, -1), (1, 1))))))

>>> r
314.4668312040188

On our example data:

>>> latlon_merge(df0, df1, r)
       lat_x       lon_x  a      lat_y       lon_y  b
0   8.786431   77.468172  0   9.547468   77.589847  0
1  18.497408   16.157946  1  18.941271   16.491620  1
2 -13.742136   52.521881  2 -12.248057   52.316722  2
3 -11.234302  141.038280  3 -10.921234  140.184185  3
4  83.459297  -41.961053  4  80.906307  -41.307435  4
5  52.510507   10.402171  5  53.374943    9.660006  5

In this case, we merged all the correct rows together. But what if we made the tolerance r small enough that some of the rows above won't match?

>>> latlon_merge(df0, df1, r=150)
       lat_x       lon_x    a      lat_y       lon_y    b
0   8.786431   77.468172  0.0   9.547468   77.589847  0.0
1  18.497408   16.157946  1.0  18.941271   16.491620  1.0
2 -13.742136   52.521881  2.0        NaN         NaN  NaN
3 -11.234302  141.038280  3.0 -10.921234  140.184185  3.0
4  83.459297  -41.961053  4.0        NaN         NaN  NaN
5  52.510507   10.402171  5.0  53.374943    9.660006  5.0
6        NaN         NaN  NaN -12.248057   52.316722  2.0
7        NaN         NaN  NaN  80.906307  -41.307435  4.0

Speed

Let's see how efficient this is at scale.

df0, df1 = gen(100_000)
%timeit latlon_merge(df0, df1)
# 161 ms ± 2.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Pierre D
  • 24,012
  • 7
  • 60
  • 96