0

I am trying to merge two dataframes based on matches between pairs of column values. However, the column values are not exact from one dataframe to the next. The pairs are coordinates using the Swiss coordinate system, but measured from a slightly different reference point in each df.

This stackoverflow thread How to find the distance between 2 points in 2 different dataframes in pandas? seems to be a related query, but unfortunately I don't fully understand the response.

Example for my data:

df1 = pd.DataFrame({'Ecode': [2669827.294, 2669634.483, 2669766.266, 2669960.683],
                    'Ncode': [1261034.528, 1262412.587, 1261209.646, 1262550.374],
                    'shape': ['square', 'square', 'triangle', 'circle']})

df1
     Ecode            Ncode          shape
0   2669827.294     1261034.528     square
1   2669634.483     1262412.587     square
2   2669766.266     1261209.646     triangle
3   2669960.683     1262550.374     circle


df2 = pd.DataFrame({'CoorE': [2669636, 2669765, 2669827, 2669961],
                    'CoorN': [1262413, 1261211, 1261032, 1262550],
                    'color': ['purple', 'blue', 'blue', 'yellow']})

df2
     CoorE       CoorN      color
0   2669636     1262413     purple
1   2669765     1261211     blue
2   2669827     1261032     blue
3   2669961     1262550     yellow

I have data I would like to compare located with both sets of coordinates (ex. 'shape' and 'color'). My desired outcome matches the column pairs on the closest match:

     CoorE       CoorN      color   shape
0   2669636     1262413     purple  square
1   2669765     1261211     blue    triangle
2   2669827     1261032     blue    square
3   2669961     1262550     yellow  circle

Is there a way to do this? I have tried to use merge_asof but realized it can't key on two variables. I have also seen threads computing this based on latitude and longitude. I can write a function that treats CoorE/CoorN and Ecode/Ncode as x/y coordinates, and calculates the distance between a pair of coordinates (probably there is a better way, but I am new to this):

import math  
def calculateDistance(x1,y1,x2,y2):  
     dist = math.sqrt((x2 - x1)**2 + (y2 - y1)**2)  
     return dist  
print calculateDistance(x1, y1, x2, y2)

or something like this, but can't figure out how to use this kind of function to compare and match coordinate pairs from two separate dataframes based on least distance. The real data set is also about 3 million entries, and I'm wondering what the least memory intensive way to do this would be.

suzanne
  • 39
  • 4
  • are your co-ordinates in this system? https://epsg.io/21781. if so it appears your sample data is outside CH – Rob Raymond Aug 25 '20 at 11:52
  • the coordinates are from this system: https://www.swisstopo.admin.ch/en/knowledge-facts/surveying-geodesy/reference-systems/switzerland.html - however I scrambled the pair in the example I posted since the data is somewhat sensitive. – suzanne Aug 25 '20 at 15:03
  • I believe I actually found a solution using complex numbers - I created a column for the coordinates as a complex number: 2669636+1262413j, then took the minimum distance between all possible pairs ot get matches. for small examples it seems to be working, I have yet to test on the full data set. – suzanne Aug 25 '20 at 15:05
  • interesting - I thought you were searching for finding nearest pairs, not a value that can be used in `merge_asof()`. have updated answer that does what you want with distance from geometry – Rob Raymond Aug 25 '20 at 15:50

1 Answers1

0

To use libraries to calculate distances you need to be on unified system. From google I believe you are using epsg:21781

  1. first standardise co-ordinate system using pyproj
  2. do a Cartesian product of colors and shapes
  3. calculate distance between these using geopy
  4. you can now select out resulting rows that you want. For purpose of example I've taken nearest when groups by color and shape
import pyproj, geopy.distance
df1 = pd.DataFrame({'Ecode': [2669827.294, 2669634.483, 2669766.266, 2669960.683],
                    'Ncode': [1261034.528, 1262412.587, 1261209.646, 1262550.374],
                    'shape': ['square', 'square', 'triangle', 'circle']})
df2 = pd.DataFrame({'CoorE': [2669636, 2669765, 2669827, 2669961],
                    'CoorN': [1262413, 1261211, 1261032, 1262550],
                    'color': ['purple', 'blue', 'blue', 'yellow']})


# assuming this co-ord system https://epsg.io/21781 then mapping to https://epsg.io/4326
sc = pyproj.Proj("epsg:21781")
dc = pyproj.Proj("epsg:4326")

df1 = df1.assign(
    shape_gps=lambda x: x.apply(lambda r: pyproj.transform(sc, dc, r["Ecode"], r["Ncode"]), axis=1)
)
df2 = df2.assign(
    color_gps=lambda x: x.apply(lambda r: pyproj.transform(sc, dc, r["CoorE"], r["CoorN"]), axis=1)
)

(df1
     .assign(foo=1)
     .merge(df2.assign(foo=1), on="foo")
     .assign(distance=lambda x: x.apply(lambda r: 
                                        geopy.distance.geodesic(r["color_gps"], r["shape_gps"]).km, axis=1))
     .sort_values("distance")
 .groupby(["color","shape"]).agg({"distance":"first","CoorE":"first","CoorN":"first"})
)

updated for nearest merge

If you pick a reference point to calculate distances, you get what you want.

import pyproj, geopy.distance
df1 = pd.DataFrame({'Ecode': [2669827.294, 2669634.483, 2669766.266, 2669960.683],
                    'Ncode': [1261034.528, 1262412.587, 1261209.646, 1262550.374],
                    'shape': ['square', 'square', 'triangle', 'circle']})
df2 = pd.DataFrame({'CoorE': [2669636, 2669765, 2669827, 2669961],
                    'CoorN': [1262413, 1261211, 1261032, 1262550],
                    'color': ['purple', 'blue', 'blue', 'yellow']})


# assuming this co-ord system https://epsg.io/21781 then mapping to https://epsg.io/4326
sc = pyproj.Proj("epsg:21781")
dc = pyproj.Proj("epsg:4326")
# pick a reference point for use in diatnace calcs
refpoint = pyproj.transform(sc, dc, df1.loc[0,["Ecode"]][0], df1.loc[0,["Ncode"]][0])

df1 = df1.assign(
    shape_gps=lambda x: x.apply(lambda r: pyproj.transform(sc, dc, r["Ecode"], r["Ncode"]), axis=1),
    distance=lambda x: x.apply(lambda r: geopy.distance.geodesic(refpoint, r["shape_gps"]).km, axis=1),
).sort_values("distance")
df2 = df2.assign(
    color_gps=lambda x: x.apply(lambda r: pyproj.transform(sc, dc, r["CoorE"], r["CoorN"]), axis=1),
    distance=lambda x: x.apply(lambda r: geopy.distance.geodesic(refpoint, r["color_gps"]).km, axis=1),
).sort_values("distance")

# no cleanup of columns but this works
pd.merge_asof(df1, df2, on="distance", direction="nearest")

Rob Raymond
  • 29,118
  • 3
  • 14
  • 30