I am trying to merge two dataframes based on matches between pairs of column values. However, the column values are not exact from one dataframe to the next. The pairs are coordinates using the Swiss coordinate system, but measured from a slightly different reference point in each df.
This stackoverflow thread How to find the distance between 2 points in 2 different dataframes in pandas? seems to be a related query, but unfortunately I don't fully understand the response.
Example for my data:
df1 = pd.DataFrame({'Ecode': [2669827.294, 2669634.483, 2669766.266, 2669960.683],
'Ncode': [1261034.528, 1262412.587, 1261209.646, 1262550.374],
'shape': ['square', 'square', 'triangle', 'circle']})
df1
Ecode Ncode shape
0 2669827.294 1261034.528 square
1 2669634.483 1262412.587 square
2 2669766.266 1261209.646 triangle
3 2669960.683 1262550.374 circle
df2 = pd.DataFrame({'CoorE': [2669636, 2669765, 2669827, 2669961],
'CoorN': [1262413, 1261211, 1261032, 1262550],
'color': ['purple', 'blue', 'blue', 'yellow']})
df2
CoorE CoorN color
0 2669636 1262413 purple
1 2669765 1261211 blue
2 2669827 1261032 blue
3 2669961 1262550 yellow
I have data I would like to compare located with both sets of coordinates (ex. 'shape' and 'color'). My desired outcome matches the column pairs on the closest match:
CoorE CoorN color shape
0 2669636 1262413 purple square
1 2669765 1261211 blue triangle
2 2669827 1261032 blue square
3 2669961 1262550 yellow circle
Is there a way to do this? I have tried to use merge_asof but realized it can't key on two variables. I have also seen threads computing this based on latitude and longitude. I can write a function that treats CoorE/CoorN and Ecode/Ncode as x/y coordinates, and calculates the distance between a pair of coordinates (probably there is a better way, but I am new to this):
import math
def calculateDistance(x1,y1,x2,y2):
dist = math.sqrt((x2 - x1)**2 + (y2 - y1)**2)
return dist
print calculateDistance(x1, y1, x2, y2)
or something like this, but can't figure out how to use this kind of function to compare and match coordinate pairs from two separate dataframes based on least distance. The real data set is also about 3 million entries, and I'm wondering what the least memory intensive way to do this would be.