I have two dataframes (~10k rows each) and, for each entry in one, I want to find the best match in the other.
Specifically, here is an example:
For each entry in the user column of df1, I want to find the best fuzzy match in the corresponding column of df2. Then I want to include the matched entry and its location in the final dataframe.
import numpy as np
import pandas as pd

np.random.seed(123)
df1 = pd.DataFrame({'user': ["aparna", "pankaj", "sudhir", "Geeku"],
                    'location': np.random.choice([5, 7, 3], 4)})
df2 = pd.DataFrame({'user': ["aparn", "arup", "Pankaj", "sudhir c", "Geek", "abc"],
                    'location': np.random.choice([5, 7, 3], 6)})
Each dataframe looks like this:
       user  location
0     aparn         5
1      arup         3
2    Pankaj         3
3  sudhir c         7
4      Geek         3
5       abc         7
And the final result should look like this:
  matching_user  location1 matched_user  location
0         aparn          5       aparna         7
1        pankaj          3       Pankaj         5
2        sudhir          7     sudhir c         7
...
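
To make the matching step concrete, here is a minimal sketch of what I mean by "best fuzzy match". It assumes the standard library's difflib.SequenceMatcher ratio as the similarity score and a case-insensitive comparison; both are my assumptions, and any other scorer could be swapped in. best_match is a hypothetical helper name, and the two lists stand in for df1['user'] and df2['user'].

```python
from difflib import SequenceMatcher


def best_match(name, candidates):
    """Return (index, candidate, score) for the candidate most similar to name."""
    best = (None, None, -1.0)
    for i, cand in enumerate(candidates):
        # Assumption: compare case-insensitively, so "pankaj" and "Pankaj" score 1.0
        score = SequenceMatcher(None, name.lower(), cand.lower()).ratio()
        if score > best[2]:
            best = (i, cand, score)
    return best


# Stand-ins for df1['user'] and df2['user'] from the example above
users1 = ["aparna", "pankaj", "sudhir", "Geeku"]
users2 = ["aparn", "arup", "Pankaj", "sudhir c", "Geek", "abc"]

# One (index_in_users2, matched_user, score) tuple per entry of users1
matches = [best_match(u, users2) for u in users1]
```

The returned index is the position of the match in df2, so the matched location could be looked up with df2['location'].iloc[index]. This compares every pair, which for ~10k rows on each side is about 10^8 comparisons, so I would expect it to be slow at full size.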