I have two dataframes, both have an ID and a Column Name
that contains Strings. They might look like this:
Dataframes:
DF-1 DF-2
--------------------- ---------------------
ID Name ID Name
1 56 aaeessa 1 12 H.P paRt 1
2 98 1o7v9sM 2 76 aa3esza
3 175 HP. part 1 3 762 stakoverfl
4 2 stackover 4 2 lo7v9Sm
I would like to compute the string similarity (Ex: Jaccard, Levenshtein) between one element with all the others and select the one that has the highest score. Then match the two IDs so I can join the complete Dataframes later. The resulting table should look like this:
Result:
Result
-----------------
ID1 ID2
1 56 76
2 98 2
3 175 12
4 2 762
This could be easily achieved using a double for loop, but I'm looking for an elegant (and faster way) to accomplish this, maybe lambdas list comprehension, or some pandas tool. Maybe some combination of groupby
and idxmax
for the similarity score but I can't quite come up with the soltution by myself.
EDIT: The DataFrames are of different lenghts, one of the purposes of this function is to determine which elements of the lesser dataframe appear in the greater dataframe and match those, discarding the rest. So in the resulting table should only appear pairs of IDs that match, or pairs of ID1 - NaN (assuming DF-1
has more rows than DF-2
).