I have the following:
input df -
fruit    uniqueid
apple    1123
appless  321
banana   623
mango    739
mangos   889
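For reproducibility, the input frame above can be rebuilt roughly like this. This is a minimal sketch; storing uniqueid as integers is an assumption, since the table does not show the dtype.

```python
import pandas as pd

# Rebuild the sample input; uniqueid is assumed to be an integer column.
df = pd.DataFrame({
    "fruit": ["apple", "appless", "banana", "mango", "mangos"],
    "uniqueid": [1123, 321, 623, 739, 889],
})
print(df)
```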
code -
import pandas as pd
from fuzzywuzzy import fuzz  # assuming fuzz comes from fuzzywuzzy

df.loc[:, 'fruit_copy'] = df['fruit']

## comparing values from one column to each other
compare = pd.MultiIndex.from_product([df['fruit'], df['fruit_copy']]).to_series()

def metrics(tup):
    # tup is a (fruit, fruit_copy) pair taken from the MultiIndex
    return pd.Series([fuzz.ratio(*tup),
                      fuzz.token_sort_ratio(*tup)],
                     ['ratio', 'token'])

compare = compare.apply(metrics)

## only keep higher matches
compare_80 = compare[(compare['ratio'] >= 80) & (compare['token'] >= 80)]
current output -
                  ratio  token
apple    apple      100    100
         appless     83     83
appless  apple       83     83
         appless    100    100
banana   banana     100    100
mango    mango      100    100
         mangos      91     91
mangos   mango       91     91
         mangos     100    100
expected outcome first goal -
index1          index2   ratio  token  uniqueid
apple    1123   apple      100    100      1123
                appless     83     83       321
appless  321    apple       83     83      1123
                appless    100    100       321
banana   623    banana     100    100       623
mango    739    mango      100    100       739
                mangos      91     91       889
mangos   889    mango       91     91       739
                mangos     100    100       889
expected outcome second goal -
index1          index2   ratio  token  uniqueid
apple    1123   appless     83     83       321
mango    739    mangos      91     91       889
Can I achieve both goals by carrying the uniqueid along in the MultiIndex?
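To make the two goals concrete, here is a minimal sketch of one possible approach, not necessarily the best one. It swaps the MultiIndex.from_product step for a cross join (`how="cross"`, which needs pandas 1.2 or later) so that uniqueid travels with both sides of each pair. So that it runs without fuzzywuzzy, the helpers `ratio` and `token_sort_ratio` are hypothetical stand-ins built on difflib; they are simplified approximations, not the library's own functions. The column names `pos`, `fruit_1`, `uniqueid_1` and so on are also assumptions introduced for illustration.

```python
from difflib import SequenceMatcher

import pandas as pd


def ratio(a, b):
    # Stand-in for fuzz.ratio (assumption): 0-100 similarity via difflib.
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def token_sort_ratio(a, b):
    # Simplified stand-in for fuzz.token_sort_ratio (assumption):
    # lowercase, sort whitespace-separated tokens, then compare.
    def norm(s):
        return " ".join(sorted(s.lower().split()))
    return ratio(norm(a), norm(b))


df = pd.DataFrame({
    "fruit": ["apple", "appless", "banana", "mango", "mangos"],
    "uniqueid": [1123, 321, 623, 739, 889],
})

# Keep each row's position so mirrored pairs can be told apart later.
left = df.reset_index().rename(columns={"index": "pos"})

# Pair every row with every row; uniqueid is carried on both sides.
pairs = left.merge(left, how="cross", suffixes=("_1", "_2"))

names = list(zip(pairs["fruit_1"], pairs["fruit_2"]))
pairs["ratio"] = [ratio(a, b) for a, b in names]
pairs["token"] = [token_sort_ratio(a, b) for a, b in names]

# Only keep higher matches, as in the original filter.
matches = pairs[(pairs["ratio"] >= 80) & (pairs["token"] >= 80)]

index_cols = ["fruit_1", "uniqueid_1", "fruit_2"]
value_cols = ["ratio", "token", "uniqueid_2"]

# First goal: all matches, with the uniqueid of both fruits visible.
first_goal = matches.set_index(index_cols)[value_cols]

# Second goal: drop self-matches and mirrored duplicates by keeping
# only pairs whose first row comes before the second row in df.
second_goal = (
    matches[matches["pos_1"] < matches["pos_2"]]
    .set_index(index_cols)[value_cols]
)

print(first_goal)
print(second_goal)
```

Filtering on row position rather than on uniqueid keeps apple (1123) ahead of appless (321), which matches the ordering in the second expected table.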