1

I have the following:

input df -

fruit  uniqueid 
apple   1123
appless 321
banana  623
mango   739
mangos  889

code -

df.loc[:,'fruit_copy'] = df['fruit']
## comparing values from one column to each other
compare = pd.MultiIndex.from_product([df['fruit'],df['fruit_copy']]).to_series()

def metrics(tup):
    return pd.Series([fuzz.ratio(*tup),
                      fuzz.token_sort_ratio(*tup)],
                     ['ratio', 'token'])

compare = compare.apply(metrics)
## only keep higher matches
compare_80 = compare[(compare['ratio'] >=80) & (compare['token'] >=80)]

current output -

                ratio   token
apple   apple   100     100
        appless 83      83
appless apple   83      83
        appless 100     100
banana  banana  100     100
mango   mango   100     100
        mangos  91      91
mangos  mango   91      91
        mangos  100     100

expected outcome first goal -

        index1  index2          ratio token uniqueid 
        apple   1123   apple    100   100   1123  
                       appless  83    83    321
        appless 321    apple    83    83    1123
                       appless  100   100   321
        banana  623    banana   100   100   632
        mango   739    mango    100   100   739
                       mangos   91    91    889
        mangos  889    mango    91    91    739
                       mangos   100   100   889

expected outcome second goal -

        index1  index2          ratio token uniqueid 
        apple   1123   appless  83    83    321  
        mango   739    mangos   91    91    889
        

Can I achieve this by appending the uniqueid to the multivalue index?

a1234
  • 761
  • 3
  • 12
  • 23
  • Yes I was using https://stackoverflow.com/questions/54865890/fuzzy-match-strings-in-one-column-and-create-new-dataframe-using-fuzzywuzzy & https://stackoverflow.com/questions/38577332/apply-fuzzy-matching-across-a-dataframe-column-and-save-results-in-a-new-column as my guidance – a1234 Mar 10 '21 at 17:47
  • Thank you for the advice - added input df – a1234 Mar 10 '21 at 17:52
  • `current output` doesnot match the actual output, please run again – anky Mar 10 '21 at 17:59
  • 1
    Thank you, I reran & updated – a1234 Mar 10 '21 at 18:29

1 Answers1

1

You can try doing this with a cross merge and applying a fuzz ratio later:

s = df['fruit'].str[:2] #if you know how many start char should atleast match(assume 2)

u = df.assign(k=1,s=s).merge(df.drop('uniqueid',1).assign(k=1,s=s)
    ,on=['k','s'],suffixes=('','_y')).drop(['k','s'],1)

u = u[u['fruit'].ne(u['fruit_y'])].copy() #removing same combinations

u = (u.assign(Ratio=[fuzz.ratio(*i) for i in zip(u['fruit'],u['fruit_y'])])
       .sort_values('Ratio',ascending=False).drop_duplicates('fruit')).sort_index()

out = (u[pd.DataFrame(np.sort(u[['fruit','fruit_y']],axis=1),index=u.index)
      .duplicated(keep='last')])

print(out)

   fruit  uniqueid  fruit_y  Ratio
1  apple      1123  appless     83
6  mango       739   mangos     91
anky
  • 74,114
  • 11
  • 41
  • 70