fuzzy lookup between 2 series/df.columns

Question

based on this link I was trying to do a fuzzy lookup : Apply fuzzy matching across a dataframe column and save results in a new column between 2 dfs:

import pandas as pd
df1 = pd.DataFrame(data={'Brand_var':['Johnny Walker','Guiness','Smirnoff','Vat 69','Tanqueray']})
df2 = pd.DataFrame(data={'Product':['J.Walker Blue Label 12 CC','J.Morgan Blue Walker','Giness blue 150 CC','tqry qiuyur qtre','v69 g nesscom ui123']})

I have 2 dfs df1 and df2 which needs to be mapped via a fuzzy lookup/any other method which suits.

Below is the code I am using:

from fuzzywuzzy import fuzz
from fuzzywuzzy import process
compare = pd.MultiIndex.from_product([df1['Brand_var'],
                                      df2['Product']]).to_series()
def metrics(tup):
    return pd.Series([fuzz.ratio(*tup),
                      fuzz.token_sort_ratio(*tup)],
                     ['ratio', 'token'])
compare.apply(metrics)
df = compare.apply(metrics).unstack().idxmax().unstack(0)
print(df)

Below is my output:

                             ratio       token
----------------------------------------------------------
Giness blue 150 CC         Guiness      Guiness
J.Morgan Blue Walker       Johnny Walker Johnny Walker 
J.Walker Blue Label 12 CC  Johnny Walker Johnny Walker 
tqry qiuyur qtre           Tanqueray     Tanqueray
v69 g nesscom ui123        Guiness       Guiness

Expected output:

                             ratio       token
----------------------------------------------------------
Giness blue 150 CC          Guiness       Guiness
J.Morgan Blue Walker        None          None
J.Walker Blue Label 12 CC   Johnny Walker Johnny Walker 
tqry qiuyur qtre            Tanqueray     Tanqueray
v69 g nesscom ui123         Vat 69        Vat 69

Any suggestions what could be a better approach(not using fuzzy wuzzy is also fine) to get my desired output?

Thank you in advance. :)

@jezrael : no, i want to know if any better approach exists via fuzzywuzzy or pandas to get to the desired output..? :) — anky, Aug 30 '18 at 09:16
hmm, pandas have no similar method implemented, maybe need different library like `fuzzywuzzy`. I have no experience with it, but I hope get some answer. But you can also explain why there are `None`s ? Because `fuzzywuzzy ` return some value always, so never get it (but maybe i am wrong). — jezrael, Aug 30 '18 at 09:25
@jezrael : by the none i meant that that column shouldnot be mapped. May be that should not exist in the output df. If not possible I am fine with any mapping but the rest of the mapping are the most important. — anky, Aug 30 '18 at 09:27
@jezrael : any breakthroughs? do you think running a fuzzy match str.search and the output of this fuzzywuzzy code might be a better idea? — anky, Aug 30 '18 at 13:33
Unfortunately I have no experience in this field, so no answer and no solutiion for you :( — jezrael, Aug 30 '18 at 13:34
Something like if product contains words like "Giness" or "Gines" or "Guiness" then brand_var is "Guiness". 2nd rule: If the product contains "69" then brand as "Vat 69" and so on. You have to build them according to your data — Rahul Agarwal, Oct 04 '18 at 09:43
@RahulAgarwal: i would definitely be interested whenever you have time. :) Thanks a lot — anky, Oct 04 '18 at 10:14

score 1 · Answer 1 · answered Oct 04 '18 at 10:58

The below code with rules will give you expected output:

import pandas as pd
from fuzzywuzzy import fuzz
df1 = pd.DataFrame(data={'Brand_var':['Johnny Walker','Guiness','Smirnoff','Vat 69','Tanqueray']})
df2 = pd.DataFrame(data={'Product':['J.Walker Blue Label 12 CC','J.Morgan Blue Walker','Giness blue 150 CC','tqry qiuyur qtre','v69 g nesscom ui123']})

Guiness_Beer = ["Giness","Guiness","Gines"]
Johnny_Walker = ["J.Walker","J.walker"]
Tanqueray     =["tqry","Tanqueray","tquery"]
Vat = ["69","Vat69","Vat 69"]

matched_names = []

for row in df1.index:
    brand_name = df2.get_value(row,"Product")
    Rule_Guiness = any(word in brand_name for word in Guiness_Beer)
    Rule_Johnny_Walker = any(word in brand_name for word in Johnny_Walker)
    Rule_Tanqueray = any(word in brand_name for word in Tanqueray)
    Rule_Vat = any(word in brand_name for word in Vat)
    if Rule_Guiness:
        matched_names.append([brand_name,"Guiness"])
    elif Rule_Johnny_Walker:
        matched_names.append([brand_name,"Johnny Walker"])
    elif Rule_Tanqueray:
        matched_names.append([brand_name,"Tanqueray"])
    elif Rule_Vat:
        matched_names.append([brand_name,"Vat 69"])
    else:
        matched_names.append([brand_name,"None"])


df = pd.DataFrame(columns=['Product', 'Brand'], data=matched_names)

You can do more modifications in this like all the dictionaries like Guiness_beer etc. can be configured through excel and you don't have to touch the code if in future you want to add/subtract/modify any keyword.

Thanks, however what if I dont want to hardcode any specific brand? I mean I cannot append all the names since there are 1000s of them — anky, Oct 04 '18 at 11:30
It basically depends on your rules!! How to do want to cluster them? You can also make cluster of brands, store them and pass them as your are passing products! — Rahul Agarwal, Oct 04 '18 at 11:46

fuzzy lookup between 2 series/df.columns

1 Answers1

Linked