0

I have two dataframes df1 and df2.

df1 = pd.DataFramE({'Name': ['Zebra system','Lion healthcare'], 'Type': ['S','A']})
df2 = pd.DataFrame({'AltName': ['Zebra system llc','abra inc. 54','Lions corp health care','Zebra sys co','lions system atl'], 'Adr': ['45 main st','23 zoo ave', '12 zoo blvd.','56 veg st','23 peach st']})

Example above. df2 has about 300k records and df1 has about 10k records. I want to match df1 Name with df2 AltName and get a new dataframe with df1 and potential matched rows from df2 along with the score. I want to be able to have a score threshold that i can adjust, for instance add matches from df2 that are above 80.

This is what I have right now:

matched=pd.DataFrame({'df1-name':[],'df1-type':[],'df2-name':[],'df2-adr':[]})
for row in df1.index:
    first = df1.loc[row,"Name"]
    type1 = df1.loc[row,"Type"]
    for row2 in df2.index:
        second = df2.loc[row2,"AltName"]
        adr1 = df2.loc[row2,"Adr"]
        matched_token = fuzz.partial_ratio(first,second)
        if matched_token>60:
           matched.loc[row2,"df1-name"]=first
           matched.loc[row2,"df1-type"]=type1
           matched.loc[row2,"df2-name"]=second
           matched.loc[row2,"df2-adr"]=adr1

however this is very slow, code has been running for over 16hrs and going...

Zain
  • 1
  • 1
  • 1
    there's no quick way to do this unfortunately - (happy to be proven wrong) what have you tried so far? – Umar.H Apr 16 '20 at 18:30
  • I am new to python so I havent been able to get too far yet – Zain Apr 16 '20 at 18:39
  • read [this](https://stackoverflow.com/questions/13636848/is-it-possible-to-do-fuzzy-match-merge-with-python-pandas) and [this](https://pbpython.com/record-linking.html) – Umar.H Apr 16 '20 at 18:42
  • Does this answer your question? [is it possible to do fuzzy match merge with python pandas?](https://stackoverflow.com/questions/13636848/is-it-possible-to-do-fuzzy-match-merge-with-python-pandas) – Umar.H Apr 16 '20 at 18:42
  • i want to be able to use fuzz.partial_ratio so fuzzy join doesnt work – Zain Apr 16 '20 at 18:46
  • @Datanovice any ideas of making the above code faster? – Zain Apr 21 '20 at 19:10

0 Answers0