2

I am trying to merge 2 dataframes with multiple columns each based on matching values at one of the columns on each of them. This code from @Erfan does a great job fuzzymatching the target columns, but is there a way to carry the rest of columns too. https://stackoverflow.com/a/56315491/12802642

Dataframe

df1 = pd.DataFrame({'Key':['Apple Souce', 'Banana', 'Orange', 'Strawberry', 'John tabel']})
df2 = pd.DataFrame({'Key':['Aple suce', 'Mango', 'Orag','Jon table', 'Straw', 'Bannanna', 'Berry'],
                    'Key23':['1', '2', '3','4', '5', '6', '7'})

Match function from @Erfan as described in link above

def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
"""
df_1 is the left table to join
df_2 is the right table to join
key1 is the key column of the left table
key2 is the key column of the right table
threshold is how close the matches should be to return a match, based on Levenshtein distance
limit is the amount of matches that will get returned, these are sorted high to low
"""
    s = df_2[key2].tolist()

    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))    
    df_1['matches'] = m

    m2 = df_1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
    df_1['matches'] = m2

    return df_1

Calling the function

df = fuzzy_merge(df1, df2, 'Key', 'Key', threshold=80, limit=1)
df.sort_values(by='Key',ascending=True).reset_index()

Result

index   Key            matches
0       Apple Souce    Aple suce
1       Banana         Bannanna
2       John tabel  
3       Orange  
4       Strawberry     Straw

Desired result

index   Key            matches       Key23
0       Apple Souce    Aple suce     1
1       Banana         Bannanna      6
2       John tabel                   
3       Orange                       
4       Strawberry     Straw         5
pyproper
  • 53
  • 6
  • 1
    Welcome to Stack Overflow! Could you please ensure that your question adheres to [How to ask?](https://stackoverflow.com/help/how-to-ask) guideline? Specifically, please provide a precise information on what you have already tried and what you are trying to accomplish. – sophros Feb 24 '20 at 16:19
  • @sophros Thanks! Just updated the post. – pyproper Feb 26 '20 at 16:38

1 Answers1

1

For those who need this. Here's a solution I came up with.
merge = pd.merge(df, df2, left_on=['matches'],right_on=['Key'],how='outer').fillna(0)
From there you can drop unnecessary or duplicate columns and get a clean result like so:
clean = merge.drop(['matches', 'Key_y'], axis=1)

pyproper
  • 53
  • 6