Fuzzy match rows in single dataframe to find duplicates in pandas and python

Question

I stumbled across this post that I have been referencing: Apply fuzzy matching across a dataframe column and save results in a new column . The code I am referencing is in the answer section and uses fuzzy wuzzy and pandas. It uses fuzzy wuzzy to fund duplicate rows in 2 dataframes. I am aiming to modify this code so I can check for row duplicates in a single dataframe. Here is the code I have so far:

from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import pandas as pd
import SQLAlchemy
import pyodbc

con = 
sqlalchemy.create_engine('mssql+pyodbc://(localdb)\\LocalDBDemo/master? 
driver=ODBC+Driver+13+for+SQL+Server')

compare = pd.read_sql_table(PIM, con)

def metrics(tup):
return pd.Series([fuzz.ratio(*tup),
                  fuzz.token_sort_ratio(*tup)],
                 ['ratio', 'token'])

compare.apply(metrics)
#df1
#compare.apply(metrics).unstack().idxmax().unstack(0)
#df2
#compare.apply(metrics).unstack(0).idxmax().unstack(0)

Any help would be appreciated! I am still very much a noob so please bear with me. Thanks!

score 1 · Answer 1 · answered Aug 20 '19 at 14:56

1

Try a package called pandas-dedupe. Uses fuzzy matching and machine learning.

https://pypi.org/project/pandas-dedupe/

I know your question is very old, but the above package may still be of assistance.

Did you find a solution in the end?

answered Aug 20 '19 at 14:56

SCool

3,104
4
21
49

score 0 · Answer 2 · answered Dec 19 '20 at 12:29

I agree with @SCool. pandas-dedupe is a very good library that can do exactly this. Here is an example:

import pandas as pd
import pandas_dedupe

df = pd.DataFrame({'name': ['Franc', 'Frank', 'John', 'Michelle', 'Charlotte', 'Carlotte', 'Jonh', 'Filipp', 'Charles', 'Diana', 'Robert', 'Carles', 'Michele']})

dd = pandas_dedupe.dedupe_dataframe(
    df, 
    field_properties = ['name'], 
    canonicalize=True
    )

pandas-dedupe will ask to label some examples as distinct or duplicates. Once done, it will take care of deduplication by returning the old name, canonicalised name as well as the confidence in the results.

I know that the question is old, but I hope that an example can help people find a solution to their problem quicker.

Fuzzy match rows in single dataframe to find duplicates in pandas and python

2 Answers2