5

How can I use fuzzy matching in pandas to detect duplicate rows (efficiently)?


How can I find duplicates of one column vs. all the other ones without a gigantic for loop that converts each row_i to a string (toString()) and then compares it to all the others?

Georg Heiler
  • FuzzyWuzzy is an implementation of edit distance, which would be a good candidate for building a pairwise distance matrix in numpy or similar. To detect "duplicates" or near matches, you'll have to at least compare each row to the other rows, or you'll never know if two are close to each other. See http://stackoverflow.com/questions/24089973/python-numpy-pairwise-edit-distance for a solution using pdist in scipy. – binaryaaron Sep 14 '16 at 13:00
  • You could potentially approximate it -- see http://cs.stackexchange.com/questions/2093/efficient-map-data-structure-supporting-approximate-lookup/2096#2096 – binaryaaron Sep 14 '16 at 13:06
  • or get fancy: https://en.wikipedia.org/wiki/BK-tree. Not sure if any of those are particularly helpful for your case. – binaryaaron Sep 14 '16 at 13:07
  • Thanks - I will need to look into that. Would you recommend to perform the distance operation row-wise or would you suggest to "add" up the distances of each field? – Georg Heiler Sep 14 '16 at 13:59
  • This seems to be interesting https://gist.github.com/nibogd/94363e93f4e0256b4665eb743dbfa211 - they mention the indexing time is slow but surely not as slow as n^2? – Georg Heiler Sep 14 '16 at 14:05
  • I updated the notebook and wonder why I cannot set an arbitrary string distance function e.g. one from fuzzywuzzy as a distance metric – Georg Heiler Sep 14 '16 at 15:01
  • @mwormser which element would you consider for the root or would you create a separate tree per row? – Georg Heiler Sep 14 '16 at 16:17
  • I found https://github.com/ekzhu/datasketch/blob/master/README.md and https://cran.r-project.org/web/packages/textreuse/vignettes/textreuse-minhash.html; for now I will look a bit more into the Python variant (see the MinHash/LSH sketch after these comments). – Georg Heiler Sep 14 '16 at 18:42
  • You can use Scikit-learn for that. They have an LSH feature hasher that works well with strings. I thought you wanted to use edit distance, but standard similarity search might work well for you. Good luck. – binaryaaron Sep 14 '16 at 18:57
  • Not necessarily. Just want to find the duplicates. Would you suggest to only use lsh or the combination with min hash – Georg Heiler Sep 14 '16 at 19:25
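
For reference, the datasketch approach mentioned in the comments above can be sketched roughly as follows. Note that it approximates Jaccard similarity over row shingles rather than edit distance, and the tokenisation (character 3-grams) and threshold below are assumptions for illustration, not something from the original question:

import pandas as pd
from datasketch import MinHash, MinHashLSH

df = pd.DataFrame({'name': ['jon smith', 'jon smith', 'mark twain'],
                   'city': ['london', 'londn', 'berlin']})

def row_minhash(row, num_perm=128):
    # Hash character 3-grams of the whole row so near-identical rows
    # (including small typos) end up with similar MinHash signatures.
    text = ' '.join(map(str, row))
    m = MinHash(num_perm=num_perm)
    for i in range(len(text) - 2):
        m.update(text[i:i + 3].encode('utf8'))
    return m

# Index every row once; querying the index avoids the n^2 pairwise loop.
# The threshold is a guess - tune it to your notion of "near duplicate".
lsh = MinHashLSH(threshold=0.6, num_perm=128)
minhashes = {}
for idx, row in df.iterrows():
    minhashes[idx] = row_minhash(row)
    lsh.insert(str(idx), minhashes[idx])

for idx, m in minhashes.items():
    candidates = [key for key in lsh.query(m) if key != str(idx)]
    if candidates:
        print(f'row {idx} looks similar to rows {candidates}')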

3 Answers

5

Not pandas-specific, but within the Python ecosystem the dedupe library would seem to do what you want. In particular, it allows you to compare each column of a row separately and then combine the information into a single probability score for a match.
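
A minimal sketch of how that could look, assuming the dedupe 2.x API (the record ids, column names, and threshold here are made up for illustration and may need adjusting for your version):

import dedupe

# Records keyed by an id; the fields here are only an example.
data = {
    0: {'name': 'jon',  'email': 'email1'},
    1: {'name': 'john', 'email': 'email1'},
    2: {'name': 'mark', 'email': 'email2'},
}

# Describe how each column should be compared.
fields = [
    {'field': 'name',  'type': 'String'},
    {'field': 'email', 'type': 'String'},
]

deduper = dedupe.Dedupe(fields)
deduper.prepare_training(data)

# Active learning: you label a handful of candidate pairs as duplicate / distinct.
dedupe.console_label(deduper)
deduper.train()

# Group records whose combined match probability exceeds the threshold.
clusters = deduper.partition(data, threshold=0.5)
for record_ids, confidences in clusters:
    print(record_ids, confidences)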

fgregg
2

pandas-dedupe is your friend here. You can try to do the following:

import pandas as pd
from pandas_dedupe import dedupe_dataframe

df = pd.DataFrame.from_dict({
    'bank':  ['bankA',  'bankA',  'bankB',  'bankX'],
    'email': ['email1', 'email1', 'email2', 'email3'],
    'name':  ['jon',    'john',   'mark',   'pluto'],
})

# Dedupe on the three columns; you will be asked to label a few example pairs.
dd = dedupe_dataframe(df, ['bank', 'name', 'email'], sample_size=1)
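
The returned frame groups likely duplicates together; a sketch of inspecting the candidates (the 'cluster id' and 'confidence' column names are an assumption here, check the output of your pandas-dedupe version):

# Rows sharing a cluster id are candidate duplicates
# (column names assumed; verify against your pandas-dedupe output).
dupes = dd[dd.duplicated('cluster id', keep=False)]
print(dupes.sort_values('cluster id'))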

If you also want to assign a canonical name to the same entities, set canonicalize=True.

[I'm one of the pandas-dedupe contributors.]

iEriii
0

There is now a package to make it easier to use the dedupe library with pandas: pandas-dedupe

(I am a developer of the original dedupe library, but not the pandas-dedupe package)

fgregg