
I have two DataFrames; both have an ID column and a Name column containing strings. They might look like this:

Dataframes:

DF-1                              DF-2
---------------------             ---------------------
     ID          Name                  ID          Name
1    56       aaeessa             1    12    H.P paRt 1 
2    98       1o7v9sM             2    76       aa3esza
3   175    HP. part 1             3   762    stakoverfl 
4     2     stackover             4     2       lo7v9Sm

I would like to compute the string similarity (e.g. Jaccard, Levenshtein) between each element and all the others, and select the one with the highest score. Then I would match the two IDs so I can join the complete DataFrames later. The resulting table should look like this:

Result:

Result
-----------------
     ID1     ID2
1    56       76
2    98        2
3   175       12
4     2      762

This could easily be achieved with a double for loop, but I'm looking for an elegant (and faster) way to accomplish it: maybe lambdas, a list comprehension, or some pandas tool. Perhaps some combination of groupby and idxmax on the similarity score, but I can't quite come up with the solution myself.

EDIT: The DataFrames are of different lengths; one purpose of this function is to determine which elements of the shorter DataFrame appear in the longer one and match those, discarding the rest. So the resulting table should contain only pairs of IDs that match, or pairs of ID1 - NaN (assuming DF-1 has more rows than DF-2).
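For reference, a minimal sketch of the brute-force idea described above, using only the standard library's `difflib.SequenceMatcher` as the similarity measure and `idxmax` to pick the best candidate (the function name `best_id` and the choice of scorer are illustrative, not from any answer below):

```python
import difflib
import pandas as pd

df1 = pd.DataFrame({'ID': [56, 98, 175, 2],
                    'Name': ['aaeessa', '1o7v9sM', 'HP. part 1', 'stackover']})
df2 = pd.DataFrame({'ID': [12, 76, 762, 2],
                    'Name': ['H.P paRt 1', 'aa3esza', 'stakoverfl', 'lo7v9Sm']})

def best_id(name, frame):
    # score `name` (case-insensitively) against every Name in `frame`
    scores = frame['Name'].map(
        lambda c: difflib.SequenceMatcher(None, name.lower(), c.lower()).ratio())
    # return the ID of the highest-scoring row
    return frame.loc[scores.idxmax(), 'ID']

result = pd.DataFrame({'ID1': df1['ID'],
                       'ID2': df1['Name'].map(lambda n: best_id(n, df2))})
print(result)
```

Note this is still quadratic in the number of rows; it only replaces the explicit double loop with pandas mapping. A score cutoff would also be needed to produce the ID1 - NaN rows mentioned in the edit.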

3 Answers


Using the pandas dedupe package: https://pypi.org/project/pandas-dedupe/

You need to train the classifier with human input, and it will then use the learned settings to match the whole DataFrame.

First `pip install pandas-dedupe`, then try this:

import pandas as pd
import pandas_dedupe

df1=pd.DataFrame({'ID':[56,98,175],
                 'Name':['aaeessa', '1o7v9sM', 'HP. part 1']})

df2=pd.DataFrame({'ID':[12,76,762,2],
                 'Name':['H.P paRt 1', 'aa3esza', 'stakoverfl ', 'lo7v9Sm']})


#initiate matching
df_final = pandas_dedupe.link_dataframes(df1, df2, ['Name'])

# reset index
df_final = df_final.reset_index(drop=True)

# print result
print(df_final)

    ID        Name  cluster id  confidence
0   98     1o7v9sm         0.0    1.000000
1    2     lo7v9sm         0.0    1.000000
2  175  hp. part 1         1.0    0.999999
3   12  h.p part 1         1.0    0.999999
4   56     aaeessa         2.0    0.999967
5   76     aa3esza         2.0    0.999967
6  762  stakoverfl         NaN         NaN

You can see matched pairs are assigned a cluster and a confidence level; unmatched rows are NaN. You can now analyse this information however you wish, perhaps only keeping results with a confidence level above 80%, for example.
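If you need the ID1/ID2 format from the question, the cluster ids can be pivoted into pairs. A sketch, with `df_final` rebuilt by hand here for illustration, and assuming (as in the output above) each cluster lists the DF-1 row before the DF-2 row, which may not hold in general:

```python
import pandas as pd

# df_final as printed above, reconstructed by hand for illustration
df_final = pd.DataFrame({
    'ID': [98, 2, 175, 12, 56, 76, 762],
    'Name': ['1o7v9sm', 'lo7v9sm', 'hp. part 1', 'h.p part 1',
             'aaeessa', 'aa3esza', 'stakoverfl'],
    'cluster id': [0.0, 0.0, 1.0, 1.0, 2.0, 2.0, None],
    'confidence': [1.0, 1.0, 0.999999, 0.999999, 0.999967, 0.999967, None],
})

# drop unmatched rows, then keep only confident clusters
matched = df_final.dropna(subset=['cluster id'])
matched = matched[matched['confidence'] > 0.8]

# pair up the two IDs within each cluster
pairs = (matched.groupby('cluster id')['ID']
                .agg(['first', 'last'])
                .rename(columns={'first': 'ID1', 'last': 'ID2'})
                .reset_index(drop=True))
print(pairs)
```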

SCool
  • While the output is not exactly the format you wanted, I'm sure you could create another stackoverflow question to get it into the format needed. I wasn't able to do it! (still learning) – SCool Sep 20 '19 at 10:20
  • I'm using Anaconda on Windows and can't get this package to work. I installed it via pip3 because the conda version is Linux-only, and somehow the package can't be found :S – Ferran Capallera Guirado Sep 20 '19 at 12:46
  • I am using it on windows 10 also with anaconda. What's the error message? – SCool Sep 20 '19 at 12:54
  • What happens when you type `pip install pandas-dedupe` ? – SCool Sep 20 '19 at 12:54
  • It works fine, but then it looks like the package can't be found in Jupyter Notebook if it wasn't installed through conda. I don't know if it's something path-related; this is what my PATH looks like right now: [Path](https://gyazo.com/713c7775ea0defd64273bbfd6a01ff3c) – Ferran Capallera Guirado Sep 20 '19 at 13:02
  • I have no idea what's causing the problem, it looks like you have to start googling. Did you install it into a different environment by any chance? By the way I don't have any python stuff in my PATH. I use Anaconda Prompt. – SCool Sep 20 '19 at 13:08
  • I don't have any environments... The thing is, pip3 and `conda install` put packages in different directories, and somehow packages installed through pip3 can't be used in Jupyter Notebook for me. Thanks for the answer though, I'll start googling. – Ferran Capallera Guirado Sep 20 '19 at 13:26
  • **I'll correct and point out what my mistake was:** using pip3 installs packages in a different directory. Using pip in the Anaconda Prompt works, and now it's working! – Ferran Capallera Guirado Sep 20 '19 at 13:30

I suggest a library called the Python Record Linkage Toolkit.

Once you import the library, you must index the sources you intend to compare, something like this:

 import recordlinkage

 indexer = recordlinkage.Index()

 # block on the 'id' column to generate candidate pairs
 indexer.block('id')
 candidate_links = indexer.index(df_1, df_2)

 c = recordlinkage.Compare()

Let's say you want to compare based on the similarity of strings that don't match exactly:

 c.string('name', 'name', method='jarowinkler', threshold=0.85)

And if you want an exact match, you should use:

 c.exact('name', 'name')

Finally, compute the comparison vectors for the candidate pairs:

 features = c.compute(candidate_links, df_1, df_2)
powerPixie

Using my `fuzzy_merge` function from the linked answer:

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

mrg = fuzzy_merge(df1, df2, 'Name', 'Name', threshold=70)\
      .merge(df2, left_on='matches', right_on='Name', suffixes=['1', '2'])\
      .filter(like='ID')

Output

   ID1  ID2
0   56   76
1   98    2
2  175   12
3    2  762
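The `fuzzy_merge` helper itself is defined in the linked answer (it uses fuzzywuzzy's `process.extract` under the hood). As a self-contained stand-in, here is a sketch with the same interface built on the standard library's `difflib` instead, so its scores differ slightly from fuzzywuzzy's:

```python
import difflib
import pandas as pd

def fuzzy_merge(df_1, df_2, key1, key2, threshold=70):
    # stand-in for the linked fuzzy_merge: attach the closest df_2[key2]
    # value as a 'matches' column on a copy of df_1, or '' if nothing
    # clears the threshold (given as a 0-100 score, like fuzzywuzzy's)
    candidates = df_2[key2].tolist()
    lowered = [c.lower() for c in candidates]

    def closest(name):
        hits = difflib.get_close_matches(name.lower(), lowered,
                                         n=1, cutoff=threshold / 100)
        # map the lower-cased hit back to the original-cased candidate
        return candidates[lowered.index(hits[0])] if hits else ''

    out = df_1.copy()
    out['matches'] = out[key1].map(closest)
    return out

df1 = pd.DataFrame({'ID': [56, 98, 175, 2],
                    'Name': ['aaeessa', '1o7v9sM', 'HP. part 1', 'stackover']})
df2 = pd.DataFrame({'ID': [12, 76, 762, 2],
                    'Name': ['H.P paRt 1', 'aa3esza', 'stakoverfl', 'lo7v9Sm']})

mrg = fuzzy_merge(df1, df2, 'Name', 'Name', threshold=70)\
      .merge(df2, left_on='matches', right_on='Name', suffixes=['1', '2'])\
      .filter(like='ID')
print(mrg)
```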
Erfan