0

I'm very new to Python. How can I match one text dataframe to another? (kindly please edit this question if I ask this wrongly)

For example given this input data:

 df1 =
          id  Names 
        0 123 Simpson J.
        1 456 Snoop Dogg

 df2 =
            Names 
         0  John Simpson
         1  Snoop Dogg
         2  S. Dogg
         3  Mr Dogg

Is there a way I could find (maybe using findall or match, or any python packages) so that I could produce how many times the names with the id has appeared which almost like this result:

result = 
              id  Names_appeared 
            0 123   1 
            1 456   3

Looking for a brief explanation and some URL to help me understand.

smci
  • 32,567
  • 20
  • 113
  • 146
Ralph Deint
  • 380
  • 1
  • 4
  • 15
  • 1
    Can you explain the `result` you've not explained how the `Names` are matching each other as it looks like you're doing some kind of fuzzy match or just matching on any word in `df1` in any words in the `Names` col in `df2`? – EdChum May 20 '16 at 09:24
  • @EdChum I'm trying to see, if there is some match maybe with more than 70% match like Snoop Dogg from 'df1' match almost 80% with S. Dogg from 'df2' and Mr. Dogg we can remove the "Mr." which will left with just Dogg which match 70% from Snoop Dogg 'df1' – Ralph Deint May 20 '16 at 09:35
  • Again what determines 70/80% match? Jaccard similarity? character % match, levenshstein distance? etc you need to explain better – EdChum May 20 '16 at 09:37
  • A look at Python's [`fuzzywuzzy` module](https://pypi.python.org/pypi/fuzzywuzzy) would be a good start. – IanS May 20 '16 at 09:49
  • @EdChum I don't know how to explain the actual percentage, but looking at the dataframe I just gave you the estimation percentage of the similarities of the data. Btw can you briefly explain/provide url for the jaccard, character%match etc. So that I can look into it and have some idea to elaborate on the question. Fuzzy matching seems like a good module, but I want your opinion also on this. Thank you so much. – Ralph Deint May 20 '16 at 14:10
  • Do you mean **compare dataframes for equality**? Do you only want to **compare their shape and values**, or also **each column datatypes** (e.g. so string '123' != integer '123' or float '123.'), or also **column names**? Do you want a rough comparison or an exact comparison? i.e. can it fail if there are a few extra rows? even containing NA/NaN/NaTs? or categorical levels that are missing in first dataframe? duplicate rows? etc. – smci Mar 05 '20 at 03:57
  • In your example, you're **not** comparing the entire dataframe `df1` to `df2`. You're only comparing the column `df1['Names']` to `df2['Names']`, in a normalized, duplicate-aware way; you have to normalize `John Simpson` to `Simpson J.`; do you just assume firstname-lastname are interchangeable? Also, you're throwing away the id columns. – smci Mar 05 '20 at 04:01
  • Related: [Searching one Python dataframe / dictionary for fuzzy matches in another dataframe](https://stackoverflow.com/questions/41455093/searching-one-python-dataframe-dictionary-for-fuzzy-matches-in-another-datafra) – smci Mar 06 '20 at 08:21
  • Related: [Pandas fuzzy merge/match name column, with duplicates](https://stackoverflow.com/questions/19964546/pandas-fuzzy-merge-match-name-column-with-duplicates) – smci Mar 06 '20 at 08:29

1 Answers1

1

Here's an example using fuzzywuzzy as IanS suggested:

import pandas as pd
from fuzzywuzzy import fuzz


def fuzz_count(shortList, longList, minScore):
    """ Count fuzz ratios greater than or equal to a minimum score. """
    results = []
    for s in shortList:
        scores = [fuzz.ratio(s, l) for l in longList]
        count = sum(x >= minScore for x in scores)
        results.append(count)
    return results


data1 = {'id': pd.Series([123, 456]),
         'Names': pd.Series(['Simpson J.', 'Snoop Dogg'])}
data2 = {'Names': pd.Series(['John Simpson', 'Snoop Dogg', 'S. Dogg', 'Mr Dogg'])}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

result = pd.DataFrame()
result['id'] = df1['id']
counts = fuzz_count(df1['Names'], df2['Names'], minScore=60)  # [1, 2]
result['Names_appeared'] = counts

print(result)  #     id  Names_appeared
               # 0  123               1
               # 1  456               2
brennan
  • 3,392
  • 24
  • 42