0

I have two files that contains the full names of people from two different sources. I want to create a column that will determine whether the names match, have additional names or no match. I have merged both files using a common ID. How do I do this in python? enter image description here

I tried this code but it just showed exact match and every other thing as no match.

import numpy as np
df['Match1'] = np.where(
    df['t1_full_name'].str.split().apply(sorted).apply(Counter)
    == df['t2_full_name'].str.split().apply(sorted).apply(Counter),
    'match', 'no match'
)
Tryph
  • 5,946
  • 28
  • 49
  • 1
    StackOverflow is a platform to ask questions about precise programming topics, not a place where you can ask for someone to write a code. please share more information about you data format and what you tried for now. – Tryph Jul 20 '22 at 09:41
  • @Tryph I used this code but it just showed the exact match and every other thing not a match. import numpy as np df['Match1'] = np.where(df['t1_full_name'].str.split().apply(sorted).apply(Counter) == df['t2_full_name'].str.split().apply(sorted).apply(Counter), 'match', 'no match') – Ivy Odametey Jul 20 '22 at 10:50

3 Answers3

0

Pandas does the job assuming you have your data into a DataFrame. If not you can easily convert it from excel or csv or sql.

String Search inspired here.

Notes: Logical operators for Boolean indexing in Pandas


import pandas as pd

df = pd.DataFrame({'name1': ['Mary', 'John', 'Tom'], 'name2': [
                  'Mary Poppins', 'John', 'Carol']})

df['Match'] = (df['name1'] == df['name2'])
df['Inside'] = (df['name1'].str.contains("|".join(list(df['name2'].values)))
                | df['name2'].str.contains("|".join(list(df['name1'].values))))
            # isin was not working with strings

df['Result'] = None #Generates an empty column
df['Result'].loc[~(df['Match'] & df['Inside'])] = "Do not Match"
df['Result'].loc[(df['Match'] & df['Inside'])] = "Match"
df['Result'].loc[(~df['Match'] & df['Inside'])] = "Contains aditional names"
df.drop(['Match','Inside'],axis=1,inplace=True) #Droping intermediate columns
print(df)

  name1         name2                    Result
0  Mary  Mary Poppins  Contains aditional names
1  John          John                     Match
2   Tom         Carol              Do not Match
  • Your code really works well. The only exception is when the names are interchanged eg (Adam Bay and Bay Adam) , it shows as contains additional names. – Ivy Odametey Jul 20 '22 at 11:04
  • I have been able to resolve the issue with the interchanged names. I just added a filter to the match column. Thanks a lot. This is the code: df2['Match'] = (df2['t1_full_name'].str.split().apply(sorted) == df2['t2_full_name'].str.split().apply(sorted)). – Ivy Odametey Jul 20 '22 at 11:44
  • I just realised there are some names that do not match at all but the result column assigns 'Contains additional Names'. Any way around it? – Ivy Odametey Jul 20 '22 at 13:13
  • That's fairly odd. It would be good if you could provide an example. I would start by looking at 'Match' and 'Inside' Columns to see if it makes sense. – Daniel Gonçalves Jul 20 '22 at 21:57
  • An example is name1(Catherine Anderson) and name2(Jean Wendel). The result column shows "contains additional columns" – Ivy Odametey Jul 21 '22 at 09:57
  • I have just run that example and the code returned "Do not Match" as expected. Make sure that the line that sets Contains Additional names is correct, namely that the `~`is inside the parentheses and not the other way around. – Daniel Gonçalves Jul 21 '22 at 10:14
0

This is very easy with pandas first convert your column to a set of each part of name. use from apply method, it is optimal for this.

then compare two columns based on isdisjoint method in set class and also check the equality with first strings columns.

This way is very optimal for large datasets

Alireza75
  • 513
  • 1
  • 4
  • 19
0

I finally found my way around it.

    t1_name_set = set(t1_name.split(" "))    
    t2_name_set = set(t2_name.split(" "))
    are_subsets = t1_name_set.issubset(t2_name_set) or t1_name_set.issubset(t2_name_set) 
    if are_subsets:
        if (len(t1_name_set) == len(t2_name_set)):
            return "Match"
        else:
            return "Contains Additional Name"
    else:
        return "Do not Match"
    matching_series = []
for index, row in df.iterrows():
   match_outcome = is_match(row['t1_full_name'], row['t2_full_name'])
   matching_series.append(match_outcome)
   
df["Match"] = matching_series
df    
  • If column1 contains `I am good` and column2 contains `I am good good` what will be the output of `set` in this case? – GodWin1100 Jul 21 '22 at 17:41