python find duplicate values across multiple columns and ignore NaN

Question

I am newish to python and in I am over my head.

I have a dataframe that looks like this:

import pandas as pd
df = pd.DataFrame({'cityyear': ['chicago1990', 'detroit2000', 'detroit1999', 'chicago1999', 'detroit1990'], 
               'name1': ['hayden', 'charles', 'daniel', 'james', 'hayden']
               'name2': ['mary', 'mary', 'john',NaN, NaN]
               'name3': ['edward', 'reynolds', 'paula', NaN, NaN]
               'name4': ['charles', 'louse', 'reynolds', NaN, NaN]
               'name5': ['allan', 'william', 'romulus', NaN, NaN]
               'name6': ['edmund', 'elizabeth', 'edmund', NaN, NaN]
               'name7':['jane', NaN, 'william', NaN, NaN]})

I want to locate the moments when the same name name appears in the columns that begin with "name" and then return the two cityyears so it looks like this:

newdf=pd.DataFrame({'newcityyear': ['detroit2000','detroit2000'],
                'newcityear2': ['chicago1990', 'detroit1999'],
                'duplicatename1': ['charles', 'reynolds'],
                'duplicatename2': ['mary', 'william']})

I mainly can't figure out how to find the duplicates across multiple columns. I've started doing this:

    mask= df[['name1', 'name2', 'name3', 'name4', 'name5', 'name6', 'name7']].mask(df.duplicated(['name1', 'name2', 'name3', 'name4', 'name5', 'name6', 'name7']))

Which isn't working.

But neither is this:

    df[df.duplicated(['name1', 'name2', 'name3', 'name4', 'name5', 'name6', 'name7'], keep=False)]

Any help? I am not even sure how to deal with the fact that I've got NaN values that I don't want to count as duplicates. But also can't figure out how to find the duplicates across multiple columns.

read [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and edit your question with a reproducible example please. We cant copy images to try your code, bdw try `df.loc[:,df.columns.str.startswith('name')].duplicated(keep=False)` — anky, Jun 15 '20 at 03:21
Thanks for editing your comment with the data in a better format (btw, you forgot commas to separate the columns when declaring df). Can you please elaborate more about what kind of duplication you're looking to find? I'm a little confused by your example. Are you trying to find any duplicate instances of a name across the multiple columns (holding 'cityyear' constant)? Or maybe you're trying to find if a particular name is duplicated in any given column? Your example doesn't really clarify it for me. — pyman, Jun 15 '20 at 04:12
trying to find if a particular name is duplicated across all columns that begin with "name." but finally figured out a version of how to do it. a wonky one but it worked. thanks! — joefromthebus, Jun 15 '20 at 20:26

score 0 · Accepted Answer · answered Jun 15 '20 at 05:03

Can you try this? First I am unstacking to find the names and only keeping the duplicate names after removing nulls. Then groupby the duplicate names and convert them into list then unlist those into columns

df = df.set_index(['cityyear']).unstack(['cityyear']).reset_index()
df = df[df[0].notnull()]
df = df[df.duplicated(subset = [0],keep=False)].sort_values([0])
df = df.groupby(['cityyear'])[0].agg(list).reset_index()
df.join(pd.DataFrame(df[0].tolist(),index=df.index).add_prefix('duplicatename'))



    cityyear    duplicatename0  duplicatename1  duplicatename2  duplicatename3
0   chicago1990 charles edmund  hayden  mary
1   detroit1990 hayden  None    None    None
2   detroit1999 edmund  reynolds    william None
3   detroit2000 charles mary    reynolds    william

Thhis is the dataframe i used.

df = pd.DataFrame({'cityyear': ['chicago1990', 'detroit2000', 'detroit1999', 'chicago1999', 'detroit1990'], 
               'name1': ['hayden', 'charles', 'daniel', 'james', 'hayden'],
               'name2': ['mary', 'mary', 'john',np.nan, np.nan],
               'name3': ['edward', 'reynolds', 'paula', np.nan, np.nan],
               'name4': ['charles', 'louse', 'reynolds', np.nan, np.nan],
               'name5': ['allan', 'william', 'romulus', np.nan, np.nan],
               'name6': ['edmund', 'elizabeth', 'edmund', np.nan, np.nan],
               'name7':['jane', np.nan, 'william', np.nan, np.nan]})

python find duplicate values across multiple columns and ignore NaN

1 Answers1