I am newish to python and in I am over my head.
I have a dataframe that looks like this:
import pandas as pd
df = pd.DataFrame({'cityyear': ['chicago1990', 'detroit2000', 'detroit1999', 'chicago1999', 'detroit1990'],
'name1': ['hayden', 'charles', 'daniel', 'james', 'hayden']
'name2': ['mary', 'mary', 'john',NaN, NaN]
'name3': ['edward', 'reynolds', 'paula', NaN, NaN]
'name4': ['charles', 'louse', 'reynolds', NaN, NaN]
'name5': ['allan', 'william', 'romulus', NaN, NaN]
'name6': ['edmund', 'elizabeth', 'edmund', NaN, NaN]
'name7':['jane', NaN, 'william', NaN, NaN]})
I want to locate the moments when the same name name appears in the columns that begin with "name" and then return the two cityyears so it looks like this:
newdf=pd.DataFrame({'newcityyear': ['detroit2000','detroit2000'],
'newcityear2': ['chicago1990', 'detroit1999'],
'duplicatename1': ['charles', 'reynolds'],
'duplicatename2': ['mary', 'william']})
I mainly can't figure out how to find the duplicates across multiple columns. I've started doing this:
mask= df[['name1', 'name2', 'name3', 'name4', 'name5', 'name6', 'name7']].mask(df.duplicated(['name1', 'name2', 'name3', 'name4', 'name5', 'name6', 'name7']))
Which isn't working.
But neither is this:
df[df.duplicated(['name1', 'name2', 'name3', 'name4', 'name5', 'name6', 'name7'], keep=False)]
Any help? I am not even sure how to deal with the fact that I've got NaN values that I don't want to count as duplicates. But also can't figure out how to find the duplicates across multiple columns.