Concatenate two dataframes and remove duplicate rows based on column value

Question

I have two dataframes.

df1:

    Name Symbol         ID
0    Jay    N/A    372Y105
1    Ray    N/A    4446100
2   Faye    N/A    484MAA4
3   Maye    N/A    504W308
4    Kay    N/A    782L107
5   Trey    FFF    782L111

df2:

    Name Symbol         ID
0    Jay    AAA    372Y105
1   Faye    CCC    484MAA4
2    Kay    EEE    782L107

If ID matches between df1 and df2, I want to replace the symbol in df1 with the symbol from df2, so that the output looks like:

    Name Symbol         ID
0    Jay    AAA    372Y105
1    Ray    N/A    4446100
2   Faye    CCC    484MAA4
3   Maye    N/A    504W308
4    Kay    EEE    782L107
5   Trey    FFF    782L111

It sounds like I should first concatenate the two dataframes, and then drop duplicates somehow, e.g.,

df3 = pd.concat([df1, df2])
df3 = df3.drop_duplicates(subset='ID', keep='last')

But rather than only keep the first or last duplicate, I only want to remove the ones where symbol = N/A.

In first DataFrame are only `NaN`s for `Symbol` column? Then use `df = df1.drop('Symbol', axis=1).merge(df2, on=['Name','ID'], how='left')` — jezrael, Jan 05 '20 at 15:53
No first Dataframe can have values other than `N/A` too. I've updated the original to reflect that. — user53526356, Jan 05 '20 at 15:55

jezrael · Accepted Answer · 2020-01-05T16:03:23.463

Use merge with left join first and then replace missing values in Symbol column by Symbol_ column:

print (df1.merge(df2, on=['Name','ID'], how='left', suffixes=('', '_')))
   Name Symbol       ID Symbol_
0   Jay    NaN  372Y105     AAA
1   Ray    NaN  4446100     NaN
2  Faye    NaN  484MAA4     CCC
3  Maye    NaN  504W308     NaN
4   Kay    NaN  782L107     EEE
5  Trey    FFF  782L111     NaN

df = (df1.merge(df2, on=['Name','ID'], how='left', suffixes=('', '_'))
         .assign(Symbol = lambda x: x['Symbol'].fillna(x.pop('Symbol_'))))
print (df)
   Name Symbol       ID
0   Jay    AAA  372Y105
1   Ray    NaN  4446100
2  Faye    CCC  484MAA4
3  Maye    NaN  504W308
4   Kay    EEE  782L107
5  Trey    FFF  782L111

Another solution with DataFrame.update:

df1 = df1.set_index(['Name','ID'])
df2 = df2.set_index(['Name','ID'])
df1.update(df2)
df1 = df1.reset_index()
print (df1)
   Name       ID Symbol
0   Jay  372Y105    AAA
1   Ray  4446100    NaN
2  Faye  484MAA4    CCC
3  Maye  504W308    NaN
4   Kay  782L107    EEE
5  Trey  782L111    FFF

When I do this on my full dataset it just returns `df1`. Only difference are the column names. — user53526356, Jan 05 '20 at 16:02
@MSD - Added output after `merge` and also another solution, can you check it? — jezrael, Jan 05 '20 at 16:05

Concatenate two dataframes and remove duplicate rows based on column value

1 Answers1