Remove one of duplicate value in two columns of dataframe

Question

I am working on google collaboratory and I have two column on panda dataframe which some of the rows has similar value

A   B
Syd Syd
Aus Del
Mir Ard
Dol Dol

I wish that the value in column B which has duplicate value with column A to be deleted, like below :

A   B
Syd 
Aus Del
Mir Ard
Dol

I try to use drop_duplicates() like this one Remove duplicates from dataframe, based on two columns A,B, keeping row with max value in another column C but it will delete the entire column B. Any suggestions smarter ways to solve this problem?

Thanks before!

score 2 · Accepted Answer · answered Nov 21 '21 at 08:58

2

There is no need to use drop_duplicates, you can simply compare the column A with B, then mask the values in B where they are equal to A

df['B'] = df['B'].mask(df['A'].eq(df['B']))

Alternatively you can also use boolean indexing with loc to mask the duplicated values

df.loc[df['A'].eq(df['B']), 'B'] = np.nan

     A    B
0  Syd  NaN
1  Aus  Del
2  Mir  Ard
3  Dol  NaN

answered Nov 21 '21 at 08:58

Shubham Sharma

Hi ! Thanks for the answer, it really a very simple and minimalistic approach. Just want to make sure, after we mask the duplicated value with NaN we can use `df.fillna('',inplace=True)` to remove all the NaN. Is that right? – Fendy Sion 04 Nov 21 '21 at 16:53
Glad it worked. You can mask with empty string instead of `nan`, no need to use `fillna` afterwards. Check `df.loc[df['A'].eq(df['B']), 'B'] = ''` – Shubham Sharma Nov 21 '21 at 17:12
It works like magic! Every duplicate value now disappear. Many thanks for the answer. Let me vote your comments as the best answer. Best regards – Fendy Sion 04 Nov 21 '21 at 22:17

1 Answers1