
I have a df, and want to remove all duplicates on ID.

       Name     Symbol         ID
0   ZOO INC     Remove  88579Y101
1   Zoo Inc        ZZZ  88579Y101
2     A Inc        AAA  90138A103
3     a inc.    Remove  90138A103
4    2U Inc       TWUO  90214J101
5      Keep     Remove  111111111

But I only want to remove the duplicate rows where Symbol == 'Remove'. The output should look like:

       Name     Symbol         ID
0   Zoo Inc        ZZZ  88579Y101
1     A Inc        AAA  90138A103
2    2U Inc       TWUO  90214J101
3      Keep     Remove  111111111

I can't use result_df = df.drop_duplicates(subset=['ID'], keep='first') (or keep='last') because the dataset doesn't have a specific pattern. And sorting alphabetically first won't help either.

And while I know I could replace all Remove values with NaN and then use the solution provided here, I am looking for an alternate solution because I may eventually need to pass a list of strings instead of a single value.
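For context, the replace-with-NaN route I'm describing would look roughly like this (a sketch against the data above; the linked solution may differ in details). Note that it temporarily rewrites Symbol, which is part of why I'd prefer an alternative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['ZOO INC', 'Zoo Inc', 'A Inc', 'a inc.', '2U Inc', 'Keep'],
    'Symbol': ['Remove', 'ZZZ', 'AAA', 'Remove', 'TWUO', 'Remove'],
    'ID': ['88579Y101', '88579Y101', '90138A103', '90138A103', '90214J101', '111111111'],
})

tmp = df.copy()
# Temporarily blank out the flag value so those rows sort last within each ID
tmp['Symbol'] = tmp['Symbol'].replace('Remove', np.nan)
result = (tmp.sort_values('Symbol', na_position='last')
             .drop_duplicates('ID', keep='first')
             .sort_index())
# Restore the flag value for surviving rows (e.g. the row with a unique ID)
result['Symbol'] = result['Symbol'].fillna('Remove')
print(result)
```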

Does Pandas support anything like: result_df = df.drop_duplicates(subset=['ID'], keep=(df['Symbol'] != 'Remove'))?

user53526356

1 Answer


Use Series.duplicated with keep=False to flag every row whose ID appears more than once, build a second mask comparing Symbol against 'Remove', combine the two with & (bitwise AND), and invert the result with ~ so that only rows matching both conditions are dropped:

# Mask of all rows whose ID is duplicated (keep=False flags every occurrence)
m1 = df['ID'].duplicated(keep=False)
# Mask of rows flagged for removal
m2 = (df['Symbol'] == 'Remove')

# Keep rows unless they match both masks
df = df[~(m1 & m2)]

print(df)
      Name Symbol         ID
1  Zoo Inc    ZZZ  88579Y101
2    A Inc    AAA  90138A103
4   2U Inc   TWUO  90214J101
5     Keep Remove  111111111
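If you later need to drop on several flag values rather than just 'Remove', the same masking pattern extends naturally with Series.isin. A sketch ('Delete' here is just a hypothetical second flag value, not in your data):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['ZOO INC', 'Zoo Inc', 'A Inc', 'a inc.', '2U Inc', 'Keep'],
    'Symbol': ['Remove', 'ZZZ', 'AAA', 'Remove', 'TWUO', 'Remove'],
    'ID': ['88579Y101', '88579Y101', '90138A103', '90138A103', '90214J101', '111111111'],
})

# Rows whose ID appears more than once (all duplicates, including the first)
m1 = df['ID'].duplicated(keep=False)
# Rows whose Symbol matches any value in the removal list
m2 = df['Symbol'].isin(['Remove', 'Delete'])

# Keep rows unless they are both duplicated and flagged for removal
result = df[~(m1 & m2)].reset_index(drop=True)
print(result)
```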
jezrael
    You may also want to show ID 111111111 in your output (that's the tricky part, but your technique handles it, so my upvote). – DarrylG Mar 22 '20 at 16:24