
I have a df, and want to remove all duplicates on ID.

       Name     Symbol         ID
0   ZOO INC     Remove  88579Y101
1   Zoo Inc        ZZZ  88579Y101
2     A Inc        AAA  90138A103
3     a inc.    Remove  90138A103
4    2U Inc       TWUO  90214J101
5      Keep     Remove  111111111

But I only want to remove the duplicate rows where Symbol == 'Remove'. The output should look like:

       Name     Symbol         ID
0   Zoo Inc        ZZZ  88579Y101
1     A Inc        AAA  90138A103
2    2U Inc       TWUO  90214J101
3      Keep     Remove  111111111

I can't use result_df = df.drop_duplicates(subset=['ID'], keep='first') (or keep='last') because the dataset doesn't have a specific pattern. And sorting alphabetically first won't help either.

And while I know I could replace all Remove values with NaN and then use the solution provided here, I am looking for an alternate solution because I may eventually need to pass a list of strings instead of a single value.
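For context, the replace-with-NaN route I'm describing would look roughly like this (a sketch against the data above; the linked solution may differ in details). Note that it temporarily rewrites Symbol, which is part of why I'd prefer an alternative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['ZOO INC', 'Zoo Inc', 'A Inc', 'a inc.', '2U Inc', 'Keep'],
    'Symbol': ['Remove', 'ZZZ', 'AAA', 'Remove', 'TWUO', 'Remove'],
    'ID': ['88579Y101', '88579Y101', '90138A103', '90138A103', '90214J101', '111111111'],
})

tmp = df.copy()
# Temporarily blank out the flag value so those rows sort last within each ID
tmp['Symbol'] = tmp['Symbol'].replace('Remove', np.nan)
result = (tmp.sort_values('Symbol', na_position='last')
             .drop_duplicates('ID', keep='first')
             .sort_index())
# Restore the flag value for surviving rows (e.g. the row with a unique ID)
result['Symbol'] = result['Symbol'].fillna('Remove')
print(result)
```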

Does Pandas support anything like: result_df = df.drop_duplicates(subset=['ID'], keep=(df['Symbol'] != 'Remove'))?

user53526356

1 Answer


Use Series.duplicated with keep=False to flag every row whose ID appears more than once, build a second mask comparing Symbol against 'Remove', combine the two with & (bitwise AND), and invert the result with ~ so that only rows matching both conditions are dropped:

# Mask of all rows whose ID is duplicated (keep=False flags every occurrence)
m1 = df['ID'].duplicated(keep=False)
# Mask of rows flagged for removal
m2 = (df['Symbol'] == 'Remove')

# Keep rows unless they match both masks
df = df[~(m1 & m2)]

print(df)
      Name Symbol         ID
1  Zoo Inc    ZZZ  88579Y101
2    A Inc    AAA  90138A103
4   2U Inc   TWUO  90214J101
5     Keep Remove  111111111
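If you later need to drop on several flag values rather than just 'Remove', the same masking pattern extends naturally with Series.isin. A sketch ('Delete' here is just a hypothetical second flag value, not in your data):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['ZOO INC', 'Zoo Inc', 'A Inc', 'a inc.', '2U Inc', 'Keep'],
    'Symbol': ['Remove', 'ZZZ', 'AAA', 'Remove', 'TWUO', 'Remove'],
    'ID': ['88579Y101', '88579Y101', '90138A103', '90138A103', '90214J101', '111111111'],
})

# Rows whose ID appears more than once (all duplicates, including the first)
m1 = df['ID'].duplicated(keep=False)
# Rows whose Symbol matches any value in the removal list
m2 = df['Symbol'].isin(['Remove', 'Delete'])

# Keep rows unless they are both duplicated and flagged for removal
result = df[~(m1 & m2)].reset_index(drop=True)
print(result)
```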
jezrael
    You may also want to show ID 111111111 in your output (that's the tricky part, but your technique handles it, so my upvote). – DarrylG Mar 22 '20 at 16:24