I have a dataset:

id    url     keep_if_dup
1     A.com   Yes
2     A.com   Yes
3     B.com   No
4     B.com   No
5     C.com   No

I want to remove duplicates, i.e. keep only the first occurrence of each value in the "url" field, BUT keep the duplicates when the "keep_if_dup" field is "Yes".

Expected output:

id    url     keep_if_dup
1     A.com   Yes
2     A.com   Yes
3     B.com   No
5     C.com   No

What I tried:

Dataframe=Dataframe.drop_duplicates(subset='url', keep='first')

which of course does not take the "keep_if_dup" field into account. The output is:

id    url     keep_if_dup
1     A.com   Yes
3     B.com   No
5     C.com   No
Vincent

1 Answer

You can combine multiple boolean conditions inside loc. The first condition keeps every row where 'keep_if_dup' == 'Yes'. It is ORed (using |) with the inverted boolean mask from duplicated() on the 'url' column, which is True for the first occurrence of each url:

In [79]:
df.loc[(df['keep_if_dup'] =='Yes') | ~df['url'].duplicated()]

Out[79]:
   id    url keep_if_dup
0   1  A.com         Yes
1   2  A.com         Yes
2   3  B.com          No
4   5  C.com          No

This returns a new DataFrame and leaves df unchanged. To overwrite your df, assign the result back:

df = df.loc[(df['keep_if_dup'] =='Yes') | ~df['url'].duplicated()]

Breaking the expression down shows the two boolean masks:

In [80]:
~df['url'].duplicated()

Out[80]:
0     True
1    False
2     True
3    False
4     True
Name: url, dtype: bool

In [81]:
df['keep_if_dup'] =='Yes'

Out[81]:
0     True
1     True
2    False
3    False
4    False
Name: keep_if_dup, dtype: bool
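
For anyone who wants to reproduce this end to end, here is a minimal, self-contained sketch that rebuilds the sample data from the question and applies the same two masks. The variable names `flagged`, `first_seen` and `result` are only illustrative and are not part of the original answer; `keep="first"` is the default for `duplicated()` and is spelled out only for readability.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "url": ["A.com", "A.com", "B.com", "B.com", "C.com"],
    "keep_if_dup": ["Yes", "Yes", "No", "No", "No"],
})

# True for rows explicitly flagged to be kept even when they are duplicates.
flagged = df["keep_if_dup"] == "Yes"

# duplicated() marks every repeat after the first occurrence of a url,
# so inverting it gives True for the first occurrence only.
first_seen = ~df["url"].duplicated(keep="first")

# Keep a row if either condition holds.
result = df.loc[flagged | first_seen]
print(result)
```

Read aloud, the filter says "keep the row if it is flagged 'Yes' OR it is the first row seen for its url", so only unflagged repeats are dropped (here, the row with id 4).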
EdChum
  • Thanks. Does this command modify the "df" dataframe, or should I use df = df.loc[...] to save it? – Vincent Jul 26 '16 at 08:00
  • If you want to overwrite your df then assign the result back, so `df = df.loc....` – EdChum Jul 26 '16 at 08:01
  • OK, that's what I meant, thanks for editing your answer – Vincent Jul 26 '16 at 08:14
  • @EdChum this is a fantastic answer that helped me solve a similar problem. I'm still trying to wrap my head around the "or" with the inverted boolean expression. Does this say `if keep_if_dup == 'Yes' or df['url'] is not duplicated`? I was looking for a solution similar to `if keep_if_dup == 'Yes' then drop duplicates`, so I'm just wrapping my mind around it. Thank you very much! – Iris D Jul 27 '21 at 15:30