
Consider the following DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 3, 4, 4, 5, 6, 7],
                   'B': ['a', 'b', 'c', 'c', 'd', 'd', 'e', 'f', 'g'],
                   'Col_1': [np.nan, 'A', 'A', np.nan, 'B', np.nan, 'B', np.nan, np.nan],
                   'Col_2': [2, 2, 3, 3, 3, 3, 4, 4, 5]})
df
Out[92]: 
    A  B Col_1  Col_2
 0  1  a   NaN      2
 1  2  b     A      2
 2  3  c     A      3
 3  3  c   NaN      3
 4  4  d     B      3
 5  4  d   NaN      3
 6  5  e     B      4
 7  6  f   NaN      4
 8  7  g   NaN      5

I want to remove all rows which are duplicates with regard to columns 'A' and 'B', keeping from each pair the entry that does not have a NaN in 'Col_1' (I know that for every duplicate pair there will be one NaN and one non-NaN entry). The end result should look like this:

    A  B Col_1  Col_2
 0  1  a   NaN      2
 1  2  b     A      2
 2  3  c     A      3
 4  4  d     B      3
 6  5  e     B      4
 7  6  f   NaN      4
 8  7  g   NaN      5

Efficient one-liners are most welcome.

cs95
mortysporty

3 Answers


If the goal is to drop only the NaN duplicates, a slightly more involved solution is needed.

First, sort on A, B, and Col_1, so that for each group the NaN is moved to the bottom (`sort_values` places NaNs last by default, per `na_position='last'`). Then call `df.drop_duplicates` with `keep='first'`:

out = df.sort_values(['A', 'B', 'Col_1']).drop_duplicates(['A', 'B'], keep='first')
print(out)

   A  B Col_1  Col_2
0  1  a   NaN      2
1  2  b     A      2
2  3  c     A      3
4  4  d     B      3
6  5  e     B      4
7  6  f   NaN      4
8  7  g   NaN      5
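
In this particular example the sorted frame happens to keep the original index order, but in general `sort_values` can shuffle the rows. If preserving the input order matters, a `sort_index()` at the end restores it; a minimal sketch of that variant:

# Sort so the non-NaN Col_1 value comes first within each (A, B) group,
# drop the NaN duplicates, then restore the original row order.
out = (df.sort_values(['A', 'B', 'Col_1'])
         .drop_duplicates(['A', 'B'], keep='first')
         .sort_index())
print(out)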
cs95

Here's an alternative:

df[~((df[['A', 'B']].duplicated(keep=False)) & (df.isnull().any(axis=1)))]
#    A  B Col_1  Col_2
# 0  1  a   NaN      2
# 1  2  b     A      2
# 2  3  c     A      3
# 4  4  d     B      3
# 6  5  e     B      4
# 7  6  f   NaN      4
# 8  7  g   NaN      5

This uses the bitwise "not" operator ~ to negate rows that meet the joint condition of being a duplicate row (the argument keep=False makes the method return True for all non-unique rows) and containing at least one null value. So where the expression df[['A', 'B']].duplicated(keep=False) returns this Series:

# 0    False
# 1    False
# 2     True
# 3     True
# 4     True
# 5     True
# 6    False
# 7    False
# 8    False

...and the expression df.isnull().any(axis=1) returns this Series:

# 0     True
# 1    False
# 2    False
# 3     True
# 4    False
# 5     True
# 6    False
# 7     True
# 8     True

... we wrap both in parentheses (required whenever combining conditions in indexing operations, because & and | bind more tightly than the comparisons around them), and then wrap them in parentheses again so that we can negate the entire expression (i.e. ~( ... )), like so:

~((df[['A','B']].duplicated(keep=False)) & (df.isnull().any(axis=1)))

# 0     True
# 1     True
# 2     True
# 3    False
# 4     True
# 5    False
# 6     True
# 7     True
# 8     True

You can build more complex conditions with further use of the logical operators & and | (the "or" operator). As with SQL, group your conditions as necessary with additional parentheses; for instance, filter based on the logic "both condition X AND condition Y are true, or condition Z is true" with df[ ( (X) & (Y) ) | (Z) ].
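
As a minimal sketch of that grouping (the `Col_2 < 4` threshold here is an arbitrary illustrative condition, not part of the original question):

# Keep rows that are NOT duplicated on ['A', 'B'], OR that have Col_2 below 4.
# Each condition is parenthesized before combining with |.
mask = (~df[['A', 'B']].duplicated(keep=False)) | (df['Col_2'] < 4)
df[mask]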

cmaher
  • I think you need to pass `keep=False` to `duplicated` for this to work. – ayhan Aug 27 '17 at 16:52
  • @ayhan I was just thinking the same thing :) – cmaher Aug 27 '17 at 16:52
  • Hi. Would this work if there was some other value than `NaN` we wanted to get rid of? Could we just fix the argument after `&`? – mortysporty Aug 27 '17 at 17:10
  • @mortysporty yes, that's basically right -- I should caveat, though, that depending on how you're testing for that value, it's probably easiest if you un-group the conditions (i.e. remove the outer parentheses) so that you can do something like `~(df.duplicated) & (df.Col_2 != 5)`. If you directly substitute `df.Col_2 != 5` into the one-liner above, it will be negated (i.e. True becomes False and vice-versa) because of the way the two current conditions are grouped inside `~( ... )`. – cmaher Aug 27 '17 at 19:03
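
Following up on the last comment, a minimal sketch of that un-grouped form, with `df['Col_2'] != 5` as the illustrative extra condition (kept outside the ~( ... ) so it is not itself negated):

# Drop the NaN-bearing duplicates AND any row where Col_2 equals 5.
mask = ~((df[['A', 'B']].duplicated(keep=False)) & (df.isnull().any(axis=1))) & (df['Col_2'] != 5)
df[mask]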

Alternatively, you can use groupby with first(): for each column, first() returns the first non-null value within each group, so the order of the original input does not really matter.

df.groupby(['A','B']).first()

Out[180]: 
    Col_1  Col_2
A B             
1 a   NaN      2
2 b     A      2
3 c     A      3
4 d     B      3
5 e     B      4
6 f   NaN      4
7 g   NaN      5
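
Note that groupby moves A and B into the index. To get them back as ordinary columns, matching the column layout of the expected output in the question, chain reset_index() (or pass as_index=False to groupby); a small sketch:

# Restore A and B as regular columns after the groupby.
out = df.groupby(['A', 'B']).first().reset_index()
print(out)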
BENY