
I'm familiar with how to drop duplicate rows and with the keep parameter of 'first', 'last', or False. Nothing too complicated there, and there are plenty of examples (e.g. here).
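
For context, here is a minimal sketch of the standard behaviour I mean (the toy column and the name `tmp` are purely illustrative):

import pandas as pd

# toy column with duplicated values
tmp = pd.DataFrame({'cost': ['1', '2', '2', '3', '3', '3']})

print(tmp.drop_duplicates('cost', keep='first'))  # keeps rows 0, 1, 3
print(tmp.drop_duplicates('cost', keep='last'))   # keeps rows 0, 2, 5
print(tmp.drop_duplicates('cost', keep=False))    # keeps only row 0 (values that never repeat)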

However, what I'm looking for is a way to find the duplicates and, instead of dropping all duplicates and keeping the first, keep all the duplicates but drop the first:

So instead of "drop if duplicates, keep the first", I want "keep if duplicates, drop first"

Example:

Given this dataframe, and looking at duplicates in cost column:

    ID name type cost
0    0    a   bb    1
1    1    a   cc    2 <--- there are duplicates, so drop this row
2  1_0    a   dd    2
3    2    a   ee    3 <--- there are duplicates, so drop this row
4  2_0    a   ff    3
5  2_1    a   gg    3
6  2_2    a   hh    3

If there are duplicates in the cost column, just drop the first occurrence, but keep the rest.

So my output would be:

    ID name type cost
0    0    a   bb    1
2  1_0    a   dd    2
4  2_0    a   ff    3
5  2_1    a   gg    3
6  2_2    a   hh    3

Here's the sample dataframe:

import pandas as pd

df = pd.DataFrame([
    ['0',   'a', 'bb', '1'],
    ['1',   'a', 'cc', '2'],
    ['1_0', 'a', 'dd', '2'],
    ['2',   'a', 'ee', '3'],
    ['2_0', 'a', 'ff', '3'],
    ['2_1', 'a', 'gg', '3'],
    ['2_2', 'a', 'hh', '3']], columns=['ID', 'name', 'type', 'cost'])
chitown88

4 Answers


You can chain 2 masks created by DataFrame.duplicated with bitwise OR and filter by boolean indexing:

df = df[df.duplicated('cost') | ~df.duplicated('cost', keep=False)]
print (df)
    ID name type cost
0    0    a   bb    1
2  1_0    a   dd    2
4  2_0    a   ff    3
5  2_1    a   gg    3
6  2_2    a   hh    3

Detail:

print (df.assign(mask1=df.duplicated('cost'), mask2=~df.duplicated('cost', keep=False)))
    ID name type cost  mask1  mask2
0    0    a   bb    1  False   True
1    1    a   cc    2  False  False
2  1_0    a   dd    2   True  False
3    2    a   ee    3  False  False
4  2_0    a   ff    3   True  False
5  2_1    a   gg    3   True  False
6  2_2    a   hh    3   True  False
jezrael
    ah ok. Then it just drops all rows where both are False. got it. – chitown88 Mar 12 '19 at 12:55
  • I'm not sure but I think this is meant to be `df[~df.duplicated('cost') & df.duplicated('cost', keep=False)]`. This achieves the question: "Keep all duplicates, but not the first instance" – defraggled Jan 08 '22 at 12:39

You can use groupby and pass a lambda function that grabs the records after the first one whenever a group contains duplicates:

>>> df.groupby('cost').apply(lambda group: group.iloc[1:] if len(group) > 1 else group).reset_index(drop=True)
    ID  cost name type
0    0     1    a   bb
1  1_0     2    a   dd
2  2_0     3    a   ff
3  2_1     3    a   gg
4  2_2     3    a   hh
Scratch'N'Purr
  • helpful for more custom cases where you want to keep only particular members of the group (by modifying `group.iloc` slice) – irene Aug 27 '22 at 04:53
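
For instance, a minimal sketch of that kind of variation, keeping everything after the first two members of each group (assuming the question's `df` is already defined; the cut-off of 2 is arbitrary):

out = (df.groupby('cost')
         .apply(lambda group: group.iloc[2:] if len(group) > 2 else group)
         .reset_index(drop=True))
# for cost 3 only 2_1 and 2_2 remain; the smaller groups are kept whole
print(out)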

You can do this with the XOR (^) operator: it is True only when exactly one of the two masks is True, which here happens only for the first occurrence of each duplicated cost. Negating the result with the NOT (~) operator keeps every other row:

df[~(df.cost.duplicated(keep=False) ^ df.cost.duplicated())]

Output

    ID name type cost
0    0    a   bb    1
2  1_0    a   dd    2
4  2_0    a   ff    3
5  2_1    a   gg    3
6  2_2    a   hh    3
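
To see why this works, here is a sketch of the two masks and their XOR (assuming the question's sample `df`); only the first occurrence of each duplicated cost has exactly one mask True, so ~ drops just those rows:

detail = df.assign(all_dupes=df.cost.duplicated(keep=False),  # True for every row whose cost repeats
                   not_first=df.cost.duplicated())            # True for repeats except the first occurrence
detail['xor'] = detail['all_dupes'] ^ detail['not_first']     # True only for the first occurrence of a dupe
print(detail)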
Erfan

You can use the following code:

# Import pandas library
import pandas as pd

# initialize list of lists so I can create duplicate data
data = [['tom', 10], ['nick', 15], ['juli', 14], ['nick', 15], ['julia', 140],
        ['tom', 10], ['tom', 10], ['tom', 10]]

# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Name', 'Age'])

# print dataframe
print(df)

# Now the logic begins from here

# remember the column names so the filtered frame keeps them
colnames = list(df.columns)

# work on the rows as plain Python lists
listdf = df.values.tolist()
temp = []

# keep one copy of every row that appears more than once
for i in range(0, len(listdf)):
    if listdf.count(listdf[i]) > 1 and listdf[i] not in temp:
        temp.append(listdf[i])

df = pd.DataFrame(temp, columns=colnames)

print("dataframe with only duplicates")
print(df)
R.singh