
I'm familiar with how to drop duplicate rows and with the keep parameter of 'first', 'last', or False. Nothing too complicated there, and there are plenty of examples (e.g. here).
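
For context, here is a minimal sketch of the standard behaviour I mean (the toy column and the name `tmp` are purely illustrative):

import pandas as pd

# toy column with duplicated values
tmp = pd.DataFrame({'cost': ['1', '2', '2', '3', '3', '3']})

print(tmp.drop_duplicates('cost', keep='first'))  # keeps rows 0, 1, 3
print(tmp.drop_duplicates('cost', keep='last'))   # keeps rows 0, 2, 5
print(tmp.drop_duplicates('cost', keep=False))    # keeps only row 0 (values that never repeat)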

However, what I'm looking for is a way to find the duplicates and, instead of dropping all duplicates and keeping the first, keep all the duplicates but drop the first:

So instead of "drop if duplicates, keep the first", I want "keep if duplicates, drop first"

Example:

Given this dataframe, and looking at duplicates in cost column:

    ID name type cost
0    0    a   bb    1
1    1    a   cc    2 <--- there are duplicates, so drop this row
2  1_0    a   dd    2
3    2    a   ee    3 <--- there are duplicates, so drop this row
4  2_0    a   ff    3
5  2_1    a   gg    3
6  2_2    a   hh    3

If there are duplicates in the cost column, just drop the first occurrence, but keep the rest.

So my output would be:

    ID name type cost
0    0    a   bb    1
2  1_0    a   dd    2
4  2_0    a   ff    3
5  2_1    a   gg    3
6  2_2    a   hh    3

Here's the sample dataframe:

import pandas as pd

df = pd.DataFrame([
    ['0',   'a', 'bb', '1'],
    ['1',   'a', 'cc', '2'],
    ['1_0', 'a', 'dd', '2'],
    ['2',   'a', 'ee', '3'],
    ['2_0', 'a', 'ff', '3'],
    ['2_1', 'a', 'gg', '3'],
    ['2_2', 'a', 'hh', '3']], columns=['ID', 'name', 'type', 'cost'])
chitown88

4 Answers


You can chain 2 masks created by DataFrame.duplicated with bitwise OR and filter by boolean indexing:

df = df[df.duplicated('cost') | ~df.duplicated('cost', keep=False)]
print (df)
    ID name type cost
0    0    a   bb    1
2  1_0    a   dd    2
4  2_0    a   ff    3
5  2_1    a   gg    3
6  2_2    a   hh    3

Detail:

print (df.assign(mask1=df.duplicated('cost'), mask2=~df.duplicated('cost', keep=False)))
    ID name type cost  mask1  mask2
0    0    a   bb    1  False   True
1    1    a   cc    2  False  False
2  1_0    a   dd    2   True  False
3    2    a   ee    3  False  False
4  2_0    a   ff    3   True  False
5  2_1    a   gg    3   True  False
6  2_2    a   hh    3   True  False
jezrael
    ah ok. Then it just drops all rows where both are False. got it. – chitown88 Mar 12 '19 at 12:55
  • I'm not sure but I think this is meant to be `df[~df.duplicated('cost') & df.duplicated('cost', keep=False)]`. This achieves the question: "Keep all duplicates, but not the first instance" – defraggled Jan 08 '22 at 12:39

You can use groupby and pass a lambda function that grabs the records after the first one whenever a group contains duplicates:

>>> df.groupby('cost').apply(lambda group: group.iloc[1:] if len(group) > 1 else group).reset_index(drop=True)
    ID  cost name type
0    0     1    a   bb
1  1_0     2    a   dd
2  2_0     3    a   ff
3  2_1     3    a   gg
4  2_2     3    a   hh
Scratch'N'Purr
  • helpful for more custom cases where you want to keep only particular members of the group (by modifying `group.iloc` slice) – irene Aug 27 '22 at 04:53
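
For instance, a minimal sketch of that kind of variation, keeping everything after the first two members of each group (assuming the question's `df` is already defined; the cut-off of 2 is arbitrary):

out = (df.groupby('cost')
         .apply(lambda group: group.iloc[2:] if len(group) > 2 else group)
         .reset_index(drop=True))
# for cost 3 only 2_1 and 2_2 remain; the smaller groups are kept whole
print(out)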

You can do this with the XOR (^) operator: it is True only when exactly one of the two masks is True, which here happens only for the first occurrence of each duplicated cost. Negating the result with the NOT (~) operator keeps every other row:

df[~(df.cost.duplicated(keep=False) ^ df.cost.duplicated())]

Output

    ID name type cost
0    0    a   bb    1
2  1_0    a   dd    2
4  2_0    a   ff    3
5  2_1    a   gg    3
6  2_2    a   hh    3
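
To see why this works, here is a sketch of the two masks and their XOR (assuming the question's sample `df`); only the first occurrence of each duplicated cost has exactly one mask True, so ~ drops just those rows:

detail = df.assign(all_dupes=df.cost.duplicated(keep=False),  # True for every row whose cost repeats
                   not_first=df.cost.duplicated())            # True for repeats except the first occurrence
detail['xor'] = detail['all_dupes'] ^ detail['not_first']     # True only for the first occurrence of a dupe
print(detail)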
Erfan

You can use the following code:

# Import pandas library
import pandas as pd

# initialize list of lists so I can create duplicate data
data = [['tom', 10], ['nick', 15], ['juli', 14], ['nick', 15], ['julia', 140],
        ['tom', 10], ['tom', 10], ['tom', 10]]

# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Name', 'Age'])

# print dataframe
print(df)

# Now the logic begins from here

# remember the column names so the filtered frame keeps them
colnames = list(df.columns)

# work on the rows as plain Python lists
listdf = df.values.tolist()
temp = []

# keep one copy of every row that appears more than once
for i in range(0, len(listdf)):
    if listdf.count(listdf[i]) > 1 and listdf[i] not in temp:
        temp.append(listdf[i])

df = pd.DataFrame(temp, columns=colnames)

print("dataframe with only duplicates")
print(df)
R.singh