How to delete rows with duplicate values in one column in pandas

Question

I need to check whether one column of a pandas DataFrame contains duplicate values and, if it does, delete the entire row for each duplicate. Only the first column should be checked.

Example:

object    type

apple     fruit
ball      toy
banana    fruit
xbox      videogame
banana    fruit
apple     fruit

What I need is:

object    type

apple     fruit
ball      toy
banana    fruit
xbox      videogame

I can remove the duplicates from the 'object' column with the following code, but I can't delete the entire row containing the duplicate, because the second column is left untouched.


import pandas as pd

df = pd.read_csv(directory, header=None)

objects = df[0]

for obj in df[0]:
    ...  # the loop body is not shown in the post

Potential duplicate of: https://stackoverflow.com/questions/50885093/how-do-i-remove-rows-with-duplicate-values-of-columns-in-pandas-data-frame (comment, Jun 15 '21 at 15:47)

score 0 · Accepted Answer · answered Jun 15 '21 at 15:43

Build a boolean mask with duplicated() on the column and negate it, so that only the first occurrence of each value is kept:

df = df[~df["object"].duplicated()]

Which gives:

   object       type
0   apple      fruit
1    ball        toy
2  banana      fruit
3    xbox  videogame
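
To see why negating the mask works, it can help to print the mask itself. The following is a minimal sketch that rebuilds the question's data by hand; the names mask and deduped are only illustrative and do not appear in the answer.

```python
import pandas as pd

df = pd.DataFrame({
    "object": ["apple", "ball", "banana", "xbox", "banana", "apple"],
    "type": ["fruit", "toy", "fruit", "videogame", "fruit", "fruit"],
})

# duplicated() flags every repeat of a value that was already seen;
# the first occurrence stays False (the default is keep="first").
mask = df["object"].duplicated()
print(mask.tolist())

# Negating the mask selects the whole row of each first occurrence,
# so the 'type' column is filtered together with 'object'.
deduped = df[~mask]
print(deduped)
```

Passing keep="last" or keep=False to duplicated() changes which occurrences are flagged, if a different row should survive.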

answered Jun 15 '21 at 15:43

crayxt

score 0 · Answer 2 · answered Jun 15 '21 at 15:45

Use the drop_duplicates method:

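import pandas as pd

d = pd.DataFrame(
    {'object': ['apple', 'ball', 'banana', 'xbox', 'banana', 'apple'],
     'type': ['fruit', 'toy', 'fruit', 'videogame', 'fruit', 'fruit']}
)
# Without subset=, a row is dropped only if it matches an earlier row in every column.
d.drop_duplicates()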

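There are several keyword arguments that might come in handy: subset restricts the check to specific columns, keep chooses which occurrence survives, and inplace=True updates your dataframe d directly instead of returning a new one.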
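
As a rough illustration of those keyword arguments, the sketch below applies keep and inplace to the same data. The names last_kept, no_repeats and result are made up for this example.

```python
import pandas as pd

d = pd.DataFrame(
    {'object': ['apple', 'ball', 'banana', 'xbox', 'banana', 'apple'],
     'type': ['fruit', 'toy', 'fruit', 'videogame', 'fruit', 'fruit']}
)

# keep="last" retains the final occurrence of each duplicated row
# instead of the first one.
last_kept = d.drop_duplicates(keep="last")

# keep=False drops every row that has a duplicate anywhere in the frame.
no_repeats = d.drop_duplicates(keep=False)

# inplace=True modifies d itself; the call then returns None.
result = d.drop_duplicates(inplace=True)

print(last_kept)
print(no_repeats)
print(d)
```

Note that the original index labels are preserved in each case; chaining .reset_index(drop=True) renumbers the rows if that matters.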

SeaBean · Answer 3 · 2021-06-15T15:53:25.750

You can use .drop_duplicates() with parameter subset='object' to select the column you want to check, as follows:

df_out = df.drop_duplicates(subset='object')

Result:

print(df_out)

   object       type
0   apple      fruit
1    ball        toy
2  banana      fruit
3    xbox  videogame
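
The question reads its file with header=None, in which case the columns are labelled 0 and 1 rather than 'object' and 'type'. Here is a hedged sketch of the same call for that situation. It assumes a comma-separated file with no header row, and csv_text is a hypothetical stand-in for the real file contents.

```python
import io

import pandas as pd

# Hypothetical stand-in for the question's file: no header row, two columns.
csv_text = (
    "apple,fruit\n"
    "ball,toy\n"
    "banana,fruit\n"
    "xbox,videogame\n"
    "banana,fruit\n"
    "apple,fruit\n"
)

df = pd.read_csv(io.StringIO(csv_text), header=None)

# With header=None the first column is labelled 0, so that label goes in subset.
df_out = df.drop_duplicates(subset=[0])

# Optional: renumber the remaining rows from 0.
df_out = df_out.reset_index(drop=True)
print(df_out)
```

If the file does have a header line, dropping header=None lets subset='object' work as in the answer above.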

edited Jun 15 '21 at 15:53

answered Jun 15 '21 at 15:47

SeaBean

score 0 · Answer 4 · answered Sep 15 '22 at 04:55

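To count how many duplicate rows would be dropped, subtract the deduplicated length from the original length (use a new variable so that df itself is not overwritten):

n_duplicates = len(df) - len(df.drop_duplicates())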
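
A related count can be limited to a single column by summing the duplicated() mask. This is a sketch on the question's data, not part of the answer, and the names n_full_row and n_object are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "object": ["apple", "ball", "banana", "xbox", "banana", "apple"],
    "type": ["fruit", "toy", "fruit", "videogame", "fruit", "fruit"],
})

# Rows that match an earlier row in every column.
n_full_row = len(df) - len(df.drop_duplicates())

# Rows whose 'object' value already appeared earlier, whatever 'type' says.
n_object = int(df["object"].duplicated().sum())

print(n_full_row, n_object)
```

The two counts agree for this data because the repeated rows are identical in both columns; they can differ when only the 'object' value repeats.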

answered Sep 15 '22 at 04:55

Derrick Kuria
