
I got the task of highlighting all email duplicates in a pandas DataFrame. Is there a function for this, or a way to drop all the NON-duplicates, which would leave me with a nice list of all the duplicates in the dataset?

The table consists of six columns:

Email, FirstName, LastName, C_ID, A_ID, CreatedDate
a@a.com, Bill, Schneider, 123, 321, 20190502
a@a.com, Damian, Schneider, 124, 231, 20190502
b@b.com, Bill, Schneider, 164, 313, 20190503

I want to get rid of the last row, as the last email is NOT a duplicate.

Lekü
  • Define what you mean by 'duplicates': do you only mean 'Email' is identical, or do you mean "either Email is identical, or both FirstName and LastName are identical"? (e.g. what if FirstName=='William' and LastName=='Schneider'?) – smci Jan 21 '21 at 18:30
  • *"The table consists of six columns"* ... *"I want to get rid of the last column..."* you mean 'row'! – smci Jan 21 '21 at 18:31
  • df.duplicated(keep=False) will give you the full list. If you want to keep one row per group, keep='first' keeps the first occurrence and marks the others as duplicates; keep='last' does the same but spares the last occurrence. If you want to check a specific column, use subset=['colname1']. If you want to remove them, you can use drop_duplicates() (see the sketch after these comments). See the pandas documentation for more details on these two. – Joe Ferndz Jan 21 '21 at 18:39
  • Guys, please stop posting duplicate answers. SO already has [3881 Q&A on *\[pandas\] drop_duplicates*](https://stackoverflow.com/search?q=%5Bpandas%5D+drop_duplicates+), and more on *'unique'*, *'distinct'*, etc. So figure out which among those this question should be closed into. – smci Jan 21 '21 at 18:45
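
A minimal sketch of those `keep` options, assuming the question's sample columns:

import pandas as pd

# The question's sample rows
df = pd.DataFrame({
    'Email': ['a@a.com', 'a@a.com', 'b@b.com'],
    'FirstName': ['Bill', 'Damian', 'Bill'],
    'LastName': ['Schneider', 'Schneider', 'Schneider'],
})

# keep=False marks every member of a duplicate group
print(df[df.duplicated(subset=['Email'], keep=False)])    # both a@a.com rows

# keep='first' (the default) marks all but the first occurrence
print(df[df.duplicated(subset=['Email'], keep='first')])  # only the Damian row

# drop_duplicates removes the flagged rows instead of selecting them
print(df.drop_duplicates(subset=['Email']))               # one a@a.com row plus b@b.com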

3 Answers


Something like this might be the solution you're looking for:

import pandas as pd

# Rows from the question's sample table
data = [
    ('a@a.com', 'Bill', 'Schneider', 123, 321, 20190502),
    ('a@a.com', 'Damian', 'Schneider', 124, 231, 20190502),
    ('b@b.com', 'Bill', 'Schneider', 164, 313, 20190503),
]

# Create a DataFrame object
df = pd.DataFrame(data, columns=['email', 'first name', 'last name', 'C_ID', 'A_ID', 'CreatedDate'])

# Select rows whose email already appeared earlier in the frame
# (the first occurrence itself is not flagged)
df_duplicates = df[df.email.duplicated()]
print(df_duplicates)
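
Since the question asks for the full list of duplicates, a `keep=False` variant (an addition, not part of the original answer) flags every occurrence, including the first:

# keep=False flags every occurrence, including the first
df_all_duplicates = df[df.email.duplicated(keep=False)]
print(df_all_duplicates)  # both a@a.com rows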
Hayden Eastwood

Where `table` is the name of your original DataFrame:

# Keep only the Email column
df = pd.DataFrame(table, columns=['Email'])

# Drop repeated emails, keeping the first occurrence of each
df_duplicates_removed = df.drop_duplicates()
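
As a usage sketch (the `table` below is a hypothetical stand-in built from the question's emails), note this keeps one row per unique email rather than selecting the duplicates:

import pandas as pd

# Hypothetical stand-in for the original DataFrame
table = pd.DataFrame({'Email': ['a@a.com', 'a@a.com', 'b@b.com']})

df = pd.DataFrame(table, columns=['Email'])
print(df.drop_duplicates())  # a@a.com appears once, b@b.com once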


You could use value_counts, which gives you the count for each email (as a Series). Then iterate through the Series and drop any rows whose email appears only once.

The full code would be something like:

# Count occurrences of each email, then drop emails that appear only once
# (iteritems() was removed in pandas 2.0; use items() instead)
for index, value in df.Email.value_counts().items():
    if value == 1:
        df = df[df.Email != index]

UPDATE: I didn't know about duplicated until it was pointed out, so it looks like the best way to do this is:

df[df.Email.duplicated(keep=False)] 
Hugh Ward
  • Use duplicated instead – Joe Ferndz Jan 21 '21 at 18:31
  • `value_counts()` is just an inefficient, poorly-scaling alternative to `drop_duplicates`/`duplicated`/`unique`. We don't need to count all the frequencies, just a binary of whether each value has count > 1 or not. – smci Jan 21 '21 at 19:33
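
A minimal sketch of that contrast, assuming a frame `df` with an `Email` column: `duplicated` builds the boolean mask directly, while the `value_counts` route computes a full frequency table first.

import pandas as pd

df = pd.DataFrame({'Email': ['a@a.com', 'a@a.com', 'b@b.com']})

# Vectorized boolean mask: True for every email that occurs more than once
mask_duplicated = df.Email.duplicated(keep=False)

# Same result via value_counts: count all frequencies, then map each
# row's email back to its count
counts = df.Email.value_counts()
mask_value_counts = df.Email.map(counts) > 1

# Both select the same rows
assert mask_duplicated.equals(mask_value_counts)
print(df[mask_duplicated])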