
I got the task of highlighting all email duplicates in a pandas DataFrame. Is there a function for this, or a way to drop all the NON-duplicates, which would leave me with a nice list of all the duplicates in the dataset?

The table consists of six columns:

Email, FirstName, LastName, C_ID, A_ID, CreatedDate
a@a.com, Bill, Schneider, 123, 321, 20190502
a@a.com, Damian, Schneider, 124, 231, 20190502
b@b.com, Bill, Schneider, 164, 313, 20190503

I want to get rid of the last row, as the last email is NOT a duplicate.

Lekü
  • Define what you mean by 'duplicates': do you only mean 'Email' is identical, or do you mean "either Email is identical, or both FirstName and LastName are identical"? (e.g. what if FirstName=='William' and LastName=='Schneider'?) – smci Jan 21 '21 at 18:30
  • *"The table consists of six columns"* ... *"I want to get rid of the last column..."* you mean 'row'! – smci Jan 21 '21 at 18:31
  • df.duplicated(keep=False) will give you the full list. If you want to keep one row per group, keep='first' keeps the first occurrence and marks the others as duplicates; keep='last' does the same but spares the last occurrence. If you want to check a specific column, use subset=['colname1']. If you want to remove them, you can use drop_duplicates() (see the sketch after these comments). See the pandas documentation for more details on these two. – Joe Ferndz Jan 21 '21 at 18:39
  • Guys, please stop posting duplicate answers. SO already has [3881 Q&A on *\[pandas\] drop_duplicates*](https://stackoverflow.com/search?q=%5Bpandas%5D+drop_duplicates+), and more on *'unique'*, *'distinct'*, etc. So figure out which among those this question should be closed into. – smci Jan 21 '21 at 18:45
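
A minimal sketch of those `keep` options, assuming the question's sample columns:

import pandas as pd

# The question's sample rows
df = pd.DataFrame({
    'Email': ['a@a.com', 'a@a.com', 'b@b.com'],
    'FirstName': ['Bill', 'Damian', 'Bill'],
    'LastName': ['Schneider', 'Schneider', 'Schneider'],
})

# keep=False marks every member of a duplicate group
print(df[df.duplicated(subset=['Email'], keep=False)])    # both a@a.com rows

# keep='first' (the default) marks all but the first occurrence
print(df[df.duplicated(subset=['Email'], keep='first')])  # only the Damian row

# drop_duplicates removes the flagged rows instead of selecting them
print(df.drop_duplicates(subset=['Email']))               # one a@a.com row plus b@b.com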

3 Answers


Something like this might be the solution you're looking for:

import pandas as pd

# Rows from the question's sample table
data = [
    ('a@a.com', 'Bill', 'Schneider', 123, 321, 20190502),
    ('a@a.com', 'Damian', 'Schneider', 124, 231, 20190502),
    ('b@b.com', 'Bill', 'Schneider', 164, 313, 20190503),
]

# Create a DataFrame object
df = pd.DataFrame(data, columns=['email', 'first name', 'last name', 'C_ID', 'A_ID', 'CreatedDate'])

# Select rows whose email already appeared earlier in the frame
# (the first occurrence itself is not flagged)
df_duplicates = df[df.email.duplicated()]
print(df_duplicates)
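
Since the question asks for the full list of duplicates, a `keep=False` variant (an addition, not part of the original answer) flags every occurrence, including the first:

# keep=False flags every occurrence, including the first
df_all_duplicates = df[df.email.duplicated(keep=False)]
print(df_all_duplicates)  # both a@a.com rows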
Hayden Eastwood

Where `table` is the name of your original DataFrame:

# Keep only the Email column
df = pd.DataFrame(table, columns=['Email'])

# Drop repeated emails, keeping the first occurrence of each
df_duplicates_removed = df.drop_duplicates()
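
As a usage sketch (the `table` below is a hypothetical stand-in built from the question's emails), note this keeps one row per unique email rather than selecting the duplicates:

import pandas as pd

# Hypothetical stand-in for the original DataFrame
table = pd.DataFrame({'Email': ['a@a.com', 'a@a.com', 'b@b.com']})

df = pd.DataFrame(table, columns=['Email'])
print(df.drop_duplicates())  # a@a.com appears once, b@b.com once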


You could use value_counts, which gives you the count for each email (as a Series). Then iterate through the Series and drop any rows whose email appears only once.

The full code would be something like:

# Count occurrences of each email, then drop emails that appear only once
# (iteritems() was removed in pandas 2.0; use items() instead)
for index, value in df.Email.value_counts().items():
    if value == 1:
        df = df[df.Email != index]

UPDATE: I didn't know about duplicated until it was pointed out, so it looks like the best way to do this is:

df[df.Email.duplicated(keep=False)] 
Hugh Ward
  • Use duplicated instead – Joe Ferndz Jan 21 '21 at 18:31
  • `value_counts()` is just an inefficient, poorly-scaling alternative to `drop_duplicates`/`duplicated`/`unique`. We don't need to count all the frequencies, just a binary of whether each value has count > 1 or not. – smci Jan 21 '21 at 19:33
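
A minimal sketch of that contrast, assuming a frame `df` with an `Email` column: `duplicated` builds the boolean mask directly, while the `value_counts` route computes a full frequency table first.

import pandas as pd

df = pd.DataFrame({'Email': ['a@a.com', 'a@a.com', 'b@b.com']})

# Vectorized boolean mask: True for every email that occurs more than once
mask_duplicated = df.Email.duplicated(keep=False)

# Same result via value_counts: count all frequencies, then map each
# row's email back to its count
counts = df.Email.value_counts()
mask_value_counts = df.Email.map(counts) > 1

# Both select the same rows
assert mask_duplicated.equals(mask_value_counts)
print(df[mask_duplicated])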