283

I have a list of items that likely has some export issues. I would like to get a list of the duplicate items so I can manually compare them. When I try to use pandas' `duplicated` method, it only returns the first duplicate. Is there a way to get all of the duplicates and not just the first one?

A small subsection of my dataset looks like this:

ID,ENROLLMENT_DATE,TRAINER_MANAGING,TRAINER_OPERATOR,FIRST_VISIT_DATE
1536D,12-Feb-12,"06DA1B3-Lebanon NH",,15-Feb-12
F15D,18-May-12,"06405B2-Lebanon NH",,25-Jul-12
8096,8-Aug-12,"0643D38-Hanover NH","0643D38-Hanover NH",25-Jun-12
A036,1-Apr-12,"06CB8CF-Hanover NH","06CB8CF-Hanover NH",9-Aug-12
8944,19-Feb-12,"06D26AD-Hanover NH",,4-Feb-12
1004E,8-Jun-12,"06388B2-Lebanon NH",,24-Dec-11
11795,3-Jul-12,"0649597-White River VT","0649597-White River VT",30-Mar-12
30D7,11-Nov-12,"06D95A3-Hanover NH","06D95A3-Hanover NH",30-Nov-11
3AE2,21-Feb-12,"06405B2-Lebanon NH",,26-Oct-12
B0FE,17-Feb-12,"06D1B9D-Hartland VT",,16-Feb-12
127A1,11-Dec-11,"064456E-Hanover NH","064456E-Hanover NH",11-Nov-12
161FF,20-Feb-12,"0643D38-Hanover NH","0643D38-Hanover NH",3-Jul-12
A036,30-Nov-11,"063B208-Randolph VT","063B208-Randolph VT",
475B,25-Sep-12,"06D26AD-Hanover NH",,5-Nov-12
151A3,7-Mar-12,"06388B2-Lebanon NH",,16-Nov-12
CA62,3-Jan-12,,,
D31B,18-Dec-11,"06405B2-Lebanon NH",,9-Jan-12
20F5,8-Jul-12,"0669C50-Randolph VT",,3-Feb-12
8096,19-Dec-11,"0649597-White River VT","0649597-White River VT",9-Apr-12
14E48,1-Aug-12,"06D3206-Hanover NH",,
177F8,20-Aug-12,"063B208-Randolph VT","063B208-Randolph VT",5-May-12
553E,11-Oct-12,"06D95A3-Hanover NH","06D95A3-Hanover NH",8-Mar-12
12D5F,18-Jul-12,"0649597-White River VT","0649597-White River VT",2-Nov-12
C6DC,13-Apr-12,"06388B2-Lebanon NH",,
11795,27-Feb-12,"0643D38-Hanover NH","0643D38-Hanover NH",19-Jun-12
17B43,11-Aug-12,,,22-Oct-12
A036,11-Aug-12,"06D3206-Hanover NH",,19-Jun-12

My code looks like this currently:

df_bigdata_duplicates = df_bigdata[df_bigdata.duplicated(cols='ID')]

There are a couple of duplicate items. But when I use the above code, I only get the first item. In the API reference, I see how I can get the last item, but I would like to have all of them so I can visually inspect them to see why I am getting the discrepancy. So, in this example I would like to get all three A036 entries, both 11795 entries, and any other duplicated entries, instead of just the first one. Any help is most appreciated.

smci
BigHandsome
    "Duplicates" can mean various things" In your case, you only want to consider **duplicates in a single column `ID`**, not "rows identical in multiple or all columns". – smci May 31 '20 at 02:39

13 Answers

288

Method #1: print all rows where the ID is one of the IDs in duplicated:

>>> import pandas as pd
>>> df = pd.read_csv("dup.csv")
>>> ids = df["ID"]
>>> df[ids.isin(ids[ids.duplicated()])].sort_values("ID")
       ID ENROLLMENT_DATE        TRAINER_MANAGING        TRAINER_OPERATOR FIRST_VISIT_DATE
24  11795       27-Feb-12      0643D38-Hanover NH      0643D38-Hanover NH        19-Jun-12
6   11795        3-Jul-12  0649597-White River VT  0649597-White River VT        30-Mar-12
18   8096       19-Dec-11  0649597-White River VT  0649597-White River VT         9-Apr-12
2    8096        8-Aug-12      0643D38-Hanover NH      0643D38-Hanover NH        25-Jun-12
12   A036       30-Nov-11     063B208-Randolph VT     063B208-Randolph VT              NaN
3    A036        1-Apr-12      06CB8CF-Hanover NH      06CB8CF-Hanover NH         9-Aug-12
26   A036       11-Aug-12      06D3206-Hanover NH                     NaN        19-Jun-12

but I couldn't think of a nice way to prevent repeating ids so many times. I prefer method #2: groupby on the ID.

>>> pd.concat(g for _, g in df.groupby("ID") if len(g) > 1)
       ID ENROLLMENT_DATE        TRAINER_MANAGING        TRAINER_OPERATOR FIRST_VISIT_DATE
6   11795        3-Jul-12  0649597-White River VT  0649597-White River VT        30-Mar-12
24  11795       27-Feb-12      0643D38-Hanover NH      0643D38-Hanover NH        19-Jun-12
2    8096        8-Aug-12      0643D38-Hanover NH      0643D38-Hanover NH        25-Jun-12
18   8096       19-Dec-11  0649597-White River VT  0649597-White River VT         9-Apr-12
3    A036        1-Apr-12      06CB8CF-Hanover NH      06CB8CF-Hanover NH         9-Aug-12
12   A036       30-Nov-11     063B208-Randolph VT     063B208-Randolph VT              NaN
26   A036       11-Aug-12      06D3206-Hanover NH                     NaN        19-Jun-12
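
A related minimal sketch: `groupby(...).filter(...)` keeps only the groups with more than one row, so it expresses the same idea without building the list for `concat`, returns an empty frame when there are no duplicates, and accepts a list of column names when the duplicate key spans several columns (the multi-column names below are hypothetical, not from the question's data):

>>> df.groupby("ID").filter(lambda g: len(g) > 1)    # all rows whose ID occurs more than once
>>> df.groupby(["customer_id", "product_id", "price"]).filter(lambda g: len(g) > 1)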
Test
DSM
  • Method #2 fails ("No objects to concatenate") if there are no dups – CPBL May 29 '17 at 23:11
  • what does `g for _` do? – user77005 Jan 05 '18 at 04:01
  • @user77005 you might've figured it out already, but for everyone's benefit, it reads like this: `g for (placeholder, g) in df.groupby('bla') if 'bla'`; the underscore is a typical placeholder symbol for an argument we have to accept but don't want to use for anything, as in a lambda-like expression. – stucash Feb 01 '18 at 00:15
  • Method #1 needs to be updated: `sort` was deprecated for DataFrames in favor of either `sort_values` or `sort_index` [Related SO Q&A](https://stackoverflow.com/questions/44123874/dataframe-object-has-no-attribute-sort) – tatlar Nov 22 '18 at 18:27
  • I have tried this and the snippets provided in this thread, **they all return different results**. Can anyone share which is more appropriate to use? https://stackoverflow.com/questions/57909119/to-sort-group-and-display-duplicated-values-of-a-column – Organic Heart Sep 13 '19 at 01:06
  • method 2 is great for small datasets, but very quickly the memory explodes on bigger dataframes. – Shoval Sadde Sep 01 '21 at 07:00
  • I'm trying to group by customer_id, product_id and price and get an error when running the following: pd.concat(g for _, g in df.groupby("customer_id", "product_id", "price") if len(g) > 1) – Jag99 Sep 27 '21 at 07:40
  • method 2 is way slower than method 1 – Sad Pencil Dec 09 '22 at 14:52
258

With pandas version 0.17, you can set `keep=False` in the `duplicated` method to get all the duplicate items.

In [1]: import pandas as pd

In [2]: df = pd.DataFrame(['a','b','c','d','a','b'])

In [3]: df
Out[3]: 
       0
    0  a
    1  b
    2  c
    3  d
    4  a
    5  b

In [4]: df[df.duplicated(keep=False)]
Out[4]: 
       0
    0  a
    1  b
    4  a
    5  b
user666
  • Bingo, there's the answer. So: str or str or boolean... odd API choice. `'all'` would be more logical and intuitive IMO. – Jarad Jan 10 '18 at 23:39
  • @Jarad You don't find it to be intuitive that `keep=False` means, "yes, keep everything"? Strange. /s – Connor Feb 02 '21 at 00:56
246
df[df.duplicated(['ID'], keep=False)]

it'll return all duplicated rows back to you.

According to the documentation:

keep : {‘first’, ‘last’, False}, default ‘first’

  • 'first' : Mark duplicates as True except for the first occurrence.
  • 'last' : Mark duplicates as True except for the last occurrence.
  • False : Mark all duplicates as True.
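
For illustration, on the question's `ID` column the three options would select the following (a sketch, assuming the question's `df`):

df[df.duplicated(['ID'], keep='first')]   # duplicated rows except the first occurrence of each ID
df[df.duplicated(['ID'], keep='last')]    # duplicated rows except the last occurrence of each ID
df[df.duplicated(['ID'], keep=False)]     # every row whose ID appears more than once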
smci
Kelly ChowChow
28

As I am unable to comment, I am posting this as a separate answer.

To find duplicates on the basis of more than one column, pass every column name as below, and it will return the set of duplicated rows:

df[df[['product_uid', 'product_title', 'user']].duplicated() == True]

Alternatively,

df[df[['product_uid', 'product_title', 'user']].duplicated()]
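
As in the other answers, `keep=False` can be added if every occurrence (including the first) should be returned; a sketch with the same hypothetical column names:

df[df[['product_uid', 'product_title', 'user']].duplicated(keep=False)]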
Deepak
18
df[df['ID'].duplicated() == True]

This worked for me

Hariprasad
6

sort("ID") does not seem to be working now, seems deprecated as per sort doc, so use sort_values("ID") instead to sort after duplicate filter, as following:

df[df.ID.duplicated(keep=False)].sort_values("ID")
Nafeez Quraishi
5

You could use:

df[df.duplicated(['ID'])==True].sort_values('ID')

To get the duplicated rows and their index locations across all column values:

def dup_rows_index(df):
    dup = df[df.duplicated()]
    print('Duplicated index loc:', dup.index.tolist())
    return dup
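
A hypothetical call on the question's frame (the name `df_bigdata` is taken from the question); note that without a subset argument, `duplicated()` looks for whole-row duplicates rather than duplicated IDs:

dup_rows_index(df_bigdata)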
PREM JILLA
  • Please, can you extend your answer with more detailed explanation? This will be very useful for understanding. Thank you! – vezunchik Apr 09 '19 at 07:56
  • Welcome to Stack Overflow and thanks for your contribution! It would be kind if you could extend you answer by an explanation. Here you find a guide [How to give a good answer](https://stackoverflow.com/help/how-to-answer). Thanks! – David Apr 09 '19 at 09:03
4

Using an element-wise logical OR, and setting the `take_last` argument of the pandas `duplicated` method to both `True` and `False`, you can obtain a set from your dataframe that includes all of the duplicates.

df_bigdata_duplicates = df_bigdata[
    df_bigdata.duplicated(cols='ID', take_last=False) |
    df_bigdata.duplicated(cols='ID', take_last=True)
]
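
In later pandas versions, `cols` and `take_last` were replaced by `subset` and `keep`, so a sketch of the same logical-OR idea with the current API would be:

df_bigdata_duplicates = df_bigdata[
    df_bigdata.duplicated(subset='ID', keep='first') |
    df_bigdata.duplicated(subset='ID', keep='last')
]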
yoonghm
Oshbocker
4

This may not be a solution to the question, but it serves to illustrate the behaviour with examples:

import pandas as pd

df = pd.DataFrame({
    'A': [1,1,3,4],
    'B': [2,2,5,6],
    'C': [3,4,7,6],
})

print(df)
print(df.duplicated(keep=False))
print(df.duplicated(['A','B'], keep=False))

The outputs:

   A  B  C
0  1  2  3
1  1  2  4
2  3  5  7
3  4  6  6

0    False
1    False
2    False
3    False
dtype: bool

0     True
1     True
2    False
3    False
dtype: bool
yoonghm
4

For my database, `.duplicated(keep=False)` did not work until the column was sorted.

data.sort_values(by=['Order ID'], inplace=True)
df = data[data['Order ID'].duplicated(keep=False)]
smci
LetzerWille
2

This code gives you a boolean Series indicating whether each row is a repetition of an earlier row in the data frame:

df2 = df1.duplicated()

This code eliminates the duplications and keeps only one instance of each row:

df3 = df1.drop_duplicates(keep="first")

df3 will be a data frame consisting of the unique rows.
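
A minimal sketch of both calls on a toy frame (the values are made up for illustration):

import pandas as pd

df1 = pd.DataFrame({'ID': ['A036', '8096', 'A036'], 'SITE': ['Hanover NH', 'White River VT', 'Hanover NH']})
df2 = df1.duplicated()                    # False, False, True: row 2 repeats row 0
df3 = df1.drop_duplicates(keep="first")   # rows 0 and 1 remain
print(df2)
print(df3)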

Farzad Amirjavid
0

Inspired by the solutions above, you can further sort the values so that you can look at the duplicated records grouped together:

df[df.duplicated(['ID'], keep=False)].sort_values(by='ID')
Taie
0

TL;DR

This worked for me:

dups = [i for i, v in df["Col1"].value_counts().iteritems() if v > 1]
dups

Output:

[501, 505]

To list duplicate rows:

fltr = df["Col1"].isin(dups)  # Filter
df[fltr]

Output:

   Col1  Col2
0   501     D
1   501     H
2   505     E
3   501     E
4   505     M

Explanation:

Taking value_counts() of a column, say Col1, returns a Series with:

  1. The distinct values of Col1 as the Series index.
  2. The count of each value as the Series values.

For example, value_counts() on the below DataFrame:

   Col1 Col2
0   501    D
1   501    H
2   505    E
3   501    E
4   505    M
5   502    A
6   503    N

df["Col1"].value_counts()

Outputs the below Series:

501    3
505    2
502    1
503    1
Name: Col1, dtype: int64

Now, using iteritems(), we can access both the index and the values of a Series object:

dups = [i for i, v in df["Col1"].value_counts().iteritems() if v > 1]
dups

Output:

[501, 505]

Now use the captured duplicate values as a filter on the original DataFrame:

fltr = df["Col1"].isin(dups)  # Filter
df[fltr]
   Col1  Col2
0   501     D
1   501     H
2   505     E
3   501     E
4   505     M
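
Note that `iteritems()` was removed in pandas 2.0 (`items()` is its replacement); the same filter can also be built without the Python-level loop, e.g. a sketch using the same Col1 column:

counts = df["Col1"].value_counts()
dups = counts[counts > 1].index.tolist()   # values occurring more than once
df[df["Col1"].isin(dups)]                  # all rows carrying a duplicated value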
Dheemanth Bhat