47

Given a DataFrame, I want to find the duplicated indexes whose rows do not have duplicate values in the columns, and see which values differ.

Specifically, I have this dataframe:

import pandas as pd

# Download the data first (shell command, not Python):
# wget https://www.dropbox.com/s/vmimze2g4lt4ud3/alt_exon_repeatmasker_intersect.bed

alt_exon_repeatmasker = pd.read_table('alt_exon_repeatmasker_intersect.bed',
                                      header=None, index_col=3)

In [74]: alt_exon_repeatmasker.index.is_unique
Out[74]: False

Some of the indexes have duplicate values in the 9th column (the type of DNA repetitive element at that location), and I want to know the different types of repetitive elements for individual locations (each index = a genome location).

I'm guessing this will require some kind of groupby and hopefully some groupby ninja can help me out.

To simplify even further, suppose we only have the index and the repeat type:

genome_location1    MIR3
genome_location1    AluJb
genome_location2    Tigger1
genome_location3    AT_rich

The output I'd like to see is all duplicate indexes and their repeat types, like this:

genome_location1    MIR3
genome_location1    AluJb

EDIT: added toy example
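
For reference, the toy example as a runnable DataFrame (a minimal sketch; the column name 'type' is an assumption, since the question's data has no header):

import pandas as pd

# Toy version of the data: index = genome location, one column of repeat types.
df = pd.DataFrame(
    {'type': ['MIR3', 'AluJb', 'Tigger1', 'AT_rich']},
    index=['genome_location1', 'genome_location1',
           'genome_location2', 'genome_location3'])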

– Olga Botvinnik
  • Hi, usually it's good practice to simplify the question as much as possible and create a toy example with input and desired output. Such questions are answered much faster and will be useful for future readers. – Roman Pekar Nov 25 '13 at 17:17

7 Answers

84

Also useful and very succinct:

df[df.index.duplicated()]

Note that this omits the first occurrence of each duplicated label, so to see all the duplicated rows you'll want this:

df[df.index.duplicated(keep=False)]
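
For example, on the toy data from the question (a minimal sketch; the column name 'type' is assumed):

import pandas as pd

df = pd.DataFrame(
    {'type': ['MIR3', 'AluJb', 'Tigger1', 'AT_rich']},
    index=['genome_location1', 'genome_location1',
           'genome_location2', 'genome_location3'])

# keep='first' (the default) marks every occurrence except the first,
# so this shows only the second genome_location1 row.
print(df[df.index.duplicated()])

# keep=False marks all occurrences of a duplicated label,
# so this shows both genome_location1 rows.
print(df[df.index.duplicated(keep=False)])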

– Matthew
  • There is a great example in the [docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.duplicated.html) – agent18 Dec 07 '20 at 23:30
  • 3
    **Example Usage**: 1. `idx = pd.Index(['lama', 'cow', 'lama', 'beetle', 'lama'])` 2. `idx.duplicated()` 3. `array([False, False, True, False, True])` – agent18 Dec 07 '20 at 23:31
23
df.groupby(level=0).filter(lambda x: len(x) > 1)['type']

We added the filter method for exactly this kind of operation. You can also use masking and transform for equivalent results, but this is faster, and a little more readable too.

Important:

The filter method was introduced in version 0.12, but it failed to work on DataFrames/Series with nonunique indexes. The issue -- and a related issue with transform on Series -- was fixed for version 0.13, which should be released any day now.

Clearly, nonunique indexes are the heart of this question, so I should point out that this approach will not help until you have pandas 0.13. In the meantime, the transform workaround is the way to go. Beware that if you try that on a Series with a nonunique index, it too will fail.

There is no good reason why filter and transform should not be applied to nonunique indexes; it was just poorly implemented at first.
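
On the toy frame from the question, the filter call gives the following (a sketch, assuming pandas >= 0.13 and a 'type' column):

import pandas as pd

df = pd.DataFrame(
    {'type': ['MIR3', 'AluJb', 'Tigger1', 'AT_rich']},
    index=['genome_location1', 'genome_location1',
           'genome_location2', 'genome_location3'])

# Group rows by index label (level=0) and keep only groups with more
# than one row, i.e. the duplicated genome locations.
print(df.groupby(level=0).filter(lambda x: len(x) > 1)['type'])
# genome_location1     MIR3
# genome_location1    AluJb
# Name: type, dtype: object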

– Dan Allan
  • this one is not working for me, I've tried even `df.groupby(level=0).filter(lambda x: True)`, receiving `Exception: Reindexing only valid with uniquely valued Index objects`. – Roman Pekar Nov 25 '13 at 19:32
  • Good catch! This particular use encounters a bug that was fixed for v0.13, which obviously many users do not have. Answer updated. – Dan Allan Nov 25 '13 at 19:58
  • Thanks! I'm still on 0.12 and will stick to it until v0.13 is fully released because I'm sharing a codebase and `virtualenv` messes everything up for me. I'll switch to this once we upgrade! Thanks! I've been using `pandas` for a year but I'm still wrapping my mind around `groupby`s – Olga Botvinnik Nov 25 '13 at 20:12
  • 1
    When looking at efficiency, I doubt this is faster than `df.set_index('type').index.duplicates` - since it has to iterate through every single group in order to perform, instead of looking at it "from outside the groups". – FooBar May 05 '14 at 10:49
  • I didn't know about `index.duplicates`. Add it as an answer; that's definitely better. – Dan Allan May 06 '14 at 12:53
22

Even faster and better:

df.index.get_duplicates()
– feetwet
  • 2
    good answer but it comes with future warning for pandas 0.23.4 `'get_duplicates' is deprecated and will be removed in a future release. You can use idx[idx.duplicated()].unique() instead` – Ali Faizan Feb 04 '19 at 13:12
9

As of 9/21/18, pandas emits FutureWarning: 'get_duplicates' is deprecated and will be removed in a future release, suggesting the following instead:

df.index[df.index.duplicated()].unique()
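
On the toy data this returns just the index labels that occur more than once (a sketch):

import pandas as pd

df = pd.DataFrame(
    {'type': ['MIR3', 'AluJb', 'Tigger1', 'AT_rich']},
    index=['genome_location1', 'genome_location1',
           'genome_location2', 'genome_location3'])

# Unique index labels that appear more than once.
print(df.index[df.index.duplicated()].unique())
# Index(['genome_location1'], dtype='object')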
– Jwok
7
>>> # transform(len) broadcasts each index group's size onto every row,
>>> # so the boolean mask aligns with df's duplicated index.
>>> df[df.groupby(level=0).transform(len)['type'] > 1]
                   type
genome_location1   MIR3
genome_location1  AluJb
– Roman Pekar
2

More succinctly:

df[df.groupby(level=0).type.count() > 1]

FYI, for a MultiIndex:

df[df.groupby(level=[0,1]).type.count() > 1]
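
This works because pandas aligns the per-label counts back onto the DataFrame's duplicated index when applying the boolean mask (a sketch on the toy data):

import pandas as pd

df = pd.DataFrame(
    {'type': ['MIR3', 'AluJb', 'Tigger1', 'AT_rich']},
    index=['genome_location1', 'genome_location1',
           'genome_location2', 'genome_location3'])

# count() yields one value per unique index label; the boolean Series is
# broadcast back onto df's (duplicated) index during selection.
print(df[df.groupby(level=0).type.count() > 1])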
– Fil
-1

This gives you the index values along with a preview of the duplicated rows:

def dup_rows_index(df):
    # Rows whose column values duplicate an earlier row
    # (df.duplicated() compares values, not the index).
    dup = df[df.duplicated()]
    print('Duplicated index loc:', dup.index.tolist())
    return dup
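
Note that df.duplicated() compares row values, not index labels, so on the toy data from the question (where every row's values differ) this finds nothing; a quick check using the function above:

import pandas as pd

df = pd.DataFrame(
    {'type': ['MIR3', 'AluJb', 'Tigger1', 'AT_rich']},
    index=['genome_location1', 'genome_location1',
           'genome_location2', 'genome_location3'])

# No row's values repeat, so this prints an empty list and
# returns an empty DataFrame despite the duplicated index.
print(dup_rows_index(df))
# Duplicated index loc: []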