47

Given a DataFrame, I want to find the duplicated indexes whose rows do not have duplicate values in the columns, and see which values differ.

Specifically, I have this dataframe:

import pandas as pd

# Download the data first (shell command, not Python):
# wget https://www.dropbox.com/s/vmimze2g4lt4ud3/alt_exon_repeatmasker_intersect.bed

alt_exon_repeatmasker = pd.read_table('alt_exon_repeatmasker_intersect.bed',
                                      header=None, index_col=3)

In [74]: alt_exon_repeatmasker.index.is_unique
Out[74]: False

Some of the indexes have duplicate values in the 9th column (the type of DNA repetitive element at that location), and I want to know the different types of repetitive elements for individual locations (each index = a genome location).

I'm guessing this will require some kind of groupby and hopefully some groupby ninja can help me out.

To simplify even further, suppose we only have the index and the repeat type:

genome_location1    MIR3
genome_location1    AluJb
genome_location2    Tigger1
genome_location3    AT_rich

The output I'd like to see is all duplicate indexes and their repeat types, like this:

genome_location1    MIR3
genome_location1    AluJb

EDIT: added toy example
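
For reference, the toy example as a runnable DataFrame (a minimal sketch; the column name 'type' is an assumption, since the question's data has no header):

import pandas as pd

# Toy version of the data: index = genome location, one column of repeat types.
df = pd.DataFrame(
    {'type': ['MIR3', 'AluJb', 'Tigger1', 'AT_rich']},
    index=['genome_location1', 'genome_location1',
           'genome_location2', 'genome_location3'])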

– Olga Botvinnik
  • Hi, usually it's good practice to simplify the question as much as possible and create a toy example with input and desired output. Such questions are answered much faster and will be useful for future readers. – Roman Pekar Nov 25 '13 at 17:17

7 Answers

84

Also useful and very succinct:

df[df.index.duplicated()]

Note that this omits the first occurrence of each duplicated label, so to see all the duplicated rows you'll want this:

df[df.index.duplicated(keep=False)]
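
For example, on the toy data from the question (a minimal sketch; the column name 'type' is assumed):

import pandas as pd

df = pd.DataFrame(
    {'type': ['MIR3', 'AluJb', 'Tigger1', 'AT_rich']},
    index=['genome_location1', 'genome_location1',
           'genome_location2', 'genome_location3'])

# keep='first' (the default) marks every occurrence except the first,
# so this shows only the second genome_location1 row.
print(df[df.index.duplicated()])

# keep=False marks all occurrences of a duplicated label,
# so this shows both genome_location1 rows.
print(df[df.index.duplicated(keep=False)])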

– Matthew
  • There is a great example in the [docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.duplicated.html) – agent18 Dec 07 '20 at 23:30
  • 3
    **Example Usage**: 1. `idx = pd.Index(['lama', 'cow', 'lama', 'beetle', 'lama'])` 2. `idx.duplicated()` 3. `array([False, False, True, False, True])` – agent18 Dec 07 '20 at 23:31
23
df.groupby(level=0).filter(lambda x: len(x) > 1)['type']

We added the filter method for exactly this kind of operation. You can also use masking and transform for equivalent results, but this is faster, and a little more readable too.

Important:

The filter method was introduced in version 0.12, but it failed to work on DataFrames/Series with nonunique indexes. The issue -- and a related issue with transform on Series -- was fixed for version 0.13, which should be released any day now.

Clearly, nonunique indexes are the heart of this question, so I should point out that this approach will not help until you have pandas 0.13. In the meantime, the transform workaround is the way to go. Beware that if you try that on a Series with a nonunique index, it too will fail.

There is no good reason why filter and transform should not be applied to nonunique indexes; it was just poorly implemented at first.
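
On the toy frame from the question, the filter call gives the following (a sketch, assuming pandas >= 0.13 and a 'type' column):

import pandas as pd

df = pd.DataFrame(
    {'type': ['MIR3', 'AluJb', 'Tigger1', 'AT_rich']},
    index=['genome_location1', 'genome_location1',
           'genome_location2', 'genome_location3'])

# Group rows by index label (level=0) and keep only groups with more
# than one row, i.e. the duplicated genome locations.
print(df.groupby(level=0).filter(lambda x: len(x) > 1)['type'])
# genome_location1     MIR3
# genome_location1    AluJb
# Name: type, dtype: object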

– Dan Allan
  • this one is not working for me, I've tried even `df.groupby(level=0).filter(lambda x: True)`, receiving `Exception: Reindexing only valid with uniquely valued Index objects`. – Roman Pekar Nov 25 '13 at 19:32
  • Good catch! This particular use encounters a bug that was fixed for v0.13, which obviously many users do not have. Answer updated. – Dan Allan Nov 25 '13 at 19:58
  • Thanks! I'm still on 0.12 and will stick to it until v0.13 is fully released because I'm sharing a codebase and `virtualenv` messes everything up for me. I'll switch to this once we upgrade! Thanks! I've been using `pandas` for a year but I'm still wrapping my mind around `groupby`s – Olga Botvinnik Nov 25 '13 at 20:12
  • 1
    When looking at efficiency, I doubt this is faster than `df.set_index('type').index.duplicates` - since it has to iterate through every single group in order to perform, instead of looking at it "from outside the groups". – FooBar May 05 '14 at 10:49
  • I didn't know about `index.duplicates`. Add it as an answer; that's definitely better. – Dan Allan May 06 '14 at 12:53
22

Even faster and better:

df.index.get_duplicates()
– feetwet
  • 2
    good answer but it comes with future warning for pandas 0.23.4 `'get_duplicates' is deprecated and will be removed in a future release. You can use idx[idx.duplicated()].unique() instead` – Ali Faizan Feb 04 '19 at 13:12
9

As of 9/21/18, pandas emits FutureWarning: 'get_duplicates' is deprecated and will be removed in a future release, suggesting the following instead:

df.index[df.index.duplicated()].unique()
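
On the toy data this returns just the index labels that occur more than once (a sketch):

import pandas as pd

df = pd.DataFrame(
    {'type': ['MIR3', 'AluJb', 'Tigger1', 'AT_rich']},
    index=['genome_location1', 'genome_location1',
           'genome_location2', 'genome_location3'])

# Unique index labels that appear more than once.
print(df.index[df.index.duplicated()].unique())
# Index(['genome_location1'], dtype='object')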
– Jwok
7
>>> # transform(len) broadcasts each index group's size onto every row,
>>> # so the boolean mask aligns with df's duplicated index.
>>> df[df.groupby(level=0).transform(len)['type'] > 1]
                   type
genome_location1   MIR3
genome_location1  AluJb
– Roman Pekar
2

More succinctly:

df[df.groupby(level=0).type.count() > 1]

FYI, for a MultiIndex:

df[df.groupby(level=[0,1]).type.count() > 1]
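
This works because pandas aligns the per-label counts back onto the DataFrame's duplicated index when applying the boolean mask (a sketch on the toy data):

import pandas as pd

df = pd.DataFrame(
    {'type': ['MIR3', 'AluJb', 'Tigger1', 'AT_rich']},
    index=['genome_location1', 'genome_location1',
           'genome_location2', 'genome_location3'])

# count() yields one value per unique index label; the boolean Series is
# broadcast back onto df's (duplicated) index during selection.
print(df[df.groupby(level=0).type.count() > 1])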
– Fil
-1

This gives you the index values along with a preview of the duplicated rows:

def dup_rows_index(df):
    # Rows whose column values duplicate an earlier row
    # (df.duplicated() compares values, not the index).
    dup = df[df.duplicated()]
    print('Duplicated index loc:', dup.index.tolist())
    return dup
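
Note that df.duplicated() compares row values, not index labels, so on the toy data from the question (where every row's values differ) this finds nothing; a quick check using the function above:

import pandas as pd

df = pd.DataFrame(
    {'type': ['MIR3', 'AluJb', 'Tigger1', 'AT_rich']},
    index=['genome_location1', 'genome_location1',
           'genome_location2', 'genome_location3'])

# No row's values repeat, so this prints an empty list and
# returns an empty DataFrame despite the duplicated index.
print(dup_rows_index(df))
# Duplicated index loc: []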