I have a DataFrame that has duplicated rows. I'd like to get a DataFrame with a unique index and no duplicates. It's ok to discard the duplicated values. Is this possible? Would it be done with groupby?

JJJ
Adam Greenhall

2 Answers

In [29]: df.drop_duplicates()
Out[29]: 
   b  c
1  2  3
3  4  0
7  5  9
Wouter Overmeire
  • It's worthwhile to note this takes either the first or last occurrence. So you need to sort by some other quantity first (if you're lucky) or do some complicated groupby logic anyway. – ely Sep 08 '12 at 02:20
  • 2
    This is wrong. drop_duplicates acts on the values only (at least in my version). You need to reset_index if you want to drop on index and values or just work with the index if you want to have a unique index. Maybe there is another way besides groupby to enforce unique index? – mathtick Jul 11 '13 at 14:02
  • 1
    Use `df.drop_duplicates(inplace=True)` if you don't want to assign a new variable. – Flavian Hautbois Mar 23 '15 at 11:22
  • This does not give a DataFrame with a unique index; the solution by @Adam Greenhall below, however, works for that. – dashesy Apr 12 '15 at 18:21
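To illustrate what the comments above mean: `drop_duplicates` compares row values only, not the index, and the `keep` parameter (in reasonably recent pandas versions) controls which occurrence survives. A minimal sketch, using a small made-up frame like the one in the other answer:

```python
import pandas as pd

df = pd.DataFrame({'b': [2, 2, 4, 5], 'c': [3, 3, 0, 9]}, index=[1, 1, 3, 7])

# Values-only deduplication; keep='first' (the default) or keep='last'
# decides which of the duplicated rows is retained.
deduped = df.drop_duplicates(keep='first')

# To make the index count as part of the row when checking for
# duplicates, move it into a column first, then restore it.
deduped_with_index = (df.reset_index()
                        .drop_duplicates()
                        .set_index('index'))
```

Here both give the same three rows, because the duplicated rows also share an index label; they differ when equal values sit under different labels.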

Figured out one way to do it by reading the split-apply-combine documentation examples.

import pandas

df = pandas.DataFrame({'b': [2, 2, 4, 5], 'c': [3, 3, 0, 9]}, index=[1, 1, 3, 7])
df_unique = df.groupby(level=0).first()

df
   b  c
1  2  3
1  2  3
3  4  0
7  5  9

df_unique
   b  c
1  2  3
3  4  0
7  5  9
Adam Greenhall
  • This relies on the row index being duplicated for rows where the data fields (b,c) are duplicated, effectively making the index part of your row as vector that you want to be unique (not duplicated). – hobs Nov 01 '12 at 20:32
  • 4
    If you have duplicated index entries, this is the answer you want. – rogueleaderr Jun 04 '14 at 00:59
  • I was getting `ValueError: Index contains duplicate entries, cannot reshape` when doing `unstack` on a MultiIndex, but this solution works for that too; I only had to do `df_unique = df.groupby(level=[0,1]).first()`. – dashesy Apr 12 '15 at 18:19
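For the specific goal of a unique index, an alternative worth knowing (assuming a reasonably recent pandas) is a boolean mask built from `Index.duplicated`, which avoids the groupby entirely. A sketch using the same example frame:

```python
import pandas as pd

df = pd.DataFrame({'b': [2, 2, 4, 5], 'c': [3, 3, 0, 9]}, index=[1, 1, 3, 7])

# groupby approach from the answer above: first row per index label.
df_unique = df.groupby(level=0).first()

# Equivalent here: mark every repeat of an index label and keep only
# the first occurrence, preserving the original row order.
df_unique2 = df[~df.index.duplicated(keep='first')]
```

Note a subtle difference: `groupby(...).first()` returns the first non-null value per column within each group, while the mask keeps whole rows as-is, so they can diverge when the duplicated rows contain NaNs.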