
If I want to drop rows with a duplicated index in a DataFrame, the following doesn't work, for obvious reasons:

myDF.drop_duplicates(cols=index)

and

myDF.drop_duplicates(cols='index') 

looks for a column named 'index'

If I want to drop the duplicated index values I have to do:

myDF['index'] = myDF.index
myDF = myDF.drop_duplicates(cols='index')
myDF = myDF.set_index(myDF['index'])
myDF = myDF.drop('index', axis=1)

Is there a more efficient way?
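For reference, here is a minimal sketch of the same round-trip written with the current pandas keyword (`subset=` instead of the old `cols=`); the toy DataFrame below is made up purely for illustration:

import pandas as pd

# toy DataFrame with a duplicated index label 'a' (made up for illustration)
myDF = pd.DataFrame({'val': [1, 2, 3]}, index=['a', 'a', 'b'])

# copy the index into a column, drop duplicates on it, then restore the index
myDF = (myDF.reset_index()                     # the index becomes a column named 'index'
             .drop_duplicates(subset='index')  # keep the first row per index value
             .set_index('index'))              # move the column back to the index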

RukTech
  • http://stackoverflow.com/questions/13035764/remove-rows-with-duplicate-indices-pandas-dataframe-and-timeseries – Paul H Apr 07 '14 at 17:11
  • 1
  • @PaulH: The answer to your question by Luciano is the same as my question, just in a single line – RukTech Apr 07 '14 at 17:20

3 Answers


Simply: `DF.groupby(DF.index).first()`
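A quick illustration on a toy frame (the data is made up for the example); note that `groupby` returns the surviving rows in sorted index order:

import pandas as pd

# toy DataFrame with duplicate index labels (made up for illustration)
DF = pd.DataFrame({'val': [1, 2, 3, 4]}, index=['b', 'b', 'a', 'a'])

# group the rows by their index label and keep the first row of each group
print(DF.groupby(DF.index).first())
#    val
# a    3
# b    1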

CT Zhu
  • @CT Zhu - If I use this method it is combining my two index columns into a single column. I don't want that to happen. Is there a way around it? – liv2hak Nov 18 '15 at 18:57
  • @liv2hak, mind asking a new question with a minimal example dataset? – CT Zhu Nov 19 '15 at 01:04
  • @CTZhu - I have figured that out. But can you take a look at http://stackoverflow.com/questions/33792915/pandas-mean-calculation-over-a-column-in-a-csv. Thanks – liv2hak Nov 19 '15 at 01:36
  • @CTZhu This transforms the geopandas data frame to a pandas data frame, which may create problems (it did for me). – Duccio Piovani Oct 29 '17 at 16:30

The `duplicated` method works for DataFrames, Series, and Index objects. Just select the rows whose index isn't marked as a duplicate:

df[~df.index.duplicated()]
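A short sketch of how the `keep` argument changes which row survives (toy data, made up for illustration):

import pandas as pd

# toy DataFrame with a duplicated index label 'a' (made up for illustration)
df = pd.DataFrame({'val': [1, 2, 3]}, index=['a', 'a', 'b'])

# keep the first occurrence of each index label (the default)
print(df[~df.index.duplicated(keep='first')])
#    val
# a    1
# b    3

# keep the last occurrence instead
print(df[~df.index.duplicated(keep='last')])
#    val
# a    2
# b    3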
danielstn
  • This would drop all duplicates though? – The Unfun Cat Dec 01 '15 at 11:11
  • Note that this is the fastest method for the test cases I investigated: http://stackoverflow.com/questions/13035764/remove-rows-with-duplicate-indices-pandas-dataframe-and-timeseries/34297689#34297689 You can also reproduce the behavior of the accepted answer exactly using: `df[~df.index.duplicated(keep='first')]` – n8yoder Dec 15 '15 at 19:34
  • `keep` defaults to `'first'` anyway. – Jérôme May 04 '18 at 14:29

You can use `numpy.unique` to obtain the positions of the first occurrence of each index value, and then use `iloc` to select those rows:

>>> df
        val
A  0.021372
B  1.229482
D -1.571025
D -0.110083
C  0.547076
B -0.824754
A -1.378705
B -0.234095
C -1.559653
B -0.531421

[10 rows x 1 columns]

>>> idx = np.unique(df.index, return_index=True)[1]
>>> df.iloc[idx]
        val
A  0.021372
B  1.229482
C  0.547076
D -1.571025

[4 rows x 1 columns]
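One caveat: `np.unique` returns the first-occurrence positions in sorted label order, which is why the result above comes back ordered A, B, C, D rather than in the original row order. If the original order matters, sorting the positions first restores it (a small sketch reusing the same `df`):

import numpy as np

# positions of the first occurrence of each index label
idx = np.unique(df.index, return_index=True)[1]

# sort the positions so the surviving rows keep their original order (A, B, D, C here)
df.iloc[np.sort(idx)]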
behzad.nouri