
If I want to drop rows with a duplicated index in a DataFrame, the following doesn't work, for obvious reasons:

myDF.drop_duplicates(cols=index)

and

myDF.drop_duplicates(cols='index') 

looks for a column named 'index'

If I want to drop the duplicated index values I have to do:

myDF['index'] = myDF.index
myDF = myDF.drop_duplicates(cols='index')
myDF = myDF.set_index(myDF['index'])
myDF = myDF.drop('index', axis=1)

Is there a more efficient way?
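For reference, here is a minimal sketch of the same round-trip written with the current pandas keyword (`subset=` instead of the old `cols=`); the toy DataFrame below is made up purely for illustration:

import pandas as pd

# toy DataFrame with a duplicated index label 'a' (made up for illustration)
myDF = pd.DataFrame({'val': [1, 2, 3]}, index=['a', 'a', 'b'])

# copy the index into a column, drop duplicates on it, then restore the index
myDF = (myDF.reset_index()                     # the index becomes a column named 'index'
             .drop_duplicates(subset='index')  # keep the first row per index value
             .set_index('index'))              # move the column back to the index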

RukTech
  • http://stackoverflow.com/questions/13035764/remove-rows-with-duplicate-indices-pandas-dataframe-and-timeseries – Paul H Apr 07 '14 at 17:11
  • 1
  • @PaulH: The answer to your question by Luciano is the same as my question, just in a single line – RukTech Apr 07 '14 at 17:20

3 Answers


Simply: `DF.groupby(DF.index).first()`
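A quick illustration on a toy frame (the data is made up for the example); note that `groupby` returns the surviving rows in sorted index order:

import pandas as pd

# toy DataFrame with duplicate index labels (made up for illustration)
DF = pd.DataFrame({'val': [1, 2, 3, 4]}, index=['b', 'b', 'a', 'a'])

# group the rows by their index label and keep the first row of each group
print(DF.groupby(DF.index).first())
#    val
# a    3
# b    1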

CT Zhu
  • @CT Zhu - If I use this method it is combining my two index columns into a single column. I don't want that to happen. Is there a way around it? – liv2hak Nov 18 '15 at 18:57
  • @liv2hak, mind asking a new question with a minimal example dataset? – CT Zhu Nov 19 '15 at 01:04
  • @CTZhu - I have figured that out. But can you take a look at http://stackoverflow.com/questions/33792915/pandas-mean-calculation-over-a-column-in-a-csv. Thanks – liv2hak Nov 19 '15 at 01:36
  • @CTZhu This transforms the geopandas data frame to a pandas data frame, which may create problems (it did for me). – Duccio Piovani Oct 29 '17 at 16:30

The `duplicated` method works for DataFrames, Series, and Index objects. Just select the rows whose index isn't marked as a duplicate:

df[~df.index.duplicated()]
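A short sketch of how the `keep` argument changes which row survives (toy data, made up for illustration):

import pandas as pd

# toy DataFrame with a duplicated index label 'a' (made up for illustration)
df = pd.DataFrame({'val': [1, 2, 3]}, index=['a', 'a', 'b'])

# keep the first occurrence of each index label (the default)
print(df[~df.index.duplicated(keep='first')])
#    val
# a    1
# b    3

# keep the last occurrence instead
print(df[~df.index.duplicated(keep='last')])
#    val
# a    2
# b    3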
danielstn
  • This would drop all duplicates though? – The Unfun Cat Dec 01 '15 at 11:11
  • Note that this is the fastest method for the test cases I investigated: http://stackoverflow.com/questions/13035764/remove-rows-with-duplicate-indices-pandas-dataframe-and-timeseries/34297689#34297689 You can also reproduce the behavior of the accepted answer exactly using: `df[~df.index.duplicated(keep='first')]` – n8yoder Dec 15 '15 at 19:34
  • `keep` defaults to `'first'` anyway. – Jérôme May 04 '18 at 14:29

You can use `numpy.unique` to obtain the positions of the first occurrence of each index value, and then use `iloc` to select those rows:

>>> df
        val
A  0.021372
B  1.229482
D -1.571025
D -0.110083
C  0.547076
B -0.824754
A -1.378705
B -0.234095
C -1.559653
B -0.531421

[10 rows x 1 columns]

>>> idx = np.unique(df.index, return_index=True)[1]
>>> df.iloc[idx]
        val
A  0.021372
B  1.229482
C  0.547076
D -1.571025

[4 rows x 1 columns]
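One caveat: `np.unique` returns the first-occurrence positions in sorted label order, which is why the result above comes back ordered A, B, C, D rather than in the original row order. If the original order matters, sorting the positions first restores it (a small sketch reusing the same `df`):

import numpy as np

# positions of the first occurrence of each index label
idx = np.unique(df.index, return_index=True)[1]

# sort the positions so the surviving rows keep their original order (A, B, D, C here)
df.iloc[np.sort(idx)]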
behzad.nouri