449

How to remove rows with duplicate index values?

I'm reading some automated weather data from the web (observations occur every 5 minutes and are compiled into monthly files for each weather station). Sometimes a scientist goes back and corrects observations -- not by editing the erroneous rows, but by appending duplicate rows to the end of a file. After parsing a file, the DataFrame looks like:

                      Sta  Precip1hr  Precip5min  Temp  DewPnt  WindSpd  WindDir  AtmPress
Date                                                                                      
2001-01-01 00:00:00  KPDX          0           0     4       3        0        0     30.31
2001-01-01 00:05:00  KPDX          0           0     4       3        0        0     30.30
2001-01-01 00:10:00  KPDX          0           0     4       3        4       80     30.30
2001-01-01 00:15:00  KPDX          0           0     3       2        5       90     30.30
2001-01-01 00:20:00  KPDX          0           0     3       2       10      110     30.28

Example of a duplicate case:

import pandas as pd
import datetime

startdate = datetime.datetime(2001, 1, 1, 0, 0)
enddate = datetime.datetime(2001, 1, 1, 5, 0)
index = pd.date_range(start=startdate, end=enddate, freq='H')  # use freq='h' on pandas >= 2.2
data1 = {'A': range(6), 'B': range(6)}
data2 = {'A': [20, -30, 40], 'B': [-50, 60, -70]}
df1 = pd.DataFrame(data=data1, index=index)
df2 = pd.DataFrame(data=data2, index=index[:3])
df3 = pd.concat([df2, df1])  # formerly df2.append(df1); DataFrame.append was removed in pandas 2.0

df3
                       A   B
2001-01-01 00:00:00   20 -50
2001-01-01 01:00:00  -30  60
2001-01-01 02:00:00   40 -70
2001-01-01 03:00:00    3   3
2001-01-01 04:00:00    4   4
2001-01-01 05:00:00    5   5
2001-01-01 00:00:00    0   0
2001-01-01 01:00:00    1   1
2001-01-01 02:00:00    2   2

And so I need df3 to eventually become:

                       A   B
2001-01-01 00:00:00    0   0
2001-01-01 01:00:00    1   1
2001-01-01 02:00:00    2   2
2001-01-01 03:00:00    3   3
2001-01-01 04:00:00    4   4
2001-01-01 05:00:00    5   5

I thought that adding a column of row numbers (df3['rownum'] = range(df3.shape[0])) would help me select the bottom-most row for any value of the DatetimeIndex, but I am stuck on figuring out the groupby or pivot (or ???) statements to make that work.
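
For reference, here is a minimal sketch of how that rownum idea could be completed with groupby (the answers below show simpler approaches):

df3['rownum'] = range(len(df3))
last_rows = df3.groupby(level=0)['rownum'].max()  # bottom-most row number per timestamp
df3 = (df3[df3['rownum'].isin(last_rows)]
       .drop(columns='rownum')
       .sort_index())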

– Paul H
  • Another way of getting duplicates is hourly data in the night when clocks are set back for daylight saving time: 1 AM, 2, 3, 2, 3 again, 4 ... – denis Aug 28 '17 at 09:42
  • When you say *"remove duplicates"*, your context here implicitly is "keep the first", i.e. `drop_duplicates(keep='first')`. (That's not always the case; sometimes it's harder to figure out from the other fields which row should be kept, or to merge multiples, filling NAs from various rows.) – smci Jan 21 '21 at 19:54

7 Answers

827

I would suggest using the duplicated method on the Pandas Index itself:

df3 = df3[~df3.index.duplicated(keep='first')]

While all the other methods work, .drop_duplicates is by far the least performant for the provided example. Furthermore, while the groupby method is only slightly less performant, I find the duplicated method to be more readable.

Using the sample data provided:

>>> %timeit df3.reset_index().drop_duplicates(subset='index', keep='first').set_index('index')
1000 loops, best of 3: 1.54 ms per loop

>>> %timeit df3.groupby(df3.index).first()
1000 loops, best of 3: 580 µs per loop

>>> %timeit df3[~df3.index.duplicated(keep='first')]
1000 loops, best of 3: 307 µs per loop

Note that you can keep the last element by changing the keep argument to 'last'.
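
For example:

df3 = df3[~df3.index.duplicated(keep='last')]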

It should also be noted that this method works with MultiIndex as well (using df1 as specified in Paul's example):

>>> %timeit df1.groupby(level=df1.index.names).last()
1000 loops, best of 3: 771 µs per loop

>>> %timeit df1[~df1.index.duplicated(keep='last')]
1000 loops, best of 3: 365 µs per loop
– n8yoder
  • `loc` might not be necessary. Simply do `df3 = df3[~df3.index.duplicated(keep='first')]`, which will drop all rows with a duplicate index except the first occurrence. – lingjiankong Sep 16 '19 at 18:30
  • would it make sense to use this for very large time-series where the duplicates are usually just the first or last values? – cheesus Oct 22 '19 at 17:12
  • what does ~ do in `df3 = df3.loc[~df3.index.duplicated(keep='first')]`, if anyone does not mind answering? – jsl5703 Feb 27 '20 at 01:38
  • @jsl5703 It reverses the mask. So it turns everything that was True False and vice versa. In this case, that means that we will select out the ones that are not duplicated according to the method. – n8yoder Feb 27 '20 at 02:35
  • What witchcraft is this, using a bitwise (~) operation on a DataFrame and it even works. Have an upvote and a comment, because an upvote does not seem enough. Added bonus for %timeit. – Harper Sep 21 '20 at 13:30
  • When you don't know whether you have duplicate indices or not and you still want to use the `groupby` method, it will fail. Use the `duplicated` method. See my colab notebook https://colab.research.google.com/drive/1ago2DnE7YgJX18j2ESMXOUwW0j6DToAQ?usp=sharing that reproduces my statement. – Astra Uvarova - Saturn's star Nov 20 '20 at 20:24
  • For those who prefer "chaining": `df3.query("~index.duplicated(keep='first')")` – MikeGM Mar 19 '21 at 11:47
  • I don't understand why `df = df[df.index.drop_duplicates()]` doesn't work but your suggestion works. Can you explain? – Jayen Oct 03 '21 at 02:23
  • The `keep='first'` parameter is, strangely but practically, sensible only in the presence of the witchcraft not operator ~ – user3673 Dec 24 '21 at 15:03
  • @Jayen `Index.drop_duplicates` returns an `Index`, which cannot be used to slice the `DataFrame`. Also `reindex` cannot be used on a `DataFrame` that still contains duplicate index values. – Stefan_EOX Jan 17 '23 at 16:01
  • @Jayen `Index.duplicated` returns a boolean array (`numpy.ndarray`) the same length as the original index, which can be used as a mask; as @Stefan_EOX pointed out, `Index.drop_duplicates` returns an `Index`. The `Index` contains index labels, not boolean values, and might be shorter in length than the original. Selecting based on index labels rather than a boolean mask will include _all_ of a label's matches from the original data frame, re-introducing duplicates (unless you used `keep=False` to get the `Index`) and defeating the purpose. – Attila the Fun Feb 06 '23 at 19:05
139

This adds the index as a DataFrame column, drops duplicates on that column, then restores it as the index:

df = (df.reset_index()
        .drop_duplicates(subset='index', keep='last')
        .set_index('index').sort_index())

Note that the final .sort_index() call is optional; include it only if you need the result sorted.
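
As the comments below point out, reset_index() only labels the new column 'index' when the index is unnamed (a named index keeps its own name, and a MultiIndex produces one column per level). A slightly more defensive sketch for a single, possibly named, index:

col = df.index.name or 'index'  # reset_index() uses the index's own name if it has one
df = (df.reset_index()
        .drop_duplicates(subset=col, keep='last')
        .set_index(col))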

– D. A.
  • Another variation on this is: `df.reset_index().drop_duplicates(cols='index',take_last=True).set_index('index')` – Luciano Feb 20 '14 at 10:35
  • While this method does work, it also creates two temporary copies of the DataFrame and is significantly less performant than using either the duplicated-index or groupby methods suggested as alternative answers. – n8yoder Dec 19 '15 at 21:26
  • If your index is a MultiIndex, `reset_index()` adds columns level_0, level_1, etc. And if your index has a name, that name will be used in place of the "index" label. That makes this a bit more than a one-liner to do it right for any DataFrame. `index_label = getattr(df.index, 'names', getattr(df.index, 'name', 'index'))` then `cols=index_label` then `set_index(index_labels)`, and even this isn't foolproof (won't work for unnamed multiindexes). – hobs May 23 '16 at 22:35
  • Moving the index to a column, clearing duplicates, and resetting the index was awesome; that was exactly what I needed! – Sienna Dec 27 '16 at 19:03
  • Given `idx = df.index.name or 'index'`, one could also do `df2 = df.reset_index(); df2.drop_duplicates(idx, inplace=True); df2.set_index(idx, inplace=True)` to avoid the intermediate copies (due to the `inplace=True`) – Anakhand May 27 '19 at 16:16
85

Oh my. This is actually so simple!

grouped = df3.groupby(level=0)
df4 = grouped.last()
df4
                      A   B  rownum
2001-01-01 00:00:00   0   0       6
2001-01-01 01:00:00   1   1       7
2001-01-01 02:00:00   2   2       8
2001-01-01 03:00:00   3   3       3
2001-01-01 04:00:00   4   4       4
2001-01-01 05:00:00   5   5       5

Follow-up edit 2013-10-29: In the case where I have a fairly complex MultiIndex, I think I prefer the groupby approach. Here's a simple example for posterity:

import numpy as np
import pandas

# fake index
idx = pandas.MultiIndex.from_tuples([('a', letter) for letter in list('abcde')])

# random data + naming the index levels
df1 = pandas.DataFrame(np.random.normal(size=(5,2)), index=idx, columns=['colA', 'colB'])
df1.index.names = ['iA', 'iB']

# artificially append some duplicate data (DataFrame.append and DataFrame.select
# were removed from modern pandas, so use concat with a boolean mask instead)
dups = df1[df1.index.get_level_values('iB').isin(['c', 'e'])]
df1 = pandas.concat([df1, dups])
df1
#           colA      colB
#iA iB                    
#a  a  -1.297535  0.691787
#   b  -1.688411  0.404430
#   c   0.275806 -0.078871
#   d  -0.509815 -0.220326
#   e  -0.066680  0.607233
#   c   0.275806 -0.078871  # <--- dup 1
#   e  -0.066680  0.607233  # <--- dup 2

and here's the important part

# group the data, using df1.index.names tells pandas to look at the entire index
groups = df1.groupby(level=df1.index.names)  
groups.last() # or .first()
#           colA      colB
#iA iB                    
#a  a  -1.297535  0.691787
#   b  -1.688411  0.404430
#   c   0.275806 -0.078871
#   d  -0.509815 -0.220326
#   e  -0.066680  0.607233
– Paul H
  • if they have names; otherwise (if one name is None), let's say `level=[0,1]` will work if there are 2 levels: `df1.groupby(level=[0,1]).last()`. This should be part of Pandas as a complement to `drop_duplicates` – dashesy Apr 12 '15 at 18:38
  • @dashesy yeah. Using `df.index.names` is just an easy way to group by all levels of the index. – Paul H Apr 12 '15 at 18:45
  • Great solution, thank you! I will also add that this works in `xarray` for dealing with duplicate DateTime indices that make `ds.resample` and `ds.groupby` operations fail – drg May 15 '18 at 23:26
  • Amendment to my earlier comment: it works in `xarray` as long as you change the `grouped = df3.groupby(level=0)` to `grouped = df3.groupby(dim='time')` or whatever the dimension is that contains duplicates – drg May 15 '18 at 23:34
7

Remove duplicates (keeping first):

import numpy as np

# np.unique returns the positions of the first occurrence of each unique
# index value (in sorted order, so the result is also sorted by index)
idx = np.unique(df.index.values, return_index=True)[1]
df = df.iloc[idx]

Remove duplicates (keeping last):

df = df[::-1]  # reverse so each index value's last occurrence comes first
df = df.iloc[np.unique(df.index.values, return_index=True)[1]]

Tests: 10k loops using OP's data

numpy method - 3.03 seconds
df.loc[~df.index.duplicated(keep='first')] - 4.43 seconds
df.groupby(df.index).first() - 21 seconds
reset_index() method - 29 seconds
4

Unfortunately, I don't think Pandas allows one to drop duplicates off the indices directly. I would suggest the following:

df3 = df3.reset_index()  # makes the date column part of your data
df3.columns = ['timestamp', 'A', 'B', 'rownum']  # set names
df3 = df3.drop_duplicates('timestamp', keep='last').set_index('timestamp')  # done!
# (keep='last' replaces the old take_last=True argument, removed from modern pandas)
– user128754
3

If, like me, you like chainable data manipulation using pandas dot notation (like piping), then the following may be useful:

df3 = df3.query('~index.duplicated()')

This enables chaining statements like this:

df3.assign(C=2).query('~index.duplicated()').mean()
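
If this raises the TypeError mentioned in the comments below on an older pandas version, forcing the Python engine via query's engine parameter may help (an assumption worth verifying against your version):

df3 = df3.query('~index.duplicated()', engine='python')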
– bbiegel
  • I tried this but could not get it to work.. I get an error like this: `TypeError: 'Series' objects are mutable, thus they cannot be hashed`.. Did this actually work for you? – Onno Eberhard Sep 03 '19 at 13:45
  • It worked for me, using Pandas 1.2.2. I had to change the construction of the index, since `.DateTimeIndex` does not support `start` and `end` arguments anymore, but I used `.date_range` instead. – VirtualScooter Feb 28 '21 at 16:53
0

I ran into this same error, and after digging into every DataFrame it turned out that one of them had two columns with the same name. You mention that you drop some columns, so that could be the reason.

– Berny