Remove neighbouring duplicates in a sorted DataFrame

Question

Removing neighbouring duplicates has been discussed here before, but only in terms of direct neighbours (one row above/below).

I have the following dataframe:

import pandas as pd

df = pd.DataFrame(data={"item_no": [11, 4, 4, 4, 7, 8, 7, 11, 11, 5, 5, 6, 4], "time": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]})

df:

    item_no time
0   11      1
1   4       2
2   4       3
3   4       4
4   7       5
5   8       6
6   7       7
7   11      8
8   11      9
9   5       10
10  5       11
11  6       12
12  4       13

where it is sorted by the time column (imagine it as a time series). I need to remove the neighbouring duplicates in the item_no column, keeping only the first entry of each run.

Expected output:

    item_no time
0   11      1
1   4       2
2   7       5
3   8       6
4   7       7
5   11      8
6   5       10
7   6       12
8   4       13

As can be seen, an arbitrary number of neighbouring duplicates should be removable. I know I can iterate row by row and check whether the previous item_no is the same, but I am looking for an efficient solution, since this will be applied to millions of rows.

@Linden This is not just a drop duplicate. It's dropping consecutive duplicates. — NYC Coder, Oct 22 '20 at 12:28

score 5 · Accepted Answer · answered Oct 22 '20 at 12:21

Please try:

df[df.item_no!=df.item_no.shift(1)]



   item_no  time
0        11     1
1         4     2
4         7     5
5         8     6
6         7     7
7        11     8
9         5    10
11        6    12
12        4    13
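
Note that the filtered frame keeps the original index labels (0, 1, 4, ...), whereas the expected output in the question is numbered 0 to 8. A minimal sketch, assuming a fresh index is wanted, chains reset_index(drop=True) onto the same filter. The name result is only illustrative and not part of the answer above:

```python
import pandas as pd

df = pd.DataFrame(data={"item_no": [11, 4, 4, 4, 7, 8, 7, 11, 11, 5, 5, 6, 4],
                        "time": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]})

# Keep rows whose item_no differs from the previous row, then renumber the index.
# `result` is an illustrative name, not taken from the original answer.
result = df[df["item_no"] != df["item_no"].shift(1)].reset_index(drop=True)
print(result)
```

Dropping the old index with drop=True avoids carrying it along as an extra column.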

answered Oct 22 '20 at 12:21

wwnde

score 0 · Answer 2 · answered Oct 22 '20 at 12:21

You can use shift to detect adjacent entries that have not changed. From there it is straightforward:

import pandas as pd

df = pd.DataFrame(data={"item_no": [11, 4, 4, 4, 7, 8, 7, 11, 11, 5, 5, 6, 4], "time": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]})

ind = df['item_no']==df['item_no'].shift()
df = df.loc[~ind]
print(df)

    item_no  time
0        11     1
1         4     2
4         7     5
5         8     6
6         7     7
7        11     8
9         5    10
11        6    12
12        4    13
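
To see why the comparison with shift works, it can help to lay the intermediate values side by side. The sketch below is only for inspection; the frame steps and its column names previous and is_repeat are hypothetical helpers, not part of the answer:

```python
import pandas as pd

df = pd.DataFrame(data={"item_no": [11, 4, 4, 4, 7, 8, 7, 11, 11, 5, 5, 6, 4],
                        "time": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]})

# Hypothetical helper frame, built only to inspect the intermediate values.
steps = pd.DataFrame({
    "item_no": df["item_no"],
    # Each value moved down one row; the first row has no predecessor, so it is NaN.
    "previous": df["item_no"].shift(),
    # True where a row repeats the value directly above it.
    "is_repeat": df["item_no"] == df["item_no"].shift(),
})
print(steps)
```

Because a comparison against NaN is False, the first row is never flagged as a repeat and is always kept by ~ind.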

score 0 · Answer 3 · answered Oct 22 '20 at 12:24

Try using shift:

df = df[df.shift(1) != df].dropna()
print(df)

   item_no  time
0       11     1
1        4     2
4        7     5
5        8     6
6        7     7
7       11     8
9        5    10
11       6    12
12       4    13

answered Oct 22 '20 at 12:24

NYC Coder
