Remove consecutive duplicates while keeping the max value

Question

I am trying to remove consecutive duplicates from column X while keeping the entry with the max value based on column Y, unfortunately with no success. The data frame is as follow:

idx	X	Y
0	A	3
1	B	2
2	A	7
3	A	10
4	B	1
5	C	4
6	A	3
7	A	3

What I want to achieve is:

idx	X	Y
0	A	3
1	B	2
3	A	10
4	B	1
5	C	4
7	A	3

Most of the solutions I found just remove the duplicates tout court without accounting for any repeating pattern.

Please note that the duplicates might have the same value.

timgeb · Answer 1 · 2021-12-22T14:41:43.260

3

You need to apply an itertools-style-groupby and then keep the rows where Y is maximal.

>>> df 
   idx  X   Y
0    0  A   3
1    1  B   2
2    2  A   7
3    3  A  10
4    4  B   1
5    5  C   4
6    6  A   3
7    7  A   5
>>> y_max = df.groupby(df['X'].ne(df['X'].shift()).cumsum())['Y'].transform('max')
>>> df[df['Y'] == y_max] 
   idx  X   Y
0    0  A   3
1    1  B   2
3    3  A  10
4    4  B   1
5    5  C   4
7    7  A   5

edit:

Initial solution had a bug and only produced the correct idx column by accident.

edit 2:

If you only want to keep one row per group, you can use

>>> y_idxmax = df.groupby(df['X'].ne(df['X'].shift()).cumsum())['Y'].idxmax()
>>> df.loc[y_idxmax] 
   idx  X   Y
0    0  A   3
1    1  B   2
3    3  A  10
4    4  B   1
5    5  C   4
7    7  A   5

Credit goes to Ch3steR for this one.

edited Dec 22 '21 at 14:41

answered Dec 22 '21 at 11:50

timgeb

76,762
20
123
145

Hi, I forgot to add that the duplicates might have the same value. Your first solution was producing the right amount of values on my full dataset. – massigarg Dec 22 '21 at 12:53
@massigarg yes, but the first solution can produce a wrong idx column. To clarify, you only want to keep the first row per group where Y is maximal? – timgeb Dec 22 '21 at 13:01
Yes correct., something like max(Yi, Yi+1) – massigarg Dec 22 '21 at 13:06
2

@timgeb Or you could use [`GroupBy.idxmax`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.idxmax.html) here. Maybe something like this `idx = df.groupby(itertools-style-grouping)['Y'].idxmax(); df.loc[idx, :] # or df.reindex(idx)` – Ch3steR Dec 22 '21 at 14:13
@massigarg see Ch3steR's comment – timgeb Dec 22 '21 at 14:24
@Ch3steR cool, that will only keep one row per group where Y is maximal. – timgeb Dec 22 '21 at 14:24

score 1 · Answer 2 · answered Dec 22 '21 at 12:02

1

Or I'd prefer just simply only specify the groups in the groupby parameters:

df.groupby(df['X'].ne(df['X'].shift()).cumsum(), as_index=False).max()

Or:

df.groupby(df['X'].ne(df['X'].shift()).cumsum()).max().reset_index(drop=True)

Both output:

   idx  X   Y
0    0  A   3
1    1  B   2
2    3  A  10
3    4  B   1
4    5  C   4
5    7  A   5

answered Dec 22 '21 at 12:02

U13-Forward

69,221
14
89
114

1

This produces the right idx column "by accident", because coincidentally the maximum Y value is always at the highest idx value (per group). If you switch the 7 and the 10 in the original dataframe the result will be wrong. – timgeb Dec 22 '21 at 12:21

wwnde · Answer 3 · 2021-12-22T12:31:25.360

Create a column which heaps consecutives into one group

   df['temp']=(~(df['X']==df['X'].shift())|(df['X'].shift(-1)==df['X'])).cumsum()

groupby the group of consecutives and filter out wherevalues of Y are equal to maximum in each group. Drop the temp column

df[df.groupby('temp')['Y'].transform(lambda x:(x==x.max()))].drop(columns=['temp'])

A neater way is not to create column but save the groups of consecutives into a variable and groupby the variable as follows

s=(~(df['X']==df['X'].shift())|(df['X'].shift(-1)==df['X'])).cumsum()
print(df[df.groupby(s)['Y'].transform(lambda x:(x==x.max()))])

    idx  X   Y
0    0  A   3
1    1  B   2
3    3  A  10
4    4  B   1
5    5  C   4
7    7  A   5

score 0 · Answer 4 · answered Dec 22 '21 at 13:01

I couldn't figure out the already given answers right away, so I wrote a simple script to do the same. It takes the indices that have duplicate values and drop them by comparing two at a time.

Check the code below -

import pandas as pd

data = {'X':['A', 'B', 'A', 'A', 'A', 'B', 'C', 'A', 'A'], 
        'Y': [3, 2, 12, 7, 10, 1, 4, 3, 5]}
data = pd.DataFrame(data)

mask = data['X'] == data['X'].shift()

to_check = data.loc[mask].index.tolist()

for i, _ in enumerate(to_check):
    index = to_check[i]
    if data.iloc[index]['Y'] > data.iloc[index - 1]['Y']:
        data.drop(index - 1, axis=0, inplace=True)
        data.reset_index(inplace=True, drop=True)
    else:
        data.drop(index, axis=0, inplace=True)
        data.reset_index(inplace=True, drop=True)

    to_check = [value - 1 for value in to_check]

print(data)

# OUTPUT

   X   Y
0  A   3
1  B   2
2  A  12
3  B   1
4  C   4
5  A   5

Remove consecutive duplicates while keeping the max value

4 Answers4