6

I am trying to remove consecutive duplicates from column X while keeping the entry with the max value based on column Y, unfortunately with no success. The data frame is as follow:

idx X Y
0 A 3
1 B 2
2 A 7
3 A 10
4 B 1
5 C 4
6 A 3
7 A 3

What I want to achieve is:

idx X Y
0 A 3
1 B 2
3 A 10
4 B 1
5 C 4
7 A 3

Most of the solutions I found just remove the duplicates tout court without accounting for any repeating pattern.

Please note that the duplicates might have the same value.

massigarg
  • 99
  • 7

4 Answers4

3

You need to apply an itertools-style-groupby and then keep the rows where Y is maximal.

>>> df 
   idx  X   Y
0    0  A   3
1    1  B   2
2    2  A   7
3    3  A  10
4    4  B   1
5    5  C   4
6    6  A   3
7    7  A   5
>>> y_max = df.groupby(df['X'].ne(df['X'].shift()).cumsum())['Y'].transform('max')
>>> df[df['Y'] == y_max] 
   idx  X   Y
0    0  A   3
1    1  B   2
3    3  A  10
4    4  B   1
5    5  C   4
7    7  A   5

edit:

Initial solution had a bug and only produced the correct idx column by accident.

edit 2:

If you only want to keep one row per group, you can use

>>> y_idxmax = df.groupby(df['X'].ne(df['X'].shift()).cumsum())['Y'].idxmax()
>>> df.loc[y_idxmax] 
   idx  X   Y
0    0  A   3
1    1  B   2
3    3  A  10
4    4  B   1
5    5  C   4
7    7  A   5

Credit goes to Ch3steR for this one.

timgeb
  • 76,762
  • 20
  • 123
  • 145
  • Hi, I forgot to add that the duplicates might have the same value. Your first solution was producing the right amount of values on my full dataset. – massigarg Dec 22 '21 at 12:53
  • @massigarg yes, but the first solution can produce a wrong idx column. To clarify, you only want to keep the first row per group where Y is maximal? – timgeb Dec 22 '21 at 13:01
  • Yes correct., something like max(Yi, Yi+1) – massigarg Dec 22 '21 at 13:06
  • 2
    @timgeb Or you could use [`GroupBy.idxmax`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.idxmax.html) here. Maybe something like this `idx = df.groupby(itertools-style-grouping)['Y'].idxmax(); df.loc[idx, :] # or df.reindex(idx)` – Ch3steR Dec 22 '21 at 14:13
  • @massigarg see Ch3steR's comment – timgeb Dec 22 '21 at 14:24
  • @Ch3steR cool, that will only keep one row per group where Y is maximal. – timgeb Dec 22 '21 at 14:24
1

Or I'd prefer just simply only specify the groups in the groupby parameters:

df.groupby(df['X'].ne(df['X'].shift()).cumsum(), as_index=False).max()

Or:

df.groupby(df['X'].ne(df['X'].shift()).cumsum()).max().reset_index(drop=True)

Both output:

   idx  X   Y
0    0  A   3
1    1  B   2
2    3  A  10
3    4  B   1
4    5  C   4
5    7  A   5
U13-Forward
  • 69,221
  • 14
  • 89
  • 114
  • 1
    This produces the right idx column "by accident", because coincidentally the maximum Y value is always at the highest idx value (per group). If you switch the 7 and the 10 in the original dataframe the result will be wrong. – timgeb Dec 22 '21 at 12:21
0

Create a column which heaps consecutives into one group

   df['temp']=(~(df['X']==df['X'].shift())|(df['X'].shift(-1)==df['X'])).cumsum()

groupby the group of consecutives and filter out wherevalues of Y are equal to maximum in each group. Drop the temp column

df[df.groupby('temp')['Y'].transform(lambda x:(x==x.max()))].drop(columns=['temp'])

A neater way is not to create column but save the groups of consecutives into a variable and groupby the variable as follows

s=(~(df['X']==df['X'].shift())|(df['X'].shift(-1)==df['X'])).cumsum()
print(df[df.groupby(s)['Y'].transform(lambda x:(x==x.max()))])

    idx  X   Y
0    0  A   3
1    1  B   2
3    3  A  10
4    4  B   1
5    5  C   4
7    7  A   5
wwnde
  • 26,119
  • 6
  • 18
  • 32
0

I couldn't figure out the already given answers right away, so I wrote a simple script to do the same. It takes the indices that have duplicate values and drop them by comparing two at a time.

Check the code below -

import pandas as pd

data = {'X':['A', 'B', 'A', 'A', 'A', 'B', 'C', 'A', 'A'], 
        'Y': [3, 2, 12, 7, 10, 1, 4, 3, 5]}
data = pd.DataFrame(data)

mask = data['X'] == data['X'].shift()

to_check = data.loc[mask].index.tolist()

for i, _ in enumerate(to_check):
    index = to_check[i]
    if data.iloc[index]['Y'] > data.iloc[index - 1]['Y']:
        data.drop(index - 1, axis=0, inplace=True)
        data.reset_index(inplace=True, drop=True)
    else:
        data.drop(index, axis=0, inplace=True)
        data.reset_index(inplace=True, drop=True)

    to_check = [value - 1 for value in to_check]

print(data)

# OUTPUT

   X   Y
0  A   3
1  B   2
2  A  12
3  B   1
4  C   4
5  A   5