Find duplicates in dataframe and keep only the highest ones

Question

I'm trying to find the duplicates in a dataframe for each group that are higher so I can remove these later from another dataframe based on the index so the main dataframe is left with no duplicates and only the lowest value.

Basically let's say we have this dataframe:

index   group   value
  1       1      402
  2       1      396
  3       2      406
  4       2      416
  5       2      407
  6       2      406
  7       1      200
  8       2      350

What I need is to only keep the duplicates in each group of consecutive duplicates that have the highest values and remove the lowest one. The group is 1 or 2 but there can be multiple instances of consecutive values in the same group. So the resulting dataframe would be:

index   group   value
  1       1      402
  4       2      416
  5       2      407

Speed is important too and there has to be no lookahead.

From the result you gave, it appears you only want to drop the lowest value in each group (if there are multiple values)? — Joshua Voskamp, Nov 10 '21 at 19:41
yes, it needs to maintain the index, in fact the index is the only thing that I'm interested in. And that's correct, I want to drop the lowest value in each group. — Alex Bejan, Nov 10 '21 at 19:43
why should 416 and 407 remain if they are both in the same group 2? — Алексей Р, Nov 10 '21 at 19:47
because they are higher than 406 which is the lowest value in the group — Alex Bejan, Nov 10 '21 at 19:49
Given OP's (albeit limited and confusing?) description, [this](https://stackoverflow.com/questions/40802800/how-to-groupby-consecutive-values-in-pandas-dataframe) seems relevant — Joshua Voskamp, Nov 10 '21 at 20:33
If @HarryPlotter and I are understanding the intent correctly, please edit the title of this question to more accurately reflect intent. — Joshua Voskamp, Nov 10 '21 at 20:40

Rodalm · Answer 1 · 2021-11-11T14:55:13.597

1

Use groupby + transform to mask the minimum values of each group. Then use the mask to select only the desired rows.

# map each consecutive group of rows to a different integer
group_labels = (df.group != df.group.shift()).cumsum()

# find the minimum value of each group 
group_min_val = df.groupby(group_labels)['value'].transform('min')

# get only the rows of each group whose value is higher than the minimum 
res = df[df.value != group_min_val]

>>> res

   index  group  value
0      1      1    402
3      4      2    416
4      5      2    407

Intermediate Results


>>> group_labels

0    1
1    1
2    2
3    2
4    2
5    2
6    3
7    4
Name: group, dtype: int64

>>> group_min_val

0    396
1    396
2    406
3    406
4    406
5    406
6    200
7    350
Name: value, dtype: int64

>>> df.value != group_min_val

0     True
1    False
2    False
3     True
4     True
5    False
6    False
7    False
Name: value, dtype: bool

edited Nov 11 '21 at 14:55

answered Nov 10 '21 at 19:49

Rodalm

5,169
5
21

the problem is that you could have multiple groups so this would group them all, what I need is the minimum in each group set. I've updated the dataframe – Alex Bejan Nov 10 '21 at 20:00
@AlexBejan what do you mean by 'group'? – Joshua Voskamp Nov 10 '21 at 20:40
1

@AlexBejan I have updated the answer. Is this what you are looking for? – Rodalm Nov 10 '21 at 20:41
@HarryPlotter instead of using `!=` consider `df1.ne(df2)` – Joshua Voskamp Nov 10 '21 at 20:43
1

@JoshuaVoskamp I'm aware of that, but honestly, I prefer this way, I think it's more readable! But it's just a matter of preference – Rodalm Nov 10 '21 at 20:45
1

excellent, working like a charm, thank you sir! – Alex Bejan Nov 11 '21 at 08:46
@AlexBejan you're welcome, I'm glad I could help! – Rodalm Nov 11 '21 at 11:35
@HarryPlotter actually is there a way to make this work if there are 2 equal minimum values? Looks like right now it keeps both so I end up with duplicates again :) – Alex Bejan Nov 11 '21 at 13:14
@AlexBejan Can you share an example of the data and your code in a [pastebin](https://pastebin.com/) (or something similar)? The code should remove all the minimum values of all groups, whether there is just a single or multiple minimums per group. – Rodalm Nov 11 '21 at 13:22
@HarryPlotter I've updated the example, see index 6 which is the same as 3 – Alex Bejan Nov 11 '21 at 13:52
@AlexBejan I don't understand what you mean. The solution works with the new example, I get the expected output. I've updated my answer with the new example, you can check it. Maybe you're doing something else that you aren't telling me. – Rodalm Nov 11 '21 at 14:58

Joshua Voskamp · Answer 2 · 2021-11-10T20:58:30.977

A one-liner version of @HarryPlotter's answer:

df.loc[df.value.ne(df.groupby(df.group.ne(df.group.shift()).cumsum()).value.transform('min'))]

using the trick from here to apply OP's understanding of "grouping", and transforming to get the minimum value in each group, and then .loc-ing for all values not equal to those.

WARNING: this drops any singleton "groups"! (Where OP's comments indicate "keep all but the lowest," this discards singleton values that would technically also be the 'highest' in their group.)

Алексей Р · Accepted Answer · 2021-11-11T13:09:56.113

1

This can be made easier with rank().
In this case, you need to decide how to deal with the same minimum values - delete one of them (method = 'first') or both (method = 'min'). Based on the condition "remove the lowest one" in the solution set method = 'first':

df = pd.DataFrame({'index': [1, 2, 3, 4, 5, 6, 7], 'group': [1, 1, 2, 2, 2, 1, 2],
                   'value': [402, 396, 406, 416, 407, 200, 350]}).set_index('index')
print('Source df:\n', df)
df = df[df.groupby(df.group.diff().ne(0).cumsum())['value'].rank(method='first').gt(1)]
print('\nResult df:\n', df)

Output:

Source df:
        group  value
index              
1          1    402
2          1    396
3          2    406
4          2    416
5          2    407
6          1    200
7          2    350

Result df:
        group  value
index              
1          1    402
4          2    416
5          2    407

edited Nov 11 '21 at 13:09

answered Nov 11 '21 at 04:59

Алексей Р

7,507
2
7
18

doesn't seem to be working, it still has all the values in the group – Alex Bejan Nov 11 '21 at 08:05
This is weird. It works for me, see full example and result above. Just in case, replaced `.ne(1)` with `.gt(1)` – Алексей Р Nov 11 '21 at 08:29
still doesn't work with my data, not sure why. I've accepted @HarryPlotter's solution which works, thanks any way – Alex Bejan Nov 11 '21 at 08:49
The reason for the difference from the expected result was that I did not read the clarification about sequential groups after updating the description. The code has been updated. – Алексей Р Nov 11 '21 at 13:14
Yep it's working now thanks! – Alex Bejan Nov 11 '21 at 14:00
do you know what would I need to change if I wanted to delete the maximum instead of the minimum? Is it ascending=False on rank? seems to be working – Alex Bejan Nov 11 '21 at 14:13
yes, this change is enough – Алексей Р Nov 11 '21 at 15:14

Find duplicates in dataframe and keep only the highest ones

3 Answers3