0

I'm trying to find the duplicates in a dataframe for each group that are higher so I can remove these later from another dataframe based on the index so the main dataframe is left with no duplicates and only the lowest value.

Basically let's say we have this dataframe:

index   group   value
  1       1      402
  2       1      396
  3       2      406
  4       2      416
  5       2      407
  6       2      406
  7       1      200
  8       2      350

What I need is to only keep the duplicates in each group of consecutive duplicates that have the highest values and remove the lowest one. The group is 1 or 2 but there can be multiple instances of consecutive values in the same group. So the resulting dataframe would be:

index   group   value
  1       1      402
  4       2      416
  5       2      407

Speed is important too and there has to be no lookahead.

Alex Bejan
  • 21
  • 6

3 Answers3

1

Use groupby + transform to mask the minimum values of each group. Then use the mask to select only the desired rows.

# map each consecutive group of rows to a different integer
group_labels = (df.group != df.group.shift()).cumsum()

# find the minimum value of each group 
group_min_val = df.groupby(group_labels)['value'].transform('min')

# get only the rows of each group whose value is higher than the minimum 
res = df[df.value != group_min_val]

>>> res

   index  group  value
0      1      1    402
3      4      2    416
4      5      2    407

Intermediate Results


>>> group_labels

0    1
1    1
2    2
3    2
4    2
5    2
6    3
7    4
Name: group, dtype: int64

>>> group_min_val

0    396
1    396
2    406
3    406
4    406
5    406
6    200
7    350
Name: value, dtype: int64

>>> df.value != group_min_val

0     True
1    False
2    False
3     True
4     True
5    False
6    False
7    False
Name: value, dtype: bool
Rodalm
  • 5,169
  • 5
  • 21
  • the problem is that you could have multiple groups so this would group them all, what I need is the minimum in each group set. I've updated the dataframe – Alex Bejan Nov 10 '21 at 20:00
  • @AlexBejan what do you mean by 'group'? – Joshua Voskamp Nov 10 '21 at 20:40
  • 1
    @AlexBejan I have updated the answer. Is this what you are looking for? – Rodalm Nov 10 '21 at 20:41
  • @HarryPlotter instead of using `!=` consider `df1.ne(df2)` – Joshua Voskamp Nov 10 '21 at 20:43
  • 1
    @JoshuaVoskamp I'm aware of that, but honestly, I prefer this way, I think it's more readable! But it's just a matter of preference – Rodalm Nov 10 '21 at 20:45
  • 1
    excellent, working like a charm, thank you sir! – Alex Bejan Nov 11 '21 at 08:46
  • @AlexBejan you're welcome, I'm glad I could help! – Rodalm Nov 11 '21 at 11:35
  • @HarryPlotter actually is there a way to make this work if there are 2 equal minimum values? Looks like right now it keeps both so I end up with duplicates again :) – Alex Bejan Nov 11 '21 at 13:14
  • @AlexBejan Can you share an example of the data and your code in a [pastebin](https://pastebin.com/) (or something similar)? The code should remove all the minimum values of all groups, whether there is just a single or multiple minimums per group. – Rodalm Nov 11 '21 at 13:22
  • @HarryPlotter I've updated the example, see index 6 which is the same as 3 – Alex Bejan Nov 11 '21 at 13:52
  • @AlexBejan I don't understand what you mean. The solution works with the new example, I get the expected output. I've updated my answer with the new example, you can check it. Maybe you're doing something else that you aren't telling me. – Rodalm Nov 11 '21 at 14:58
1

A one-liner version of @HarryPlotter's answer:

df.loc[df.value.ne(df.groupby(df.group.ne(df.group.shift()).cumsum()).value.transform('min'))]

using the trick from here to apply OP's understanding of "grouping", and transforming to get the minimum value in each group, and then .loc-ing for all values not equal to those.

WARNING: this drops any singleton "groups"! (Where OP's comments indicate "keep all but the lowest," this discards singleton values that would technically also be the 'highest' in their group.)

Joshua Voskamp
  • 1,855
  • 1
  • 10
  • 13
1

This can be made easier with rank().
In this case, you need to decide how to deal with the same minimum values - delete one of them (method = 'first') or both (method = 'min'). Based on the condition "remove the lowest one" in the solution set method = 'first':

df = pd.DataFrame({'index': [1, 2, 3, 4, 5, 6, 7], 'group': [1, 1, 2, 2, 2, 1, 2],
                   'value': [402, 396, 406, 416, 407, 200, 350]}).set_index('index')
print('Source df:\n', df)
df = df[df.groupby(df.group.diff().ne(0).cumsum())['value'].rank(method='first').gt(1)]
print('\nResult df:\n', df)

Output:

Source df:
        group  value
index              
1          1    402
2          1    396
3          2    406
4          2    416
5          2    407
6          1    200
7          2    350

Result df:
        group  value
index              
1          1    402
4          2    416
5          2    407
Алексей Р
  • 7,507
  • 2
  • 7
  • 18