2

I am trying to use polars to do a window aggregate over one value, but map it back to another.

For example, if i wanted to get the name of the max value in a group, instead of (or in combination to) just the max value.

assuming an input of something like this.

|label|name|value|
|a.  | foo| 1.   |
|a.  | bar| 2.   |
|b.  | baz| 1.5. |
|b.  | boo| -1   |
# 'max_by' is not a real method, just using it to express what i'm trying to achieve. 
df.select(col('label'), col('name').max_by('value').over('label'))

i want an output like this

|label|name|
|a.   | bar|
|b.   | baz|

ideally with the value too. But i know i can easily add that in via col('value').max().over('label').

|label|name|value|
|a.   | bar| 2.  |
|b.   | baz| 1.5.|
Cory Grinstead
  • 511
  • 3
  • 16

2 Answers2

4

You were close. There is a sort_by expression that can be used.

df.groupby('label').agg(pl.all().sort_by('value').last())
shape: (2, 3)
┌───────┬──────┬───────┐
│ label ┆ name ┆ value │
│ ---   ┆ ---  ┆ ---   │
│ str   ┆ str  ┆ f64   │
╞═══════╪══════╪═══════╡
│ a.    ┆ bar  ┆ 2.0   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b.    ┆ baz  ┆ 1.5   │
└───────┴──────┴───────┘

If you need a windowed version of this:

df.with_columns([
    pl.col(['name','value']).sort_by('value').last().over('label').suffix("_max")
])
shape: (4, 5)
┌───────┬──────┬───────┬──────────┬───────────┐
│ label ┆ name ┆ value ┆ name_max ┆ value_max │
│ ---   ┆ ---  ┆ ---   ┆ ---      ┆ ---       │
│ str   ┆ str  ┆ f64   ┆ str      ┆ f64       │
╞═══════╪══════╪═══════╪══════════╪═══════════╡
│ a.    ┆ foo  ┆ 1.0   ┆ bar      ┆ 2.0       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ a.    ┆ bar  ┆ 2.0   ┆ bar      ┆ 2.0       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ b.    ┆ baz  ┆ 1.5   ┆ baz      ┆ 1.5       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ b.    ┆ boo  ┆ -1.0  ┆ baz      ┆ 1.5       │
└───────┴──────┴───────┴──────────┴───────────┘
0

You can do it using DataFrame.filter

import polars as pl

df = pl.DataFrame({
    'label': ['a.', 'a.', 'b.', 'b.'],
    'name': ['foo', 'bar', 'baz', 'boo'],
    'value': [1, 2, 1.5, -1]
})

res = df.filter(
    pl.col('value') == pl.max('value').over('label')
)

Output:

>>> res

shape: (2, 3)
┌───────┬──────┬───────┐
│ label ┆ name ┆ value │
│ ---   ┆ ---  ┆ ---   │
│ str   ┆ str  ┆ f64   │
╞═══════╪══════╪═══════╡
│ a.    ┆ bar  ┆ 2.0   │
│ b.    ┆ baz  ┆ 1.5   │
└───────┴──────┴───────┘
Rodalm
  • 5,169
  • 5
  • 21