
The code in the user guide is as follows:

import polars as pl

def get_person() -> pl.Expr:
    return pl.col("first_name") + pl.lit(" ") + pl.col("last_name")

q = (
    dataset.lazy()
    .sort("birthday")
    .groupby(["state"])
    .agg(
        [
            get_person().first().alias("youngest"),
            get_person().last().alias("oldest"),
        ]
    )
    .limit(5)
)

df = q.collect()
df

1. Might sort().groupby() actually execute the groupby first and then the sort, similar to pandas?
The answer by @tvashtar to this question provides some tips.

lemmingxuan
  • I think you should ask one question per SO post. Can you split up this post in multiple questions? – ritchie46 May 08 '22 at 06:44
  • OK, I have edited the question. – lemmingxuan May 08 '22 at 06:46
  • I don't think pandas first does the `groupby`. Pandas does not reorder operations. – ritchie46 May 08 '22 at 07:32
  • I removed that content. The osf is a good place to ask questions, but it seems to be focused on a single question per post, so maybe it's not a good channel for continued discussion. My original question has been answered, although subsequent study has made me think there may be a small problem with the original answer (thanks for his hard work), but it seems more difficult to quickly contact and communicate with the original author. – lemmingxuan May 10 '22 at 00:50

1 Answer


The logical order of a polars query is the order you read it from top to bottom.

q = (
    dataset.lazy()
    .sort("birthday")
    .groupby(["state"])
    .agg(
        [
            get_person().first().alias("youngest"),
            get_person().last().alias("oldest"),
        ]
    )
    .limit(5)
)

This snippet has the following order of operations: sort -> groupby/agg -> limit.

Note that polars may choose to execute the query in a different order IFF the outcome is the same. This might be done for performance reasons.

1. Might sort().groupby() actually execute the groupby first and then the sort, similar to pandas?

I don't think that pandas does this. The result would be incorrect if it did. The outcome of a `first` aggregation changes with sorting, so if we decided to do the sort after the groupby operation, we would change the outcome of the query; this optimization is therefore invalid.
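
To make the point concrete, here is a minimal plain-Python sketch of the same logic. It does not use polars or pandas; the sample rows and the helper `group_first_last` are hypothetical, and the helper assumes that grouping keeps rows in the order they arrive. It only illustrates why taking `first`/`last` after sorting differs from taking them on unsorted input.

```python
from collections import defaultdict

# Hypothetical sample rows: (state, name, birthday as an ISO date string).
rows = [
    ("CA", "Alice Adams", "1950-03-01"),
    ("NY", "Bob Brown", "1970-07-15"),
    ("CA", "Carol Clark", "1940-01-20"),
    ("NY", "Dan Davis", "1960-11-30"),
]


def group_first_last(records):
    """Group by state in encounter order, then take the first and last name."""
    groups = defaultdict(list)
    for state, name, _birthday in records:
        groups[state].append(name)
    return {state: (names[0], names[-1]) for state, names in groups.items()}


# Logical order of the query: sort -> groupby/agg.
sorted_then_grouped = group_first_last(sorted(rows, key=lambda r: r[2]))

# No sort before the aggregation: first/last simply follow the input order.
grouped_unsorted = group_first_last(rows)

print(sorted_then_grouped)
print(grouped_unsorted)
```

With these sample rows the two results differ, so moving the sort behind the aggregation cannot be a valid rewrite of the query.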

ritchie46
  • so it doesn't group by `state` and sort the `birthday` within group. If that's true, I think it may be not a good example for user guide. – lemmingxuan May 08 '22 at 07:44
  • Why isn't it a good example? I don't think the user guide claimed it sorted the birthday within that group. – ritchie46 May 08 '22 at 08:33
  • But if it groups by `state` but otherwise preserves the order of the original data then this is the same as sorting by `birthday` within `state`? This clearly seems like what they suggest is done in the docs: https://pola-rs.github.io/polars-book/user-guide/dsl/groupby.html#sorting – Benjamin Christoffersen Feb 13 '23 at 14:20
  • Also the docs state the following: _"Let's say that we want to get the names of the oldest and youngest politicians per state."_ So if the people are __not__ sorted by `birthday` within the `state` in the example then it is a truly bad example. – Benjamin Christoffersen Feb 13 '23 at 14:32
  • It first sorts the data and then groups by `state`. This ensures that the data within each groupby aggregation is sorted. – ritchie46 Feb 13 '23 at 18:27
  • Yes, so then the user guide claims it sorted the `birthday` within that group, `state`, or at least something equivalent (the data within the groupby aggregation is sorted; the order of operations does not matter). This is contrary to your earlier comment: _"I don't think the user guide claimed it sorted the birthday within that group"_ unless I am misunderstanding something? Also this seems to go against the remark: "_The result would be incorrect if it did._" The result would be the same whether it a) sorts and groups but preserves order or b) groups and sorts within group. – Benjamin Christoffersen Feb 15 '23 at 08:01
  • Could you make a proposal PR on the user guide for better wording? – ritchie46 Feb 15 '23 at 12:54
  • I am not saying the user guide is wrong or confusing but that your comments and answer here seem contradictory. – Benjamin Christoffersen Feb 15 '23 at 13:12