Given the following data, I'm looking to group by and combine two columns into one column holding a dictionary. One column supplies the keys, while the values come from another column, which is aggregated into a list first.

import polars as pl

data = pl.DataFrame(
    {
        "names": ["foo", "ham", "spam", "cheese", "egg", "foo"],
        "dates": ["1", "1", "2", "3", "3", "4"],
        "groups": ["A", "A", "B", "B", "B", "C"],
    }
)

>>> print(data)
    names dates groups
0     foo     1      A
1     ham     1      A
2    spam     2      B
3  cheese     3      B
4     egg     3      B
5     foo     4      C


# This is what I'm trying to do:
  groups                                 combined
0      A                    {'1': ['foo', 'ham']}
1      B  {'2': ['spam'], '3': ['cheese', 'egg']}
2      C                           {'4': ['foo']}

In pandas I can do this using two groupby statements, and in pyspark with a set of operations around "map_from_entries", but despite various attempts I haven't figured out a way in polars. So far I use agg_list(), convert to pandas, and apply a lambda. While this works, it certainly doesn't feel right.

data = data.groupby(["groups", "dates"])["names"].agg_list()

data = (
    data.to_pandas()
    .groupby(["groups"])
    .apply(lambda x: dict(zip(x["dates"], x["names_agg_list"])))
    .reset_index(name="combined")
    )

Alternatively, inspired by this post, I've tried a number of variations similar to the following, including converting the dict to JSON strings, among other things.

data = data.groupby(["groups"]).agg(
    pl.apply(exprs=["dates", "names_agg_list"], f=build_dict).alias("combined")
    )

1 Answer

With the release of polars>=0.12.10 you can do this:

print(
    data
    # first level: collect the names per (group, date) pair into a list
    .groupby(["groups", "dates"])
    .agg(pl.col("names").list().keep_name())
    # second level: zip the dates and the name lists into a Python dict per group
    .groupby("groups")
    .agg([
        pl.apply(
            [pl.col("dates"), pl.col("names")],
            lambda s: dict(zip(s[0], s[1].to_list())),
        )
    ])
)
shape: (3, 2)
┌────────┬─────────────────────────────────────┐
│ groups ┆ dates                               │
│ ---    ┆ ---                                 │
│ str    ┆ object                              │
╞════════╪═════════════════════════════════════╡
│ A      ┆ {'1': ['foo', 'ham']}               │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ C      ┆ {'4': ['foo']}                      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ B      ┆ {'3': ['cheese', 'egg'], '2': ['... │
└────────┴─────────────────────────────────────┘
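If the goal is plain Python dicts keyed by group, they can be pulled back out of the object column afterwards. A minimal sketch, assuming the aggregated frame above is assigned to a variable out instead of being printed directly:

# "out" is assumed to hold the result of the groupby/agg chain above;
# the "dates" column now contains Python dict objects.
combined = dict(zip(out["groups"].to_list(), out["dates"].to_list()))
print(combined["A"])  # {'1': ['foo', 'ham']}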

This is not really how you should be using DataFrames, though. There is likely a solution that lets you work with more flattened DataFrames and doesn't require you to put slow Python objects into them.
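For reference, one way to stay closer to that advice is to keep the nested data in native polars types, e.g. a list of date/names structs per group instead of a Python dict. A sketch, assuming a considerably newer polars version in which groupby has become group_by and struct columns exist; the "combined" column name and the maintain_order flag are just illustrative choices:

import polars as pl

data = pl.DataFrame(
    {
        "names": ["foo", "ham", "spam", "cheese", "egg", "foo"],
        "dates": ["1", "1", "2", "3", "3", "4"],
        "groups": ["A", "A", "B", "B", "B", "C"],
    }
)

# Collect names per (group, date) pair, then pack the date/names pairs
# into a list of structs per group - no Python objects in the frame.
out = (
    data.group_by(["groups", "dates"], maintain_order=True)
    .agg(pl.col("names"))
    .group_by("groups", maintain_order=True)
    .agg(pl.struct(["dates", "names"]).alias("combined"))
)
print(out)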
