I'm pivoting a rather large dataframe of shape (10_000_000, 678)
into one of approx. shape (770_000, 8_789)
to create a dataset for an ML algorithm. It's a relatively slow operation taking about half an hour on a high-ram cluster I am using, and I'm wondering if there is a way to speed it up. Here is a minimum example, with a larger one below:
import polars as pl
import numpy as np
data = {
"id": [1,1,1,2,2,2,3,3,3],
"rank": [1,2,3,1,2,3,1,2,3], # rank is always repeating 1-3 (or 0-12 in large example)
"A": np.random.random((9)),
"B": np.random.random((9)),
}
df = pl.DataFrame(data)
df_pivot = df.pivot(values=["A", "B"], index="id", columns="rank")
# Now rename columns, since they are currently:
# df_pivot.columns
# ['id', '1', '2', '3', '1', '2', '3']
ranks = [1,2,3]
renamed_columns = df_pivot.columns[:1]
for col in df.columns[2:]:
for rank in ranks:
renamed_columns.append(f"{col}_{rank}")
df_pivot.columns = renamed_columns
# df_pivot
shape: (3, 7)
┌─────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┐
│ id ┆ A_1 ┆ A_2 ┆ A_3 ┆ B_1 ┆ B_2 ┆ B_3 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═════╪══════════╪══════════╪══════════╪══════════╪══════════╪══════════╡
│ 1 ┆ 0.867957 ┆ 0.854234 ┆ 0.408062 ┆ 0.076254 ┆ 0.899092 ┆ 0.059019 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 0.642296 ┆ 0.670476 ┆ 0.480494 ┆ 0.4254 ┆ 0.536173 ┆ 0.492312 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 0.778481 ┆ 0.151697 ┆ 0.330138 ┆ 0.6661 ┆ 0.4086 ┆ 0.992057 │
└─────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
The polars pivot code states that in a comment:
Polars lazy does not implement a pivot because it is impossible to know the schema without materializing the whole dataset. This makes a pivot quite a terrible operation for performant workflows. An optimization can never be pushed down passed a pivot.
And in the groupby.pivot code:
Polars'/arrow memory is not ideal for transposing operations like pivots. If you have a relatively large table, consider using a groupby over a pivot.
Some questions:
Is it possible to replace the above pivot example by a (preferably lazy) combination of groupby and something else? This SO post about pandas suggests an equivalency of groupby + "unstack" with pivot. Polars does not implement an unstack function, afaik.
Is the above suggestion more performant than the current pivot implementation? (See the larger example below).
I actually do know the schema ahead-of-schedule, since in my situation
rank
is a known series ([1, 2, 3] in the example). If implemented, would a lazy pivot where one can supply the schema be more performant than the eager one?Should I be implementing it differently?
# Much larger example, but with 10_000 rows instead of 10_000_000
# 10_000 runs in 3 seconds, 100_000 runs in 40 seconds (M1 macbook)
from string import ascii_lowercase
import polars as pl
import numpy as np
ranks = np.arange(13)
N_ROWS = 10_000 # this could be ~10_000_000
df = (pl.DataFrame({"ID": np.arange(N_ROWS)})).join(
pl.DataFrame({"rank": ranks}), how="cross"
)
# create 26**2 dummy column names
column_names = []
for letter1 in ascii_lowercase:
for letter2 in ascii_lowercase:
column_names.append(letter1 + letter2)
# stack frames to create: ID, ranks, aa, ab, ..., zz
df = df.hstack(
pl.DataFrame({letter: np.random.random(len(df)) for letter in column_names})
)
df_pivot = df.pivot(values=df.columns[2:], index="ID", columns="rank")
renamed_columns = df_pivot.columns[:1]
for col in df.columns[2:]:
for rank in ranks:
renamed_columns.append(f"{col}_{rank}")
df_pivot.columns = renamed_columns