How to reshape/melt multiple columns in a dataset and split the resulting values into a different number of columns in Polars?

Question

I'm trying to restructure a dataframe by combining values from multiple pairs of columns into a single column for each pair. For example, I've got a column called age_pre and a column called age_post, and I want to combine these two columns into one column called age. Same with the exposure_pre and exposure_post columns.

id	age_pre	age_post	exposure_pre	exposure_post	...
123	55	56	0.49	0.51	...
123	56	57	0.49	0.51	...
456	49	50	0.80	0.20	...
456	50	51	0.80	0.20	...

My desired output would look like this

id	age	exposure
123	55	0.49
123	56	0.51
123	56	0.49
123	57	0.51
456	49	0.8
456	50	0.2
456	50	0.8
456	51	0.2

I tried creating multiple melted dataframes and using .hstack() to create a dataframe with my desired output, put this seems really inefficient. How can I do this with the Polars syntax?

score 2 · Answer 1 · answered Mar 11 '23 at 07:59

Concatenating dataframes can be obtained with concat, vstack or extend. vstack is a very inexpensive way to contatenate.

Below an example using vstack:

import polars as pl

df = pl.DataFrame({
    'id' : [123, 123, 456, 456],
    'age_pre' : [55, 56, 49, 50],
    'age_post' : [56, 57, 50, 51],
    'exposure_pre' : [0.49, 0.49, 0.80, 0.80],
    'exposure_post' : [0.51, 0.51, 0.20, 0.20]
})

column_rename_pre = {column.name : column.name.replace('_pre','') 
    for column in df if 'pre' in column.name}
column_rename_post = {column.name : column.name.replace('_post','') 
    for column in df if 'post' in column.name}

out = (
    df.select(pl.col('^(id|.*_pre)$')).rename(column_rename_pre)
    .vstack(df.select(pl.col('^(id|.*_post)$')).rename(column_rename_post))
)

# result

shape: (8, 3)
┌─────┬─────┬──────────┐
│ id  ┆ age ┆ exposure │
│ --- ┆ --- ┆ ---      │
│ i64 ┆ i64 ┆ f64      │
╞═════╪═════╪══════════╡
│ 123 ┆ 55  ┆ 0.49     │
│ 123 ┆ 56  ┆ 0.49     │
│ 456 ┆ 49  ┆ 0.8      │
│ 456 ┆ 50  ┆ 0.8      │
│ 123 ┆ 56  ┆ 0.51     │
│ 123 ┆ 57  ┆ 0.51     │
│ 456 ┆ 50  ┆ 0.2      │
│ 456 ┆ 51  ┆ 0.2      │
└─────┴─────┴──────────┘

There is a good Stack Overflow answer explaining the difference between concat, vstack and extend: How can I append or concatenate two dataframes in python polars?

I copy below the recommendations from that answer

There are the following ways to append data:

concat -> concatenate all given DataFrames. This is sort of a linked list of DataFrames. If you pass rechunk=True, all memory will be reallocated to contiguous chunks.

vstack -> Adds the data from other to DataFrame by incrementing a refcount. This is super cheap. It is recommended to call rechunk after many vstacks. Or simply use pl.concat.

extend This operation copies data. It tries to copy data from other to DataFrame. If however the refcount of DataFrame is larger than 1. A new buffer of memory is allocated to hold both DataFrames.

ResidentSleeper · Accepted Answer · 2023-03-11T10:29:36.433

You can use pl.concat_list then .explode.

out = df.with_columns(
    pl.concat_list(["age_pre", "age_post"]).alias("age"),
    pl.concat_list(["exposure_pre", "exposure_post"]).alias("exposure"),
).select(['id', 'age', 'exposure']).explode(['age', 'exposure'])
    
print(out)
shape: (8, 3)
┌─────┬─────┬──────────┐
│ id  ┆ age ┆ exposure │
│ --- ┆ --- ┆ ---      │
│ i64 ┆ i64 ┆ f64      │
╞═════╪═════╪══════════╡
│ 123 ┆ 55  ┆ 0.49     │
│ 123 ┆ 56  ┆ 0.51     │
│ 123 ┆ 56  ┆ 0.49     │
│ 123 ┆ 57  ┆ 0.51     │
│ 456 ┆ 49  ┆ 0.8      │
│ 456 ┆ 50  ┆ 0.2      │
│ 456 ┆ 50  ┆ 0.8      │
│ 456 ┆ 51  ┆ 0.2      │
└─────┴─────┴──────────┘

How to reshape/melt multiple columns in a dataset and split the resulting values into a different number of columns in Polars?

2 Answers2