1

I'm trying to restructure a dataframe by combining values from multiple pairs of columns into a single column for each pair. For example, I've got a column called age_pre and a column called age_post, and I want to combine these two columns into one column called age. Same with the exposure_pre and exposure_post columns.

id age_pre age_post exposure_pre exposure_post ...
123 55 56 0.49 0.51 ...
123 56 57 0.49 0.51 ...
456 49 50 0.80 0.20 ...
456 50 51 0.80 0.20 ...

My desired output would look like this

id age exposure
123 55 0.49
123 56 0.51
123 56 0.49
123 57 0.51
456 49 0.8
456 50 0.2
456 50 0.8
456 51 0.2

I tried creating multiple melted dataframes and using .hstack() to create a dataframe with my desired output, put this seems really inefficient. How can I do this with the Polars syntax?

2 Answers2

2

Concatenating dataframes can be obtained with concat, vstack or extend. vstack is a very inexpensive way to contatenate.

Below an example using vstack:

import polars as pl

df = pl.DataFrame({
    'id' : [123, 123, 456, 456],
    'age_pre' : [55, 56, 49, 50],
    'age_post' : [56, 57, 50, 51],
    'exposure_pre' : [0.49, 0.49, 0.80, 0.80],
    'exposure_post' : [0.51, 0.51, 0.20, 0.20]
})

column_rename_pre = {column.name : column.name.replace('_pre','') 
    for column in df if 'pre' in column.name}
column_rename_post = {column.name : column.name.replace('_post','') 
    for column in df if 'post' in column.name}

out = (
    df.select(pl.col('^(id|.*_pre)$')).rename(column_rename_pre)
    .vstack(df.select(pl.col('^(id|.*_post)$')).rename(column_rename_post))
)

# result

shape: (8, 3)
┌─────┬─────┬──────────┐
│ id  ┆ age ┆ exposure │
│ --- ┆ --- ┆ ---      │
│ i64 ┆ i64 ┆ f64      │
╞═════╪═════╪══════════╡
│ 123 ┆ 55  ┆ 0.49     │
│ 123 ┆ 56  ┆ 0.49     │
│ 456 ┆ 49  ┆ 0.8      │
│ 456 ┆ 50  ┆ 0.8      │
│ 123 ┆ 56  ┆ 0.51     │
│ 123 ┆ 57  ┆ 0.51     │
│ 456 ┆ 50  ┆ 0.2      │
│ 456 ┆ 51  ┆ 0.2      │
└─────┴─────┴──────────┘

There is a good Stack Overflow answer explaining the difference between concat, vstack and extend: How can I append or concatenate two dataframes in python polars?

I copy below the recommendations from that answer

There are the following ways to append data:

concat -> concatenate all given DataFrames. This is sort of a linked list of DataFrames. If you pass rechunk=True, all memory will be reallocated to contiguous chunks.

vstack -> Adds the data from other to DataFrame by incrementing a refcount. This is super cheap. It is recommended to call rechunk after many vstacks. Or simply use pl.concat.

extend This operation copies data. It tries to copy data from other to DataFrame. If however the refcount of DataFrame is larger than 1. A new buffer of memory is allocated to hold both DataFrames.

Luca
  • 1,216
  • 6
  • 10
1

You can use pl.concat_list then .explode.

out = df.with_columns(
    pl.concat_list(["age_pre", "age_post"]).alias("age"),
    pl.concat_list(["exposure_pre", "exposure_post"]).alias("exposure"),
).select(['id', 'age', 'exposure']).explode(['age', 'exposure'])
    
print(out)
shape: (8, 3)
┌─────┬─────┬──────────┐
│ id  ┆ age ┆ exposure │
│ --- ┆ --- ┆ ---      │
│ i64 ┆ i64 ┆ f64      │
╞═════╪═════╪══════════╡
│ 123 ┆ 55  ┆ 0.49     │
│ 123 ┆ 56  ┆ 0.51     │
│ 123 ┆ 56  ┆ 0.49     │
│ 123 ┆ 57  ┆ 0.51     │
│ 456 ┆ 49  ┆ 0.8      │
│ 456 ┆ 50  ┆ 0.2      │
│ 456 ┆ 50  ┆ 0.8      │
│ 456 ┆ 51  ┆ 0.2      │
└─────┴─────┴──────────┘
ResidentSleeper
  • 2,385
  • 2
  • 10
  • 20