4

Environment

macos:             monterey
node:              v18.1.0
nodejs-polars:     0.5.3

Goal

Subtract every column in a polars DataFrame with the mean of that column.

Pandas solution

In pandas the solution is very concise thanks to DataFrame.sub(other, axis='columns', level=None, fill_value=None). other is scalar, sequence, Series, or DataFrame:

df.sub(df.mean())
df - df.mean()

nodejs-polars solution

While in nodejs-polars function, other only seems to be a Series according to sub: (other) => wrap("sub", prepareOtherArg(other).inner()).

1. Prepare data

console.log(df)
┌─────────┬─────────┬─────────┬─────────┐
│   A     ┆   B     ┆   C     ┆   D     │
│ ---     ┆ ---     ┆ ---     ┆ ---     │
│ i64     ┆ i64     ┆ i64     ┆ i64     │
╞═════════╪═════════╪═════════╪═════════╡
│ 13520   ┆ -16     ┆ 384     ┆ 208     │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 13472   ┆ -16     ┆ 384     ┆ 176     │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 13456   ┆ -16     ┆ 368     ┆ 160     │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 13472   ┆ -16     ┆ 368     ┆ 160     │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 13472   ┆ -16     ┆ 352     ┆ 176     │
└─────────┴─────────┴─────────┴─────────┘
console.log(df.mean())
┌─────────┬─────────┬─────────┬─────────┐
│   A     ┆   B     ┆   C     ┆   D     │
│ ---     ┆ ---     ┆ ---     ┆ ---     │
│ f64     ┆ f64     ┆ f64     ┆ f64     │
╞═════════╪═════════╪═════════╪═════════╡
│ 13478.4 ┆ -16.0   ┆ 371.2   ┆ 176.0   │
└─────────┴─────────┴─────────┴─────────┘

2. first try

df.sub(df.mean())

Error: Failed to determine supertype of Int64 and Struct([Field { name: "A", dtype: Int32 }, Field { name: "B", dtype: Int32 }, Field { name: "C", dtype: Int32 }, Field { name: "D", dtype: Int32 }])

3. second try

df.sub(pl.Series(df.mean().row(0)))

Program crashes due to memory problems.

4. third try

After some investigations, I noticed the tests:

test("sub", () => {
  const actual = pl.DataFrame({
    "foo": [1, 2, 3],
    "bar": [4, 5, 6]
  }).sub(1);
  const expected = pl.DataFrame({
    "foo": [0, 1, 2],
    "bar": [3, 4, 5]
  });
  expect(actual).toFrameEqual(expected);
});
test("sub:series", () => {
  const actual = pl.DataFrame({
    "foo": [1, 2, 3],
    "bar": [4, 5, 6]
  }).sub(pl.Series([1, 2, 3]));
  const expected = pl.DataFrame({
    "foo": [0, 0, 0],
    "bar": [3, 3, 3]
  });
  expect(actual).toFrameEqual(expected);
});

nodejs-polars seems to be unable to complete this task gracefully right now. So my current solution is a bit cumbersome: perform operations column by column then concat the results.

pl.concat(df.columns.map((col) => df.select(col).sub(df.select(col).mean(0).toSeries())), {how:'horizontal'})

Is there a better or easier way to do it?

5. new try

I just came out an easier solution, but it's hard to understand, and I'm still trying to figure out what happened under the hood.

df.select(pl.col('*').sub(pl.col('*').mean()))
Valley
  • 51
  • 5

1 Answers1

7

You tagged this problem with [python-polars], so I'll provide a solution using Polars with Python. (Perhaps you can translate that to Node-JS.)

Starting with our data:

import polars as pl

df = pl.DataFrame(
    {
        "A": [13520, 13472, 13456, 13472, 13472],
        "B": [-16, -16, -16, -16, -16],
        "C": [384, 384, 368, 368, 352],
        "D": [208, 176, 160, 160, 176],
    }
)
df
shape: (5, 4)
┌───────┬─────┬─────┬─────┐
│ A     ┆ B   ┆ C   ┆ D   │
│ ---   ┆ --- ┆ --- ┆ --- │
│ i64   ┆ i64 ┆ i64 ┆ i64 │
╞═══════╪═════╪═════╪═════╡
│ 13520 ┆ -16 ┆ 384 ┆ 208 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 13472 ┆ -16 ┆ 384 ┆ 176 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 13456 ┆ -16 ┆ 368 ┆ 160 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 13472 ┆ -16 ┆ 368 ┆ 160 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 13472 ┆ -16 ┆ 352 ┆ 176 │
└───────┴─────┴─────┴─────┘

We can very concisely solve this problem:

df.with_columns([
    (pl.all() - pl.all().mean()).suffix('_centered')
])
shape: (5, 8)
┌───────┬─────┬─────┬─────┬────────────┬────────────┬────────────┬────────────┐
│ A     ┆ B   ┆ C   ┆ D   ┆ A_centered ┆ B_centered ┆ C_centered ┆ D_centered │
│ ---   ┆ --- ┆ --- ┆ --- ┆ ---        ┆ ---        ┆ ---        ┆ ---        │
│ i64   ┆ i64 ┆ i64 ┆ i64 ┆ f64        ┆ f64        ┆ f64        ┆ f64        │
╞═══════╪═════╪═════╪═════╪════════════╪════════════╪════════════╪════════════╡
│ 13520 ┆ -16 ┆ 384 ┆ 208 ┆ 41.6       ┆ 0.0        ┆ 12.8       ┆ 32.0       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 13472 ┆ -16 ┆ 384 ┆ 176 ┆ -6.4       ┆ 0.0        ┆ 12.8       ┆ 0.0        │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 13456 ┆ -16 ┆ 368 ┆ 160 ┆ -22.4      ┆ 0.0        ┆ -3.2       ┆ -16.0      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 13472 ┆ -16 ┆ 368 ┆ 160 ┆ -6.4       ┆ 0.0        ┆ -3.2       ┆ -16.0      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 13472 ┆ -16 ┆ 352 ┆ 176 ┆ -6.4       ┆ 0.0        ┆ -19.2      ┆ 0.0        │
└───────┴─────┴─────┴─────┴────────────┴────────────┴────────────┴────────────┘

If you want to overwrite the columns, you can eliminate the suffix expression:

df.with_columns([
    (pl.all() - pl.all().mean())
])
shape: (5, 4)
┌───────┬─────┬───────┬───────┐
│ A     ┆ B   ┆ C     ┆ D     │
│ ---   ┆ --- ┆ ---   ┆ ---   │
│ f64   ┆ f64 ┆ f64   ┆ f64   │
╞═══════╪═════╪═══════╪═══════╡
│ 41.6  ┆ 0.0 ┆ 12.8  ┆ 32.0  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ -6.4  ┆ 0.0 ┆ 12.8  ┆ 0.0   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ -22.4 ┆ 0.0 ┆ -3.2  ┆ -16.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ -6.4  ┆ 0.0 ┆ -3.2  ┆ -16.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ -6.4  ┆ 0.0 ┆ -19.2 ┆ 0.0   │
└───────┴─────┴───────┴───────┘

Edit: essentially, the polars.all or the polars.col('*') replicates an entire expression for each column, so that:

pl.col('*') - pl.col('*').mean()

is syntactic sugar for:

[
  pl.col('A') - pl.col('A').mean(),
  pl.col('B') - pl.col('B').mean(),
  pl.col('C') - pl.col('C').mean(),
  pl.col('D') - pl.col('D').mean(),
]
  • 2
    Awesome! thank you very much, translated: `df.withColumns(pl.col('*').sub(pl.col('*').mean()))` or `df.select(pl.col('*').sub(pl.col('*').mean()))` – Valley Jun 08 '22 at 04:49
  • I provided some further explanation in an edit. –  Jun 08 '22 at 04:51