
I am trying to follow the advice from this question:

import numpy as np
import polars as pl

df = pl.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df.select([pl.all().map(np.log2)])
shape: (3, 2)
┌──────────┬──────────┐
│ a        ┆ b        │
│ ---      ┆ ---      │
│ f64      ┆ f64      │
╞══════════╪══════════╡
│ 0.0      ┆ 2.0      │
│ 1.0      ┆ 2.321928 │
│ 1.584963 ┆ 2.584963 │
└──────────┴──────────┘

So far, so good. But:

from sklearn.preprocessing import minmax_scale

df.select(pl.all().map(minmax_scale))
shape: (1, 2)
┌─────────────────┬─────────────────┐
│ a               ┆ b               │
│ ---             ┆ ---             │
│ list[f64]       ┆ list[f64]       │
╞═════════════════╪═════════════════╡
│ [0.0, 0.5, 1.0] ┆ [0.0, 0.5, 1.0] │
└─────────────────┴─────────────────┘

I found a way of converting the pl.List columns back to flat columns, but it seems strange that this step should be needed.

df.select(pl.all().map(minmax_scale)).explode(pl.all())
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞═════╪═════╡
│ 0.0 ┆ 0.0 │
│ 0.5 ┆ 0.5 │
│ 1.0 ┆ 1.0 │
└─────┴─────┘

Both minmax_scale and np.log2 return arrays, so I would expect the behavior to be the same.
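
As a sanity check, calling both on a plain NumPy array gives back an ndarray in both cases:

arr = np.array([1, 2, 3])
type(np.log2(arr))
# numpy.ndarray
type(minmax_scale(arr))
# numpy.ndarray

So what is the proper way of doing this?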

Matias Andina

2 Answers


Alternatively: why not do the scaling math yourself instead of using map or apply, so polars can multithread it?

# log base 2, computed natively by the engine
df.select(pl.all().log(2))

# min-max scaling written out as a native expression
df.select((pl.all() - pl.all().min()) / (pl.all().max() - pl.all().min()))
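
These expressions also compose, so a multi-step transform can stay in the native engine as well. A sketch (reusing a variable just to keep the line readable):

expr = pl.all().log(2)
# log2 first, then min-max scale, all as one native expression
df.select((expr - expr.min()) / (expr.max() - expr.min()))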
Wayoshi
  • This is true for this use case, but "doing the math myself" is not always as straightforward as with these examples. – Matias Andina Jul 10 '23 at 20:47
  • I took the question more in the vein of "how to use an arbitrary function" than focusing on the one function they happened to pick out. – Dean MacGregor Jul 10 '23 at 21:16

Try to do this...

minmax_scale(df['a'])
# array([0. , 0.5, 1. ])

Now do...

np.log2(df['a'])
shape: (3,)
Series: 'a' [f64]
[
    0.0
    1.0
    1.584963
]

Notice how with log2 you get back a Series, not an array. The difference is that np.log2 is a true ufunc, so the output stays a Series. You can see this directly with:

type(np.log2)
# numpy.ufunc

type(minmax_scale)
# function
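
As I understand it, the mechanism is NumPy's ufunc protocol: a polars Series implements __array_ufunc__, so a true ufunc like np.log2 dispatches back to polars instead of converting the data to a plain ndarray first:

hasattr(pl.Series, "__array_ufunc__")
# True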

You can do this to avoid the explode:

df.select(pl.all().map(lambda x: pl.Series(minmax_scale(x))))

Or you can just define your own function in one line:

my_minmax_scale = lambda x: pl.Series(minmax_scale(x))
df.select(pl.all().map(my_minmax_scale))
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞═════╪═════╡
│ 0.0 ┆ 0.0 │
│ 0.5 ┆ 0.5 │
│ 1.0 ┆ 1.0 │
└─────┴─────┘
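
The same wrapping works for a multi-step function too. A sketch (log_then_scale is a made-up name for illustration):

def log_then_scale(s: pl.Series) -> pl.Series:
    # do the intermediate steps on a plain ndarray...
    arr = minmax_scale(np.log2(s.to_numpy()))
    # ...then hand polars back a Series
    return pl.Series(arr)

df.select(pl.all().map(log_then_scale))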
Dean MacGregor
  • Very clear! The `lambda x: pl.Series(...)` reads much better to me than using the explode. Would this be the recommended way for any custom function, as you show? Say I have a function with several steps: do I just wrap the `np.array` output in a `pl.Series`? – Matias Andina Jul 10 '23 at 20:44
  • I don't know if I'm in a position to say anything is "*the* recommended" way to do it, but yes, I think wrapping a custom function's output in pl.Series is a good, effective way of getting the result you're after. – Dean MacGregor Jul 10 '23 at 20:53