What would be the most efficient way to compute, in polars (python-polars), a rolling median absolute deviation (MAD) over a given period over a given "groupby key" ?
The context/the dataframe
I have the following dataframe:
- it contains 3 columns:
  - 1 datetime column (type Datetime)
  - 1 group column (type UInt32)
  - 1 value column (type UInt32)
- the dataframe contains ~13 million rows
- it contains some values from 1500 distinct groups at the granularity of 5mn over 1 month:
  - 5mn data points for a full month is around 8640 points
  - 8640 points for 1500 groups => 8640 * 1500 ~= 13 million data points
┌─────────────────────┬───────┬───────┐
│ datetime ┆ group ┆ value │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ u32 ┆ u32 │
╞═════════════════════╪═══════╪═══════╡
│ 2023-04-17 14:50:00 ┆ 1 ┆ 8 │
│ 2023-04-17 14:50:00 ┆ 2 ┆ 4 │
│ 2023-04-17 14:50:00 ┆ 3 ┆ 14893 │
│ 2023-04-17 14:50:00 ┆ 4 ┆ 15 │
│ … ┆ … ┆ … │
│ 2023-05-17 14:45:00 ┆ 1497 ┆ 3 │
│ 2023-05-17 14:45:00 ┆ 1498 ┆ 8193 │
│ 2023-05-17 14:45:00 ┆ 1499 ┆ 42 │
│ 2023-05-17 14:45:00 ┆ 1500 ┆ 15 │
└─────────────────────┴───────┴───────┘
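For reference, a dataframe with the same shape can be generated along these lines (a minimal sketch with random values, assuming pl.datetime_range is available in your polars version; it is not my real data):

import numpy as np
import polars as pl
from datetime import datetime

# 5mn timestamps over one month, crossed with 1500 groups => ~13 million rows
times = pl.datetime_range(
    datetime(2023, 4, 17, 14, 50),
    datetime(2023, 5, 17, 14, 45),
    interval="5m",
    eager=True,
).alias("datetime")
groups = pl.Series("group", range(1, 1501), dtype=pl.UInt32)

df = (
    pl.DataFrame(times)
    .join(pl.DataFrame(groups), how="cross")
    .with_columns(
        pl.Series("value", np.random.randint(0, 2**15, size=len(times) * len(groups))).cast(pl.UInt32)
    )
)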
The processing
I now want to compute a rolling median absolute deviation over a week of data (i.e. 2016 observations as we are working at 5mn granularity) over every group.
I am currently able to compute a rolling standard deviation very efficiently with this kind of expression:
(
    df
    .with_columns(
        pl.col("value").rolling_std(2016).over("group").alias("std")
    )
)
It gives me an answer in less than a second!
Unfortunately, the standard deviation is too sensitive to "outliers" and we would need to switch to a median absolute deviation (i.e. take the median of the absolute deviations from the median instead of a mean-based measure of the deviations from the mean).
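For a single window, the MAD boils down to this (a non-rolling sketch on a handful of the values above):

import polars as pl

s = pl.Series("value", [8, 4, 14893, 15, 3])
# median of the absolute deviations from the median
mad = (s - s.median()).abs().median()
# the outlier 14893 barely moves the MAD, unlike the standard deviation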
The current approach
I gave it a try with rolling_apply plus the computation of the median on the returned series, but it is too slow (it is still processing after more than 5 minutes, compared to the ~1s of rolling_std):
(
    df
    .with_columns(
        pl.col("value")
        .rolling_apply(lambda s: (s - s.median()).median(), window_size=2016)
        .over("group")
        .alias("mad")
    )
)
Improving the performance
Is there a way to do it more efficiently and to reach performance comparable to what we get with rolling_std?
(I understand that computing the MAD can be slower than computing a standard deviation, but I mean staying within the same order of magnitude, i.e. being able to compute it in a few seconds.)
UPDATE
As noticed by @Dean Mac Gregor, the original lambda I put in "The current approach" does not give the expected result (it always returns 0).
The proper lambda to get the expected result is rather lambda s: (s - s.median()).abs().median() (the call to abs() was missing).
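i.e. the full (but still slow) expression becomes:

(
    df
    .with_columns(
        pl.col("value")
        .rolling_apply(lambda s: (s - s.median()).abs().median(), window_size=2016)
        .over("group")
        .alias("mad")
    )
)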
The performance issue remains.
How can I compute it more efficiently?