
What would be the most efficient way to compute, in polars (python-polars), a rolling median absolute deviation (MAD) over a given period, for each value of a given "group by" key?

The context/the dataframe

I have the following dataframe:

  • it contains 3 columns
    • 1 datetime column (type Datetime)
    • 1 group column (type UInt32)
    • 1 value column (type UInt32)
  • the dataframe contains ~13 million rows
    • it contains values from 1500 distinct groups at 5-minute granularity over 1 month
    • 5-minute data points over a full month is around 8640 points per group
    • 8640 points for 1500 groups => 8640 * 1500 ≈ 13 million data points
┌─────────────────────┬───────┬───────┐
│ datetime            ┆ group ┆ value │
│ ---                 ┆ ---   ┆ ---   │
│ datetime[μs]        ┆ u32   ┆ u32   │
╞═════════════════════╪═══════╪═══════╡
│ 2023-04-17 14:50:00 ┆ 1     ┆ 8     │
│ 2023-04-17 14:50:00 ┆ 2     ┆ 4     │
│ 2023-04-17 14:50:00 ┆ 3     ┆ 14893 │
│ 2023-04-17 14:50:00 ┆ 4     ┆ 15    │
│ …                   ┆ …     ┆ …     │
│ 2023-05-17 14:45:00 ┆ 1497  ┆ 3     │
│ 2023-05-17 14:45:00 ┆ 1498  ┆ 8193  │
│ 2023-05-17 14:45:00 ┆ 1499  ┆ 42    │
│ 2023-05-17 14:45:00 ┆ 1500  ┆ 15    │
└─────────────────────┴───────┴───────┘
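
For reference, a frame with this shape can be built with something like the sketch below (the values are random placeholders, and it assumes a recent polars where pl.datetime_range and pl.int_range are available):

import numpy as np
import polars as pl
from datetime import datetime

# 8640 five-minute timestamps over one month
timestamps = pl.datetime_range(
    datetime(2023, 4, 17, 14, 50),
    datetime(2023, 5, 17, 14, 45),
    interval="5m",
    eager=True,
)

n_groups = 1500
df = (
    pl.DataFrame({"datetime": timestamps})
    # cross join: every timestamp paired with every group id
    .join(
        pl.DataFrame({"group": pl.int_range(1, n_groups + 1, eager=True).cast(pl.UInt32)}),
        how="cross",
    )
    # random placeholder values standing in for the real measurements
    .with_columns(
        value=pl.lit(np.random.randint(0, 20000, size=len(timestamps) * n_groups)).cast(pl.UInt32)
    )
)  # ~13 million rows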

The processing

I now want to compute a rolling median absolute deviation over a week of data (i.e. 2016 observations, since 7 days × 24 hours × 12 five-minute points = 2016) for every group.

I am currently able to compute a rolling standard deviation very efficiently with this kind of expression:

(
    df
    .with_columns(
        pl.col("value").rolling_std(2016).over("group").alias("std")
    )
)

It gives me an answer in less than a second!

Unfortunately, the standard deviation is too sensitive to outliers, and we need to switch to a median absolute deviation (i.e. take the median of the absolute deviations from the median instead of a mean-based measure of deviation from the mean).
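
To make the definition concrete, here is the MAD of a small toy series (made-up numbers):

import polars as pl

s = pl.Series("value", [1, 2, 3, 100])
# median of the absolute deviations from the median
mad = (s - s.median()).abs().median()
print(mad)  # 1.0: the outlier (100) barely moves it, whereas it inflates the standard deviation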

The current approach

I gave it a try with rolling_apply plus the computation of the median from the returned series, but it is too slow (it is still processing after more than 5 minutes, compared to the ~1 s of rolling_std).

(
    df
    .with_columns(
        # NOTE: as explained in the UPDATE below, this lambda is missing a call to .abs()
        pl.col("value").rolling_apply(lambda s: (s-s.median()).median(), window_size=2016).over("group").alias("mad")
    )
)

Improving the performance

Is there a way to do it more efficiently and reach the same kind of performance as rolling_std?

(I understand that computing the MAD can be slower than computing a standard deviation, but I mean the same order of magnitude, i.e. being able to compute it in a few seconds.)

UPDATE

As noticed by @Dean MacGregor, the original lambda I put in "The current approach" does not give the expected result (it always returns 0).
The proper lambda for the expected result is rather lambda s: (s-s.median()).abs().median() (the call to abs() was missing).
The performance issue remains.
How can it be computed more efficiently?

Julien
    When I run your lambda on random numbers I just get a bunch of nulls and 0s so it's not clear what is supposed to happen without sample data – Dean MacGregor May 30 '23 at 12:43
  • Indeed, sorry, I missed a call to `abs()` in my lambda. It should be `lambda s: (s-s.median()).abs().median()`. When I do this, I get the expected result, but it is too slow for my use case. – Julien May 30 '23 at 14:58

2 Answers


I think you can use .cumulative_eval() for this.

s = pl.element().tail(2016)

df.with_columns(mad = 
   pl.col("value")
     .cumulative_eval((s - s.median()).abs().median(), parallel=True)
     .over("group")
)

You will get all the values, as if you supplied min_periods=1 to the rolling method.
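
If you need the null-padded behaviour of the rolling methods instead (i.e. min_periods equal to the window size), one possible sketch is to null out the first 2015 rows of each group afterwards; this assumes a recent polars where pl.int_range and pl.len are available:

s = pl.element().tail(2016)

result = (
    df.with_columns(
        mad=pl.col("value")
            .cumulative_eval((s - s.median()).abs().median(), parallel=True)
            .over("group")
    )
    .with_columns(
        # keep the MAD only once a full 2016-observation window has been seen
        mad=pl.when(pl.int_range(0, pl.len()).over("group") >= 2016 - 1)
            .then(pl.col("mad"))
    )
)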

jqurious
  • Thanks a lot @jqurious. I was not aware of `cumulative_eval()` and indeed it gives me the correct answer with the best performance I can reach so far: on my original dataframe (13 million rows & more than 1000 groups) I get the answer in around 1 minute (all my 8 cores are busy during the computation). It looks like there is around a factor-10 improvement compared to the `rolling_apply` approach, which I guess comes mainly from the parallelization that I do not have with `rolling_apply`. – Julien May 30 '23 at 20:21
  • I am wondering if there is a way to get "another order of magnitude" improvement. If I take the rolling_std example: with the `rolling_apply` approach, it answers in several minutes (1 core used); with the `cumulative_eval()` approach, it is ~10 s (my 8 cores being used); with the native `rolling_std`, it is < 1 s! So, for the rolling std, there is a factor 10 between `cumulative_eval` and the native `rolling_std`. I am first not 100% clear about where this additional gain comes from (more vectorization?), and can we have the same gain for a rolling MAD (maybe only possible if natively implemented in Rust?) – Julien May 30 '23 at 20:49
  • @Julien This is my first time seeing "mad", but according to https://stackoverflow.com/a/42867926 it is a vectorizable operation? So I assume it should be possible to do something similar in polars. – jqurious May 30 '23 at 21:11
  • @Julien It could also be worthwhile asking on [the issues tracker](https://github.com/pola-rs/polars/issues/) if you do not receive further feedback here as it's a rather specific performance related issue. – jqurious May 30 '23 at 22:28
  • Thanks, I will do that on the issue tracker. The link you're referring to on Stack Overflow is about the mean absolute deviation, not the median absolute deviation. It may be confusing as they are sometimes both referred to by the same acronym (MAD). The median absolute deviation is a robust estimator of dispersion (robust in the sense of being less sensitive to outliers) and as such is very useful in the context of anomaly detection. I guess there is something to do in polars. There is already `rolling_median`, which is very efficient, and here we're talking about a median of deviations from the median. – Julien May 31 '23 at 05:01
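
Building on that last comment, a rough stop-gap would be to chain two native rolling_median calls. Note that this is only an approximation of the windowed MAD: each point's deviation is measured from its own trailing median rather than from the single median of the window, so the result differs from the exact definition and should be validated before relying on it.

approx = df.with_columns(
    mad_approx=(pl.col("value") - pl.col("value").rolling_median(2016))
        .abs()
        .rolling_median(window_size=2016)
        .over("group")
)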

Let's start with some sample data

import polars as pl
import numpy as np
from datetime import datetime

np.random.seed(0)
df = (
    pl.DataFrame({
        'datetime': pl.date_range(
            datetime(2023, 4, 17, 14, 50),
            datetime(2023, 5, 17, 14, 45),
            '5m',
            eager=True,
        )
    })
    .with_columns(group=pl.lit([[x for x in range(5)]]))
    .explode('group')
    .with_columns(value=pl.lit(np.random.randint(0, 10000, size=43200)))
)

Then your MAD formula is:

df=(
    df
    .with_columns(
        pl.col("value").rolling_apply(lambda s: (s-s.median()).abs().median(), window_size=2016).over("group").alias("mad")
    )
)

This takes about 5.8 sec for me.

You can use groupby_rolling with an agg to get the functionality of rolling_* functions.

That'll look like this:

(
    df
    .groupby_rolling('datetime', period='7d', by='group')
    .agg(
        gbrmad=(pl.col('value') - pl.col('value').median()).abs().median()
    )
)

This takes 1 sec.

I notice that your lambda function returns an int, and more specifically that it's always lower than what mine returns. If that is intended and not just a side effect, then you can do:

(
    df
    .groupby_rolling('datetime', period='7d', by='group')
    .agg(
        gbrmad=(pl.col('value') - pl.col('value').median()).abs().median().floor().cast(pl.Int64())
    )
)

Lastly, you're using a fixed window size of 2016 rather than explicitly a week, so you get nulls in the early part of the dataset. If you want to keep that behavior, then we can add a when like this:

(
    df
    .groupby_rolling('datetime', period='7d', by='group')
    .agg(
        gbrmad=pl.when(pl.col('value').count() == 2016)
                 .then((pl.col('value') - pl.col('value').median()).abs().median().floor().cast(pl.Int64()))
    )
)

To check results:

df.join(
    df
    .groupby_rolling('datetime', period='7d', by='group')
    .agg(
        gbrmad=pl.when(pl.col('value').count() == 2016)
                 .then((pl.col('value') - pl.col('value').median()).abs().median().floor().cast(pl.Int64()))
    ),
    on=['datetime', 'group']
).filter(pl.col('mad') != pl.col('gbrmad'))

0 rows where the lambda-based MAD calculation differs, with about a 5x improvement in speed.

Dean MacGregor
  • Thank you @Dean for the answer. Indeed, I've realized that the rolling_apply example I put gives 0, which is not what I'm expecting (we should get the median of the "rolling" distances to the median, so it should not always be 0). I am not clear about why it is 0, and then there will be the "performance" aspect. – Julien May 30 '23 at 14:27
  • OK, the original `lambda` returns all 0 because I missed the absolute value when computing the distance to the median. The lambda should rather be `lambda s: (s-s.median()).abs().median()` – Julien May 30 '23 at 14:59
  • Thanks a lot for your time on this one (your way of building the sample df is really nice). Regarding the use of groupby_rolling, thanks for the approach. Unfortunately, on my laptop (MacBook Pro quad-core i7 2.3 GHz), I do not get the same result: the two approaches take the same time. The one with rolling_apply finishes in around 2.5 to 3 s and the groupby_rolling also takes around 3 s (which is surprisingly slower than on your side, whereas rolling_apply is quicker on my laptop than on yours). – Julien May 30 '23 at 20:10