
I want to understand the idiomatic Polars way to create temporal sub-groups out of the groups produced by a groupby_rolling() operation.

I'm looking to do this while keeping things parallel, i.e. without using apply() (I'm aware of that approach) and without creating and merging secondary dataframes.

Example input:

┌─────┬─────────────────────┬───────┐
│ row ┆ date                ┆ price │
│ --- ┆ ---                 ┆ ---   │
│ i64 ┆ datetime[μs]        ┆ i64   │
╞═════╪═════════════════════╪═══════╡
│ 1   ┆ 2022-01-01 10:00:00 ┆ 10    │
│ 2   ┆ 2022-01-01 10:05:00 ┆ 20    │
│ 3   ┆ 2022-01-01 10:10:00 ┆ 30    │
│ 4   ┆ 2022-01-01 10:15:00 ┆ 40    │
│ 5   ┆ 2022-01-01 10:20:00 ┆ 50    │
│ 6   ┆ 2022-01-01 10:25:00 ┆ 60    │
│ 7   ┆ 2022-01-01 10:30:00 ┆ 70    │
│ 8   ┆ 2022-01-01 10:35:00 ┆ 80    │
│ 9   ┆ 2022-01-01 10:40:00 ┆ 90    │
│ 10  ┆ 2022-01-01 10:45:00 ┆ 100   │
│ 11  ┆ 2022-01-01 10:50:00 ┆ 110   │
│ 12  ┆ 2022-01-01 10:55:00 ┆ 120   │
│ 13  ┆ 2022-01-01 11:00:00 ┆ 130   │
└─────┴─────────────────────┴───────┘

Desired output:

┌─────┬─────────────────────┬───────┬──────────────────────────────────┐
│ row ┆ date                ┆ price ┆ 10_min_groups_mean_price_history │
│ --- ┆ ---                 ┆ ---   ┆ ---                              │
│ i64 ┆ datetime[μs]        ┆ i64   ┆ list[i64]                        │
╞═════╪═════════════════════╪═══════╪══════════════════════════════════╡
│ 1   ┆ 2022-01-01 10:00:00 ┆ 10    ┆ [10]                             │
│ 2   ┆ 2022-01-01 10:05:00 ┆ 20    ┆ [15]                             │
│ 3   ┆ 2022-01-01 10:10:00 ┆ 30    ┆ [25, 10]                         │
│ 4   ┆ 2022-01-01 10:15:00 ┆ 40    ┆ [35, 15]                         │
│ 5   ┆ 2022-01-01 10:20:00 ┆ 50    ┆ [45, 25, 10]                     │
│ 6   ┆ 2022-01-01 10:25:00 ┆ 60    ┆ [55, 35, 15]                     │
│ 7   ┆ 2022-01-01 10:30:00 ┆ 70    ┆ [65, 45, 25]                     │
│ 8   ┆ 2022-01-01 10:35:00 ┆ 80    ┆ [75, 55, 35]                     │
│ 9   ┆ 2022-01-01 10:40:00 ┆ 90    ┆ [85, 65, 45]                     │
│ 10  ┆ 2022-01-01 10:45:00 ┆ 100   ┆ [95, 75, 55]                     │
│ 11  ┆ 2022-01-01 10:50:00 ┆ 110   ┆ [105, 85, 65]                    │
│ 12  ┆ 2022-01-01 10:55:00 ┆ 120   ┆ [115, 95, 75]                    │
│ 13  ┆ 2022-01-01 11:00:00 ┆ 130   ┆ [125, 105, 85]                   │
└─────┴─────────────────────┴───────┴──────────────────────────────────┘

What is happening above?

  • A rolling window is applied over the dataframe, producing a window per row.
  • Each window includes all rows within the last 30min (including the current row).
  • Each 30min window is then divided into 10min sub-groups.
  • The mean price is calculated for each 10min sub-group.
  • All mean prices from the sub-groups are returned as a list (most recent first) in the "10_min_groups_mean_price_history" column.

Worked example (using row 5 as an example):

  • The rolling window for row 5 captures the previous 30min of data, which is rows 1 to 5.
  • These rows are sub-grouped into 10min windows, creating three sub-groups that capture rows [[5, 4], [3, 2], [1]].
  • The mean price of the rows in each sub-group is calculated and produced as a list → [45, 25, 10] (see the small check below).
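A quick arithmetic check of that worked example in plain Python, just to make the sub-group means explicit:

# Row 5's 30min window covers rows 1-5 (prices 10..50); the 10min sub-groups,
# ordered most recent first, and their mean prices:
sub_groups = [[50, 40], [30, 20], [10]]
means = [sum(g) / len(g) for g in sub_groups]
print(means)  # [45.0, 25.0, 10.0]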

Mental model:

I'm conceptualising this as treating each window from a groupby_rolling() operation as a dataframe that can be operated on as needed (in this case by performing a groupby_dynamic() operation on it, with the intent of returning the sub-group aggregations as a list), but I'm not sure if that is the right way to think about it.

If the sub-group data were categorical it would be a simple case of using over() (a minimal sketch of that case follows below); however, I'm not aware of an equivalent when the requirement is to sub-group by a time window.
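For reference, here is a minimal sketch of that categorical case, assuming a hypothetical "category" column standing in for the time-based sub-groups; over() broadcasts the per-group mean back onto every row:

import polars as pl

# Hypothetical categorical version: "category" plays the role of the sub-group
# key, and over() returns the per-category mean price on every row.
df_cat = pl.DataFrame({
    "category": ["a", "a", "b", "b"],
    "price": [10, 20, 30, 40],
})
df_cat.with_columns([
    pl.col("price").mean().over("category").alias("mean_price_per_category"),
])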

I am also under the impression that this operation should be parallelisable, as each window is independent of the others (it's just more calculation steps), but please point out if there's a reason it can't be.

Thanks in advance!

Full dummy data set:

If you want to run this with a realistically sized dataset, you can use:

import numpy as np
import polars as pl
from datetime import datetime, timedelta

# One row per second between 09:00:00 and 16:59:59, with random prices
df_dummy = pl.DataFrame({
    'date': pl.date_range(
        datetime(2000, 1, 1, 9),
        datetime(2000, 1, 1, 16, 59, 59),
        timedelta(seconds=1),
    )
})
df_dummy = df_dummy.with_column(
    pl.Series(np.random.uniform(0.5, 0.95, len(df_dummy)) * 100).alias('price')
)

Other ways that people might ask this question (for others searching):

  • groupby_dynamic() within groupby_rolling()
  • How to access polars RollingGroupBy[Dataframe] Object
  • Treat each groupby_rolling() window as a dataframe to aggregate on
  • Nested dataframes within groupby context
  • Nested groupby contexts

1 Answer


Could you .explode() the .groupby_rolling() - then use the resulting column for your .groupby_dynamic()?

(df.groupby_rolling(index_column="date", period="30m", closed="both")
   .agg(pl.col("date").alias("window"))
   .explode("window"))
shape: (70, 2)
┌─────────────────────┬─────────────────────┐
│ date                | window              │
│ ---                 | ---                 │
│ datetime[μs]        | datetime[μs]        │
╞═════════════════════╪═════════════════════╡
│ 2022-01-01 10:00:00 | 2022-01-01 10:00:00 │
│ 2022-01-01 10:05:00 | 2022-01-01 10:00:00 │
│ 2022-01-01 10:05:00 | 2022-01-01 10:05:00 │
│ 2022-01-01 10:10:00 | 2022-01-01 10:00:00 │
│ 2022-01-01 10:10:00 | 2022-01-01 10:05:00 │
│ 2022-01-01 10:10:00 | 2022-01-01 10:10:00 │
│ 2022-01-01 10:15:00 | 2022-01-01 10:00:00 │
│ 2022-01-01 10:15:00 | 2022-01-01 10:05:00 │
│ 2022-01-01 10:15:00 | 2022-01-01 10:10:00 │
│ 2022-01-01 10:15:00 | 2022-01-01 10:15:00 │
│ ...                 | ...                 │
│ 2022-01-01 10:55:00 | 2022-01-01 10:45:00 │
│ 2022-01-01 10:55:00 | 2022-01-01 10:50:00 │
│ 2022-01-01 10:55:00 | 2022-01-01 10:55:00 │
│ 2022-01-01 11:00:00 | 2022-01-01 10:30:00 │
│ 2022-01-01 11:00:00 | 2022-01-01 10:35:00 │
│ 2022-01-01 11:00:00 | 2022-01-01 10:40:00 │
│ 2022-01-01 11:00:00 | 2022-01-01 10:45:00 │
│ 2022-01-01 11:00:00 | 2022-01-01 10:50:00 │
│ 2022-01-01 11:00:00 | 2022-01-01 10:55:00 │
│ 2022-01-01 11:00:00 | 2022-01-01 11:00:00 │
└─────────────────────┴─────────────────────┘

Something along the lines of:

[Edit: Removed the unneeded .join() per @ΩΠΟΚΕΚΡΥΜΜΕΝΟΣ's help.]

(df.groupby_rolling(index_column="date", period="30m", closed="both")
   .agg([pl.col("date").alias("window"), pl.col("price")])
   .explode(["window", "price"])
   .groupby_dynamic(by="date", index_column="window", every="10m", closed="right")
   .agg(pl.col("price")) # pl.col("price").mean()
   .groupby("date", maintain_order=True)
   .agg(pl.all()))
shape: (13, 3)
┌─────────────────────┬─────────────────────────────────────┬──────────────────────────────────┐
│ date                | window                              | price                            │
│ ---                 | ---                                 | ---                              │
│ datetime[μs]        | list[datetime[μs]]                  | list[list[i64]]                  │
╞═════════════════════╪═════════════════════════════════════╪══════════════════════════════════╡
│ 2022-01-01 10:00:00 | [2022-01-01 09:50:00]               | [[10]]                           │
│ 2022-01-01 10:05:00 | [2022-01-01 09:50:00, 2022-01-01... | [[10], [20]]                     │
│ 2022-01-01 10:10:00 | [2022-01-01 09:50:00, 2022-01-01... | [[10], [20, 30]]                 │
│ 2022-01-01 10:15:00 | [2022-01-01 09:50:00, 2022-01-01... | [[10], [20, 30], [40]]           │
│ 2022-01-01 10:20:00 | [2022-01-01 09:50:00, 2022-01-01... | [[10], [20, 30], [40, 50]]       │
│ 2022-01-01 10:25:00 | [2022-01-01 09:50:00, 2022-01-01... | [[10], [20, 30], ... [60]]       │
│ 2022-01-01 10:30:00 | [2022-01-01 09:50:00, 2022-01-01... | [[10], [20, 30], ... [60, 70]]   │
│ 2022-01-01 10:35:00 | [2022-01-01 10:00:00, 2022-01-01... | [[20, 30], [40, 50], ... [80]]   │
│ 2022-01-01 10:40:00 | [2022-01-01 10:00:00, 2022-01-01... | [[30], [40, 50], ... [80, 90]]   │
│ 2022-01-01 10:45:00 | [2022-01-01 10:10:00, 2022-01-01... | [[40, 50], [60, 70], ... [100]]  │
│ 2022-01-01 10:50:00 | [2022-01-01 10:10:00, 2022-01-01... | [[50], [60, 70], ... [100, 110]] │
│ 2022-01-01 10:55:00 | [2022-01-01 10:20:00, 2022-01-01... | [[60, 70], [80, 90], ... [120]]  │
│ 2022-01-01 11:00:00 | [2022-01-01 10:20:00, 2022-01-01... | [[70], [80, 90], ... [120, 130]] │
└─────────────────────┴─────────────────────────────────────┴──────────────────────────────────┘
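To land closer to the exact "10_min_groups_mean_price_history" column from the question, one possible variation of the above (a sketch along the same lines; the mean and the reverse step are added here on the assumption that the most-recent-first ordering is wanted) takes the mean inside the dynamic groups and reverses each resulting list:

(df.groupby_rolling(index_column="date", period="30m", closed="both")
   .agg([pl.col("date").alias("window"), pl.col("price")])
   .explode(["window", "price"])
   .groupby_dynamic(by="date", index_column="window", every="10m", closed="right")
   .agg(pl.col("price").mean())   # mean price per 10-minute sub-group
   .groupby("date", maintain_order=True)
   .agg(
       pl.col("price")
         .reverse()               # most recent sub-group first
         .alias("10_min_groups_mean_price_history")
   ))

This only produces the date and history columns; the original row and price columns would still need to be attached separately.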
  • Thanks for this response, never thought about doing it this way so it's interesting to think about it differently. I've tested this approach and it takes about 3.5 seconds on a sample of approx. 28,000 rows. Not sure if that's considered performant or not for Polars (?) but I can probably get by with that for a while. – demolish Jan 26 '23 at 00:41
  • There are 3 uses of groupby, an explode and join - so it's quite possible there is a smarter/more efficient approach to produce the same result - this is just how I parsed the question. – jqurious Jan 26 '23 at 00:55
  • With regards to performant: I just tried this approach on 44,000 rows and it takes 176 milliseconds for me: `Wall time: 176 ms shape: (44641, 3)` - but perhaps your actual data differs from the example? – jqurious Jan 27 '23 at 02:51
  • I only found one small suggestion. Otherwise, great answer! – ΩΠΟΚΕΚΡΥΜΜΕΝΟΣ Jan 28 '23 at 22:34
  • Thanks ΩΠΟΚΕΚΡΥΜΜΕΝΟΣ for the optimisation. Thanks jqurious for checking. I've edited my question to add the code to create the full dummy dataset I'm using. I'm running the updated solution on an M1 Mac and seeing it come in at about 1.03 seconds - interested to know if you still see ~176ms with this data. Assuming you're not running on something special, I would not have thought laptop hardware would make this much of a difference. – demolish Jan 29 '23 at 07:28
  • With the dummy dataset it's ~3 seconds for me. It seems `f64` versus `i64` is the difference. – jqurious Jan 29 '23 at 07:54
  • Looking at the way the data ends up grouped - with all the repeating patterns and each row nearly being a subset of the next - it feels like there should be some "math way" to improve this. – jqurious Jan 29 '23 at 08:04