How can I speed up a pandas groupby with apply and rolling(n, min_periods=k)?

Question

In pandas, is there a way to speed up code that combines groupby and apply with rolling and its min_periods argument?

Several people have described ways to speed up groupby combined with apply. For example, this page describes a very fast way to calculate a weighted moving average. However, I could not find any method that supports the min_periods argument.

My sample code is below, but if I run it as is, it takes more than 15 seconds in my environment.

from string import ascii_letters

import numpy as np
import pandas as pd
from numpy.random import choice

N = 15_000_000
np.random.seed(123)
letters = list(ascii_letters)
words = ["".join(choice(letters, 5)) for _ in range(30)]

df = pd.DataFrame({
    "hoge": choice(words, N),
    "fuga": choice(words, N),
    "piyo": choice(words, N),
    "metricA": np.random.rand(N),
    "metricB": np.random.rand(N),
})

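# This code takes over 15 seconds in my env!
# Within each group: drop the current row (shift by 1), then average up to
# the 3 previous values; min_periods=1 means one prior value is enough.
def func(group):
    return group.shift(1).rolling(3, min_periods=1).mean()

df.groupby(['hoge', 'fuga', 'piyo'])[['metricA', 'metricB']].apply(func)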
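
One direction that might be worth timing, offered as a sketch rather than a measured solution: for a small fixed window, the shifted rolling mean with min_periods=1 is just the mean of whichever of the previous 3 values in the group exist. That can be built from groupby(...).shift(lag), which avoids calling a Python function once per group. The helper name shifted_rolling_mean and the tiny demo frame are hypothetical, not from the question, and results may differ from rolling() in the last floating-point digits.

```python
import numpy as np
import pandas as pd


def shifted_rolling_mean(df, keys, cols, window=3):
    """Per-group equivalent of shift(1).rolling(window, min_periods=1).mean().

    Hypothetical helper: averages the previous `window` values of each group,
    using however many of them exist and are not NaN.
    """
    grouped = df.groupby(keys, sort=False)[cols]
    total = pd.DataFrame(0.0, index=df.index, columns=cols)
    count = pd.DataFrame(0, index=df.index, columns=cols)
    for lag in range(1, window + 1):
        lagged = grouped.shift(lag)  # within-group shift, no Python per-group call
        total += lagged.fillna(0.0)
        count += lagged.notna().astype(int)
    # Rows with no earlier observation in their group stay NaN,
    # which matches min_periods=1 applied after shift(1).
    return total / count.where(count > 0)


# Small demo frame (assumed data, much smaller than the 15M rows above).
rng = np.random.default_rng(123)
keys = ["hoge", "fuga"]
cols = ["metricA", "metricB"]
small = pd.DataFrame({
    "hoge": rng.choice(["a", "b", "c"], 1000),
    "fuga": rng.choice(["x", "y"], 1000),
    "metricA": rng.random(1000),
    "metricB": rng.random(1000),
})
out = shifted_rolling_mean(small, keys, cols)
```

The loop runs window times over the whole frame, so this only makes sense when the window is small (3 here); for large windows, the built-in groupby(...).rolling(...) may be the better thing to try.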

Please provide enough code so others can better understand or reproduce the problem. — Community, May 19 '22 at 14:13
In my experience, `merge` gives a significant performance improvement compared to `groupby` followed by `apply`/`transform`. My suggestion: if you can use `loc` to extract a subsample and calculate on that subsample, then merging it back with `merge` will save a lot of time — PTQuoc, May 20 '22 at 12:11
Thank you for your response. However, that solution does not seem to work well in my case. I tried writing code that uses the `itertools.product` function to enumerate all combinations of `hoge`, `fuga` and `piyo` and select each one with `loc`. Unfortunately, this code was slower than using `groupby` and `apply`. Maybe I am not fully understanding what you are saying. — RiK, May 21 '22 at 05:48
Oh the days where 15 seconds was a long time... that's barely enough time for a proper multi-threaded solution to get started xD — BeRT2me, May 21 '22 at 05:55

0 Answers