
I am seeing very slow performance when calling groupby together with rolling and apply on a large Pandas dataframe (1,500,682 rows). I am trying to compute a rolling moving average with different weights for each lag.

The part of the code that is running slow is:

df['rolling'] = df.groupby('i2')['x'].rolling(3).apply(lambda x: x[-3]*0.1+x[-2]*0.9).reset_index(level=0, drop=True).reindex(df.index)

And the full code (with the data) is:

import pandas as pd
from random import randint


# data (it takes some time to create [less than 1 minute on my computer])
data1   = [[[[randint(0, 100) for i in range(randint(1, 2))] for i in range(randint(1, 3))] for i in range(5000)] for i in range(100)]
data2   = pd.DataFrame(
    [
        (i1, i2, i3, i4, x4)
        for (i1, x1) in enumerate(data1)
        for (i2, x2) in enumerate(x1)
        for (i3, x3) in enumerate(x2)
        for (i4, x4) in enumerate(x3)
    ],
    columns = ['i1', 'i2', 'i3', 'i4', 'x']
)
data2.drop(['i3', 'i4'], axis=1, inplace = True)
df   = data2.set_index(['i1', 'i2']).sort_index()


## slow part of the code ##
df['rolling'] = df.groupby('i2')['x'].rolling(3).apply(lambda x: x[-3]*0.1+x[-2]*0.9).reset_index(level=0, drop=True).reindex(df.index)

If you could elaborate on the code to make it more efficient and execute faster, I would really appreciate it.

Mario Arend

1 Answer


If I understand you correctly, you can try:

grp=df.groupby('i2')['x']
df['rolling']=grp.shift(2).mul(0.1).add(grp.shift(1).mul(0.9))
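As a sanity check, here is a minimal sketch comparing the shift-based expression with a plain NumPy loop over the groups. The frame `small`, its size, and the random seed are made up for illustration; it only mimics the `(i1, i2)` index layout from the question. The idea is that with `rolling(3)` the window at a given row holds the two previous values plus the current one, so `x[-3]` and `x[-2]` are simply the values two rows and one row back within the group, which is what `shift(2)` and `shift(1)` return.

```python
import numpy as np
import pandas as pd

# Hypothetical small frame with the same (i1, i2) index layout as in the question.
rng = np.random.default_rng(0)
idx = pd.MultiIndex.from_product([range(4), range(50)], names=["i1", "i2"])
small = pd.DataFrame(
    {"x": rng.integers(0, 101, size=len(idx)).astype(float)}, index=idx
).sort_index()

# Vectorized version from the answer: two shifts per group, no Python lambda.
grp = small.groupby("i2")["x"]
fast = grp.shift(2).mul(0.1).add(grp.shift(1).mul(0.9))

# Reference: loop over the groups and combine the lagged values with NumPy.
x = small["x"].to_numpy()
expected = np.full(len(x), np.nan)
for pos in small.groupby("i2").indices.values():
    v = x[pos]                                   # values of one group, in row order
    expected[pos[2:]] = 0.1 * v[:-2] + 0.9 * v[1:-1]

# True if both agree, including the NaNs in the first two rows of each group.
same = np.allclose(fast.to_numpy(), expected, equal_nan=True)
print(same)
```

The result of `grp.shift(...)` is already aligned with the original index, so no `reset_index` or `reindex` step is needed before assigning it to a column.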

Now to elaborate:

Why not .apply(...)? It calls the Python lambda once per window, which is slow on 1.5 million rows. See:

When should I ever want to use pandas apply() in my code?

Instead, use anything that leverages vectorized operations. I wrote a more detailed explanation of this here:

https://stackoverflow.com/a/60029108/11610186
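If more lags or other weights are needed, the same idea extends to a sum of weighted shifts. The sketch below assumes a hypothetical helper named `weighted_lag_sum` (not part of pandas) that takes a mapping from lag to weight. The `demo` frame is invented for illustration.

```python
import pandas as pd


def weighted_lag_sum(df, by, col, weights):
    """Sum of w * (value `lag` rows back within each group).

    Hypothetical helper: `weights` maps lag (int) -> weight (float).
    """
    grp = df.groupby(by)[col]
    result = None
    for lag, w in weights.items():
        term = grp.shift(lag).mul(w)          # lagged values, aligned to df.index
        result = term if result is None else result.add(term)
    return result


# Made-up data: 4 values of i1 for each of 2 values of i2.
demo = pd.DataFrame(
    {"x": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0]},
    index=pd.MultiIndex.from_product([[0, 1, 2, 3], [0, 1]], names=["i1", "i2"]),
)

# Same weights as in the question: 0.1 on the value two rows back, 0.9 on the previous one.
demo["rolling"] = weighted_lag_sum(demo, "i2", "x", {2: 0.1, 1: 0.9})
print(demo)
```

Each lag costs one vectorized `shift`, so the work grows with the number of weights rather than with the number of Python-level function calls.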

Grzegorz Skibinski
  • Thank you Grzegorz Skibinski, I read your answer but I am still not sure how to avoid using apply in general. I just posted a new question about slow performance where I use stack(), groupby() and apply(). Here is the link: https://stackoverflow.com/questions/60176200/pandas-very-slow-performance-when-using-stack-groupby-and-apply – Mario Arend Feb 11 '20 at 19:41