
I have a large dataframe, simplified as:

>>> import numpy as np, pandas as pd
>>> df = pd.DataFrame(np.random.randint(0,2,size=(100, 4)), columns=list('ABCD'))

    A  B  C  D
0   1  0  1  0
1   0  0  0  0
2   1  0  0  0
3   0  1  1  1
4   0  1  1  1
.. .. .. .. ..
95  1  0  0  1
96  0  1  1  0
97  0  0  1  1
98  1  1  1  0
99  0  0  0  0

I want to get a new dataframe that I'll later use as a mask to filter out some values. This mask should contain the number of nonzero elements in a rolling window of size 10.

The solution I am currently using is:

df.rolling(10).apply(lambda x: x.astype(bool).sum(axis=0))

which does the job, but my original dataframe is very large, so for millions of values this takes quite a lot of time and I'm trying to optimize it. I thought of moving the astype(bool) step before creating the rolling window, but it seems I would still need the apply(lambda ...) construct, which is the real efficiency bottleneck here.
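
For reference, the variant I had in mind looks roughly like this (just a sketch, not benchmarked):

# Cast to bool once up front; the per-window count still goes through apply,
# which is the part I suspect is slow.
bool_df = df.astype(bool)
counts = bool_df.rolling(10).apply(lambda x: x.sum())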

lazarea
    Yeah in general avoid the `.apply` in favor of built-ins. https://stackoverflow.com/a/63307380/4333359 (same rules apply for `rolling` as `groupby`) – ALollz Mar 04 '21 at 18:54

1 Answer


Not using apply could help:

(df != 0).rolling(10).sum()
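
If you then want to use the counts to filter the original frame, here is a sketch (the threshold of 5 below is just an illustrative value, not something from the question):

counts = (df != 0).rolling(10).sum()
mask = counts >= 5        # hypothetical threshold for "enough nonzero values in the window"
filtered = df[mask]       # entries where mask is False become NaN (df[mask] is df.where(mask))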

Timing (for 1000 records):

%%time
_ = df.rolling(10).apply(lambda x: x.astype(bool).sum(axis=0))

CPU times: user 517 ms, sys: 13.6 ms, total: 531 ms
Wall time: 522 ms

%%time
_ = (df != 0).rolling(10).sum()

CPU times: user 2.04 ms, sys: 315 µs, total: 2.35 ms
Wall time: 1.45 ms
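
As a sanity check, both versions should give identical results, including the NaNs in the first 9 rows where the window is not yet full. A quick comparison on a randomly generated 1000-row frame (sketch):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 2, size=(1000, 4)), columns=list('ABCD'))

slow = df.rolling(10).apply(lambda x: x.astype(bool).sum(axis=0))
fast = (df != 0).rolling(10).sum()

print(slow.equals(fast))  # expected: True
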
perl