I have a large dataframe, simplified as:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.random.randint(0, 2, size=(100, 4)), columns=list('ABCD'))
>>> df
A B C D
0 1 0 1 0
1 0 0 0 0
2 1 0 0 0
3 0 1 1 1
4 0 1 1 1
.. .. .. .. ..
95 1 0 0 1
96 0 1 1 0
97 0 0 1 1
98 1 1 1 0
99 0 0 0 0
I want to get a new dataframe that I'll later use as a mask to filter out some values. This mask should show the number of nonzero elements in each rolling window of size 10.
My current solution is:
df.rolling(10).apply(lambda x: x.astype(bool).sum(axis=0))
which does the job, but my original dataframe is very large, so I'm trying to optimize this because it takes a long time for millions of values. I thought of moving the astype(bool) part before the rolling window creation, but it seems I'd still need the apply(lambda ...) construct, which is the real efficiency bottleneck here.
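For reference, here is a minimal sketch of the "convert first" idea, assuming the built-in rolling sum is an acceptable replacement for the apply call. The names nonzero, fast and slow are introduced here only for illustration, and the data is seeded random data rather than my real dataframe. Built-in rolling aggregations run in compiled code, so this may avoid the per-window Python call, but I have not benchmarked it on the full data.

```python
import numpy as np
import pandas as pd

# Seeded toy data with the same shape as the simplified example (assumption)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 2, size=(100, 4)), columns=list("ABCD"))

# Convert to 0/1 flags once, up front, instead of inside every window
nonzero = df.ne(0).astype(int)

# Built-in rolling sum of the flags gives the nonzero count per window
fast = nonzero.rolling(10).sum()

# Original apply-based version, kept only to compare the two results
slow = df.rolling(10).apply(lambda x: x.astype(bool).sum(axis=0))
```

As in the apply version, the first 9 rows are NaN because the window is not yet full; rolling(10, min_periods=1) would give partial counts instead, if that is preferable for the mask.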