(related to this answer)
Given a df, I was expecting to get the results of df.expanding() and to perform some multivariate operations on it (operations involving several columns of df simultaneously, over an expanding window of rows) using .apply(). It turns out this is not possible.
So, much as in the answer linked above, I need to use numpy.lib.stride_tricks.as_strided on the values of df. Except that, in contrast to the question linked above, I want to use strides to get expanding views of my df instead of rolling ones (an expanding window has a fixed left edge, while its right edge moves one row to the right at each step).
Consider this df:
import numpy
import pandas
df = pandas.DataFrame(numpy.random.normal(0, 1, [100, 2]), columns=['size_A', 'size_B']).cumsum(axis=0)
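For context, the kind of per-window operation I have in mind needs both columns at once; my_func below is only a hypothetical placeholder, and the small check after it illustrates why .expanding().apply() (with default settings) cannot run it, since each call only ever receives one column at a time:
# Hypothetical placeholder for the kind of multivariate operation I mean:
# it needs the full (k, 2) block of rows 0..k-1, with both columns together.
def my_func(window):
    return numpy.corrcoef(window[:, 0], window[:, 1])[0, 1]

# With default settings, .expanding().apply() hands the function one column
# at a time, so it only ever sees one-dimensional data:
seen_shapes = []
df.expanding().apply(lambda x: seen_shapes.append(numpy.asarray(x).shape) or 0.0)
# seen_shapes contains only 1-d shapes such as (1,), (2,), (3,), ...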
Consider this code, which extracts a rolling window of W rows of that df (it comes from the answer above):
def get_sliding_window(df, W):
    a = df.values
    s0, s1 = a.strides
    m, n = a.shape
    # one window of W rows starting at each of the m - W + 1 possible offsets
    return numpy.lib.stride_tricks.as_strided(
        a, shape=(m - W + 1, W, n), strides=(s0, s0, s1))

roll_window = get_sliding_window(df, W=3)
roll_window[2]
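A quick sanity check (assuming the df defined above) that the strided view matches a plain slice of the frame:
# the third rolling window should be rows 2, 3 and 4 of df
assert numpy.array_equal(roll_window[2], df.iloc[2:5].values)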
Now I want to modify get_sliding_window so that it returns an expanding window over df (instead of a rolling one):
def get_expanding_window(df):
    a = df.values
    s0, s1 = a.strides
    m, n = a.shape
    out = numpy.lib.stride_tricks.as_strided(
        a, shape=(m, m, n), strides=(s0, s0, s1))
    return out

expg_window = get_expanding_window(df)
expg_window[2]
But I'm not using the arguments of as_strided correctly: I can't seem to get the right matrices, which would be something like:
[df.iloc[0:1].values, df.iloc[0:2].values, df.iloc[0:3].values, ...]
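To make the target fully explicit, here is a deliberately naive reference construction of that list (only to pin down the desired shapes, not as a solution):
# expected[k] holds rows 0..k of df, both columns, i.e. shape (k + 1, 2)
expected = [df.iloc[:j + 1].values for j in range(len(df))]
expected[2].shape   # (3, 2)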
Edit:
In a comment, @ThomasKühn proposes using a list comprehension. That would solve the problem, but far too slowly. How costly is it? Using a simple function (the sum over each expanding window of a single column), we can compare the cost of the list comprehension with that of .expanding(). The difference is not small:
numpy.random.seed(123)
df = pandas.DataFrame((numpy.random.normal(0, 1, 10000)), columns=['Value'])
%timeit method_1 = numpy.array([df.Value.iloc[range(j + 1)].sum() for j in range(df.shape[0])])
gives:
6.37 s ± 219 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Comparing to .expanding():
%timeit method_2 = df.expanding(0).apply(lambda x: x.sum())
which gives:
35.5 ms ± 356 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
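For completeness, a quick consistency check (same seed and df as in the timings above) confirming that the two methods agree, so the comparison is like for like:
method_1 = numpy.array([df.Value.iloc[range(j + 1)].sum() for j in range(df.shape[0])])
method_2 = df.expanding(0).apply(lambda x: x.sum())
# identical numbers, very different run times
assert numpy.allclose(method_1, method_2['Value'].values)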
Finally, there are more details about the problem I am trying to solve in the comments to this question.