(related to this answer)

Given a df, I was expecting to take df.expanding() and perform some multivariate operations on it (operations involving several columns of df simultaneously, over an expanding window of rows) using .apply(). It turns out this is not possible.
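
A minimal check of that limitation, assuming pandas' column-by-column behaviour of .apply() on window objects (not part of the original question):

import numpy
import pandas

df = pandas.DataFrame(numpy.random.normal(0, 1, [100, 2]),
                      columns=['size_A', 'size_B']).cumsum(axis=0)

def peek(window):
    # .apply() hands the function one column at a time, never both columns at once
    assert window.ndim == 1
    return window.sum()

df.expanding().apply(peek)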

So, much as in the answer linked above, I need to use numpy.lib.stride_tricks.as_strided on the values of df. In contrast to the question linked above, though, I want to use strides to get expanding views of my df instead of rolling ones (an expanding window has a fixed left edge, while the right edge moves one row to the right at each step).

Consider this df:

import numpy
import pandas


df = pandas.DataFrame(numpy.random.normal(0, 1, [100, 2]), columns=['size_A', 'size_B']).cumsum(axis=0)

Consider this code to extract a rolling window of W rows from that df (it comes from the answer linked above):

def get_sliding_window(df, W):
    a = df.values
    s0, s1 = a.strides               # byte strides along rows and along columns
    m, n = a.shape
    # m-W+1 overlapping windows, each of W rows and n columns, all sharing a's memory
    return numpy.lib.stride_tricks.as_strided(
        a, shape=(m - W + 1, W, n), strides=(s0, s0, s1))

roll_window = get_sliding_window(df, W=3)
roll_window[2]   # rows 2, 3 and 4 of df
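
As a quick sanity check on the strided view (not in the original post), each window can be compared against the corresponding slice of df:

# roll_window[i] is a view of rows i .. i+W-1, so for i = 2:
numpy.testing.assert_array_equal(roll_window[2], df.iloc[2:5].values)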

Now I want to modify get_sliding_window so that it returns an expanding window of df (instead of a rolling one):

def get_expanding_window(df):
    a = df.values
    s0, s1 = a.strides
    m, n = a.shape
    out = numpy.lib.stride_tricks.as_strided(
        a, shape=(m, m, n), strides=(s0, s0, s1))
    return out

expg_window = get_expanding_window(df)
expg_window[2]

But I'm not using the arguments of as_strided correctly: I can't seem to get the right matrices, which would be something like:

[df.iloc[0:1].values, df.iloc[0:2].values, df.iloc[0:3].values, ...]
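
For reference, the slice-based construction of that target suggested in the comments below; building the views themselves is cheap, since basic slicing of the underlying array copies no data (the cost discussed in the edit further down comes from calling a Python-level function on each view):

a = df.values
expanding_views = [a[:j] for j in range(1, a.shape[0] + 1)]
# expanding_views[0] is row 0, expanding_views[1] is rows 0-1, and so on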

Edit:

In a comment, @ThomasKühn proposes using a list comprehension. This would solve the problem, but it is too slow. What does it cost?

For a simple function applied to each expanding window (here, a sum), we can compare the cost of the list comprehension with that of .expanding(). The difference is not small:

numpy.random.seed(123)
df = pandas.DataFrame((numpy.random.normal(0, 1, 10000)), columns=['Value'])
%timeit method_1 = numpy.array([df.Value.iloc[range(j + 1)].sum() for j in range(df.shape[0])])

gives:

6.37 s ± 219 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

comparing to .expanding():

%timeit method_2 = df.expanding(0).apply(lambda x: x.sum())

which gives:

35.5 ms ± 356 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
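
As an aside (not in the original post), for this particular toy function the expanding sum is just a cumulative sum, so a single vectorised call is faster than either method; the comparison above is really measuring the per-window Python overhead rather than the best way to compute a sum:

method_3 = df.Value.cumsum().values   # same values as method_1 and method_2, computed in one pass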

Finally, there are more details as to the problem I am trying to solve in the comments to this question.

user189035
  • Do you mean you want the left side of the window to be fixed and the right side to move to the right? – Thomas Kühn Feb 16 '18 at 08:38
  • @ThomasKühn: yes (edited the question to make this more clear). Thanks! – user189035 Feb 16 '18 at 10:18
  • I think what you are trying to do is not possible with `strides`, as they seem to assume fixed length. I think what you want are `slices`. To me this looks a bit like an [xy problem](https://meta.stackexchange.com/q/66377/358966). Can you maybe elaborate a bit more on what you are trying to accomplish? – Thomas Kühn Feb 16 '18 at 10:36
  • How about this: `indices = [i for j in range(1,5) for i in range(j)]` and then `df.iloc[indices]`. Is that what you want? – Thomas Kühn Feb 16 '18 at 10:45
  • ...but do you need to *keep* these slices, or are you just doing some calculations on them? Just accessing a slice of an array (e.g. `a[3:5]`) only creates a [view](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.ndarray.view.html) of the array, so no data is copied. See also the [documentation](https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.indexing.html): "All arrays generated by basic slicing are always views of the original array." – Thomas Kühn Feb 16 '18 at 11:06
  • Forget about the 'indices=' comment, it's not going to be good for you. You want to go with slices. Something like `result = [your_function(a[0:i]) for i in range(1,max_i)]`. It's really hard to give a good example without a concrete idea of what you are up to. – Thomas Kühn Feb 16 '18 at 11:13
  • If we forget about speed at first (and the DataFrame, which appears to be unnecessary for this problem), does this code calculate what you want? `x = np.random.normal(0,1,(10000,2)); res = [np.sum(x[:i,0] > x[i,1]) for i in range(x.shape[0])]` – Thomas Kühn Feb 16 '18 at 19:49
  • @ThomasKühn: yes, speed is the only remaining issue – user189035 Feb 17 '18 at 19:17

1 Answer

I wrote a few functions that are all supposed to do the same thing, but take different amounts of time to complete the task:

import timeit
import numpy as np
import numba as nb

x = np.random.normal(0, 1, (10000, 2))
def f1():
    # list comprehension: one numpy comparison and reduction per expanding slice
    res = [np.sum(x[:i,0] > x[i,1]) for i in range(x.shape[0])]
    return res

def f2():
    # explicit loop, writing the comparison into a preallocated buffer before summing
    buf = np.empty(x.shape[0])
    res = np.empty(x.shape[0])
    for i in range(x.shape[0]):
        buf[:i] = x[:i,0] > x[i,1]
        res[i] = np.sum(buf[:i])
    return res

def f3():
    # explicit loop, summing the comparison result directly
    res = np.empty(x.shape[0])
    for i in range(x.shape[0]):
        res[i] = np.sum(x[:i,0] > x[i,1])
    return res


@nb.jit(nopython=True)
def f2_nb():
    # same as f2, but compiled with numba
    buf = np.empty(x.shape[0])
    res = np.empty(x.shape[0])
    for i in range(x.shape[0]):
        buf[:i] = x[:i,0] > x[i,1]
        res[i] = np.sum(buf[:i])
    return res

@nb.jit(nopython=True)
def f3_nb():
    # same as f3, but compiled with numba
    res = np.empty(x.shape[0])
    for i in range(x.shape[0]):
        res[i] = np.sum(x[:i,0] > x[i,1])
    return res

##checking that all functions give the same result:
print('checking correctness')
print(np.all(f1()==f2()))
print(np.all(f1()==f3()))
print(np.all(f1()==f2_nb()))
print(np.all(f1()==f3_nb()))

print('+'*50)
print('performance tests')
print('f1()')        
print(min(timeit.Timer(
    'f1()',
    setup = 'from __main__ import f1,x',
).repeat(7,10)))

print('-'*50)
print('f2()')
print(min(timeit.Timer(
    'f2()',
    setup = 'from __main__ import f2,x',
).repeat(7,10)))

print('-'*50)
print('f3()')
print(min(timeit.Timer(
    'f3()',
    setup = 'from __main__ import f3,x',
).repeat(7,10)))

print('-'*50)
print('f2_nb()')
print(min(timeit.Timer(
    'f2_nb()',
    setup = 'from __main__ import f2_nb,x',
).repeat(7,10)))

print('-'*50)
print('f3_nb()')
print(min(timeit.Timer(
    'f3_nb()',
    setup = 'from __main__ import f3_nb,x',
).repeat(7,10)))

As you can see, the differences between the functions are small, but they do affect performance. The last two functions are just 'duplicates' of earlier ones, but with numba optimisation. The results of the speed tests are:

checking correctness
True
True
True
True
++++++++++++++++++++++++++++++++++++++++++++++++++
performance tests
f1()
2.02294262702344
--------------------------------------------------
f2()
3.0964318679762073
--------------------------------------------------
f3()
1.9573561699944548
--------------------------------------------------
f2_nb()
1.3796060049789958
--------------------------------------------------
f3_nb()
0.48667875200044364

As you can see, the differences are not terribly big, but between the slowest and the fastest function the speedup is still about a factor of 6. Hope this helps.
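
A possible follow-up (not part of the original answer): passing the array as an argument instead of reading the global x keeps the compiled function reusable for other inputs. The name expanding_count below is made up for illustration:

@nb.jit(nopython=True)
def expanding_count(arr):
    # for each row i, count how many of the previous values in column 0
    # exceed the current value in column 1
    res = np.empty(arr.shape[0])
    for i in range(arr.shape[0]):
        res[i] = np.sum(arr[:i, 0] > arr[i, 1])
    return res

res = expanding_count(x)   # same result as f3_nb()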

Thomas Kühn