14

I need to extract all subsequences of a time series/array of a given window. For example:

>>> ts = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> window = 3
>>> subsequences(ts, window)
array([[0, 1, 2],
       [1, 2, 3],
       [2, 3, 4],
       [3, 4, 5],
       [4, 5, 6],
       [5, 6, 7],
       [6, 7, 8],
       [7, 8, 9]])

Naive methods that iterate over the sequence are of course expensive, for example:

def subsequences(ts, window):
    res = []
    for i in range(ts.size - window + 1):
        subts = ts[i:i+window]
        subts.reset_index(drop=True, inplace=True)
        subts.name = None
        res.append(subts)
    return pd.DataFrame(res)

I found a better way by copying the sequence, shifting it by a different value until the window is covered, and splitting the different sequences with reshape. Performance is around 100x better, because the for loop iterates over the window size rather than over the sequence size:

def subsequences(ts, window):
    res = []
    for i in range(window):
        subts = ts.shift(-i)[:-(ts.size % window)].values.reshape((ts.size // window, window))
        res.append(subts)
    return pd.DataFrame(np.concatenate(res, axis=0))

I've seen that pandas includes several rolling functions in the pandas.stats.moments module, and I guess what they do is somehow similar to this subsequencing problem. Is there anything in that module, or anywhere else in pandas, to make this more efficient?

Thank you!

UPDATE (SOLUTION):

Based on @elyase's answer, for this specific case there is a slightly simpler implementation. Let me write it down here and explain what it's doing:

def subsequences(ts, window):
    # one row per valid starting position, `window` columns
    shape = (ts.size - window + 1, window)
    # step one element to reach both the next column and the next row
    strides = ts.strides * 2
    return np.lib.stride_tricks.as_strided(ts, shape=shape, strides=strides)
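As a quick check (added here, not part of the original post), running this on the example series from the top reproduces the expected output; `.values` is passed because the function expects a plain 1-D numpy array:

>>> import numpy as np
>>> import pandas as pd
>>> ts = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> subsequences(ts.values, 3)
array([[0, 1, 2],
       [1, 2, 3],
       [2, 3, 4],
       [3, 4, 5],
       [4, 5, 6],
       [5, 6, 7],
       [6, 7, 8],
       [7, 8, 9]])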

Given the 1-D numpy array, we first compute the shape of the resulting array. There will be one row starting at each position of the input, except for the last few positions, where there aren't enough elements left to complete a window.

See in the first example of this description how the last position we start at is 7: starting at 8, we can't fill a window of three elements. So the number of rows is the size minus the window plus one. The number of columns is simply the window.
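For the running example, this is just the arithmetic above restated (not part of the original post):

>>> ts.size, window
(10, 3)
>>> (ts.size - window + 1, window)   # (rows, columns)
(8, 3)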

Next, the tricky part is telling as_strided how to fill the resulting array, given the shape we just defined.

To do so, consider that the first element of the result is the first element of the original array. Then we need to specify two values (a tuple of two integers as the argument to the strides parameter). These values describe the steps we need to take in the original 1-D array in order to fill the resulting 2-D array.

Consider a different example, where we want to implement the np.reshape function, going from a 9-element 1-D array to a 3x3 array. The first element fills the first position, and the one at its right is the next element of the 1-D array, so we move 1 step. Then, the tricky part: to fill the first element of the second row, we take 3 steps, from the 0 to the 3, see:

>>> original = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8])
>>> new = original.reshape((3, 3))
>>> new
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

So, to reshape, our steps for the two dimensions would be (3, 1) in numpy's (row, column) order: 3 steps to move down one row, 1 step to move right one column. For our case, where the windows overlap, it is actually simpler: when we move down to the next row of the resulting array we just start at the next position of the 1-D array, and when we move right within a row we also take the next element, so 1 step in both cases. The steps would be (1, 1).
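To make both cases concrete, here is a small illustration (added here, not part of the original post) written directly with as_strided on the array above; itemsize is the number of bytes per element, which is what the next paragraph is about:

>>> step = original.itemsize   # bytes per element, e.g. 8 for int64
>>> np.lib.stride_tricks.as_strided(original, shape=(3, 3), strides=(3 * step, step))
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
>>> np.lib.stride_tricks.as_strided(original, shape=(7, 3), strides=(step, step))
array([[0, 1, 2],
       [1, 2, 3],
       [2, 3, 4],
       [3, 4, 5],
       [4, 5, 6],
       [5, 6, 7],
       [6, 7, 8]])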

There is only one last thing to note. The strides argument does not take the "steps" we counted, but the equivalent number of bytes in memory. To get them, we can use the strides attribute of numpy arrays. It returns a tuple with the strides (steps in bytes), one element per dimension. In our case we get a one-element tuple, and we want the same value twice, hence the * 2.
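Concretely (assuming 64-bit integers, i.e. 8 bytes per element, which is what the example Series uses):

>>> arr = ts.values          # 1-D numpy array backing the Series
>>> arr.strides
(8,)
>>> arr.strides * 2          # (row stride, column stride) passed to as_strided
(8, 8)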

The np.lib.stride_tricks.as_strided function builds the array with the described shape and strides without copying the data, which makes it quite efficient.
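A quick way to verify that no copy is made (a check added here, not part of the original post) is np.shares_memory:

>>> windows = subsequences(ts.values, 3)
>>> np.shares_memory(windows, ts.values)
True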

Finally, note that the function posted here assumes a 1-D input array (which is different from a 2-D array with a single row or column). Check the shape attribute of the input: you should get something like (N,) and not (N, 1). The method would fail on the latter. Note that the method posted by @elyase handles two-dimensional input arrays (that's why this version is slightly simpler).
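If you need to accept both, a minimal variant (a hypothetical helper sketched here, not from the original post) is to flatten the input first; the only difference from the version above is the ravel call:

def subsequences_1d(ts, window):
    a = np.asarray(ts).ravel()   # accepts a Series, a (N,) array or a (N, 1) array
    shape = (a.size - window + 1, window)
    strides = a.strides * 2
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)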

Marc Garcia
  • when you say the naive method is expensive I assume that you have actually profiled your program and that is indeed a bottleneck? – Joran Beasley Jan 09 '15 at 01:10
  • Yes, as I need to iterate over the whole sequence, there is no optimization in the computations, and it is slow. For a sequence of 4719 elements, and a window of 5, it takes around 700 milliseconds. The second approach, for the same data, takes around 8 milliseconds. The question is if pandas (or numpy) can do that without needing to iterate at all, which should be still faster. – Marc Garcia Jan 09 '15 at 01:14
  • you might have better luck at codereview.stackexchange.com I would put your timing info up there in the question as well – Joran Beasley Jan 09 '15 at 01:21

3 Answers

16

This is 34x faster than your fast version on my machine:

def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

>>> rolling_window(ts.values, 3)
array([[0, 1, 2],
       [1, 2, 3],
       [2, 3, 4],
       [3, 4, 5],
       [4, 5, 6],
       [5, 6, 7],
       [6, 7, 8],
       [7, 8, 9]])

Credit goes to Erik Rigtorp.

elyase
  • Thanks a lot elyase! Your solution is also faster in my machine, but it looks like most of the gain is because computations are performed in numpy, instead of pandas. If in your solution I convert the returning numpy array to a pandas DataFrame the gain is around 10%, which is far from the 34x, but it's good. If I convert my solution to numpy, the performance of your solution is still better, but just slightly. Let me leave the question still open, to see if there is still a faster solution. Thank you! – Marc Garcia Jan 09 '15 at 10:17
  • Is it possible change it to shift forward by `N` observations, as opposed to `1` (as implemented in your answer)? I played around a bit but could not manage to get it to work. – Zhubarb Aug 14 '15 at 13:34
  • Hi @Rhubarb, I played around with the code and made a [gist](https://gist.github.com/sa2812/1cc7889f10c4d340faf68cbe78fd94b9) to reflect the changes to the function above – sunny Jul 25 '17 at 07:33
  • @elyase Please, how can I make the overlap 50%? I mean, make the stride equal to the length of the sequence / 2 – Hana90 Aug 02 '18 at 21:09
  • I think it is worth noting that "it is advisable to avoid as_strided when possible", as stated in [its own documentation](https://numpy.org/doc/stable/reference/generated/numpy.lib.stride_tricks.as_strided.html) – user11696358 Jan 18 '22 at 09:12
1

It is worth noting that the stride tricks can have unintended consequences when working on the transformed array. It is efficient because it modifies the memory pointers without creating a copy of the original array. If you update any values in the returned array, it changes the values in the original array, and vice-versa.

l = np.asarray([1, 2, 3, 4, 5, 6, 7, 8, 9])
_ = rolling_window(l, 3)
print(_)
[[1 2 3]
 [2 3 4]
 [3 4 5]
 [4 5 6]
 [5 6 7]
 [6 7 8]
 [7 8 9]]

_[0,1] = 1000
print(_)
[[   1 1000    3]
 [1000    3    4]
 [   3    4    5]
 [   4    5    6]
 [   5    6    7]
 [   6    7    8]
 [   7    8    9]]

# create new matrix from original array
xx = pd.DataFrame(rolling_window(l, 3))
# the values updated through _ are visible here too
print(xx)
      0     1  2
0     1  1000  3
1  1000     3  4
2     3     4  5
3     4     5  6
4     5     6  7
5     6     7  8
6     7     8  9

# changing values in xx changes values in _ and l
xx.loc[0,1] = 100
print(_)
print(l)
[[  1 100   3]
 [100   3   4]
 [  3   4   5]
 [  4   5   6]
 [  5   6   7]
 [  6   7   8]
 [  7   8   9]]
[  1 100   3   4   5   6   7   8   9]

# make a dataframe copy to avoid unintended side effects
new = xx.copy()
# changing values in new won't affect l, _, or xx

Any value changed in xx, _, or l shows up in the others, because they all share the same underlying memory.

See numpy docs for more detail: numpy.lib.stride_tricks.as_strided

jkm
1

I'd like to note that PyTorch offers a single function for this problem. It is as memory efficient as the current best solution when working with Torch tensors, but is much simpler and more general (e.g. it also works with multiple dimensions):

# Import packages
import torch
import pandas as pd
# Create array and set window size
ts = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
window = 3
# Create subsequences by converting to/from a Tensor
ts_torch = torch.from_numpy(ts.values)  # convert to a torch Tensor
ss_torch = ts_torch.unfold(0, window, 1)  # create overlapping subsequences as a view (no copy)
ss_numpy = ss_torch.numpy() # convert Tensor back to numpy (obviously now needs more memory)
# Or just in a single line:
ss_numpy = torch.from_numpy(ts.values).unfold(0, window, 1).numpy()

The main point is the unfold function; see the PyTorch docs for a detailed explanation. Converting back to numpy may not be required if you are OK working directly with PyTorch tensors; in that case the solution is just as memory efficient. In my use case, I found it easier to first create the subsequences (and do other preprocessing) using Torch tensors, and then use .numpy() on those tensors to convert back to numpy as and when needed.
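For reference (a small illustration added here, not part of the original answer), unfold(dimension, size, step) also takes a step argument, so strided or non-overlapping windows fall out naturally; this assumes the same ts and window as above:

>>> t = torch.from_numpy(ts.values)
>>> t.unfold(0, 3, 1).shape    # overlapping windows, step 1
torch.Size([8, 3])
>>> t.unfold(0, 3, 2)          # keep every second starting position
tensor([[0, 1, 2],
        [2, 3, 4],
        [4, 5, 6],
        [6, 7, 8]])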

Olivier