I'm trying to implement the R function `embed` in Python. It takes a 1D array and produces a matrix whose columns are lagged subseries of the original array, each of length `size - n + 1` (where `n` is the requested number of columns). It is perhaps best explained with an example.
    import numpy as np

    def embed(data, n):
        # First row is data[n-1], ..., data[0]; each later row shifts the
        # window forward by one element.
        return np.array([data[n-1::-1]] + [
            data[k:i:-1] for i, k in zip(range(0, data.size - n), range(n, data.size))
        ])
    >>> x = np.arange(10)
    >>> y = embed(x, 3)
    >>> y
    array([[2, 1, 0],
           [3, 2, 1],
           [4, 3, 2],
           [5, 4, 3],
           [6, 5, 4],
           [7, 6, 5],
           [8, 7, 6],
           [9, 8, 7]])
Here I create the slices of `data` row by row and stack them on top of one another. The problem is that this is very inefficient: with thousands of data points, the operation takes on the order of seconds, which is not acceptable. The contents of the array are not modified; they are passed on to `sklearn`, which copies them anyway. So I suspect that indexing all these short slices and then copying the whole thing with `np.array` creates a lot of unnecessary overhead.

This seems like the kind of operation that is already implemented in NumPy, or at least easily constructed with better primitives, but searching for `embed` turns up nothing. How could this be achieved?
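One way to avoid the per-row slicing entirely (a sketch; `embed_idx` is a name I made up, not part of any library) is to build a 2D index matrix by broadcasting and gather every window in a single fancy-indexing call:

```python
import numpy as np

def embed_idx(data, n):
    # Broadcast a reversed in-window offset (n-1, ..., 0) against a per-row
    # start offset to get a (size - n + 1, n) index matrix, then gather all
    # windows with one fancy-indexing operation.
    rows = data.size - n + 1
    idx = np.arange(n - 1, -1, -1) + np.arange(rows)[:, None]
    return data[idx]
```

For `embed_idx(np.arange(10), 3)` this produces the same 8×3 array as above. Fancy indexing does copy the data, but it does so in one vectorized pass instead of one Python-level slice per row.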
Most of the performance hit was eliminated by producing the `n` columns first and transposing the result:

    def embed(data, n):
        # Build the n (long) columns instead of the size - n + 1 (short)
        # rows, then transpose to get the row layout shown above.
        return np.array([
            data[i:k+1] for i, k in zip(range(0, n), range(data.size - n, data.size))
        ][::-1]).T
Still, I wonder whether there is a more convenient way of constructing such an array in `numpy`.
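For what it's worth, if NumPy 1.20 or later is available, `numpy.lib.stride_tricks.sliding_window_view` returns all length-`n` windows as a zero-copy view; reversing the last axis matches the column order of `embed` above (a sketch assuming NumPy >= 1.20):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def embed_view(data, n):
    # All length-n windows of data as a read-only view (no data copied);
    # [:, ::-1] reverses each window to match R's embed column order.
    return sliding_window_view(data, n)[:, ::-1]
```

The result is a non-contiguous view, so a consumer that needs its own contiguous copy (as `sklearn` does) will still copy once, but the construction itself is O(1) in memory and time.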