I'm trying to implement the R function `embed` in Python. It takes a 1D array and produces a matrix whose columns are lagged subseries of the original array, each of length `size - n + 1` (where `n` is the requested number of columns). It is perhaps best explained with an example.
    import numpy as np

    def embed(data, n):
        # First row is data[n-1], ..., data[0]; each later row shifts the
        # window forward by one element.
        return np.array([data[n-1::-1]] + [
            data[k:i:-1] for i, k in zip(range(0, data.size - n), range(n, data.size))
        ])
    >>> x = np.arange(10)
    >>> y = embed(x, 3)
    >>> y
    array([[2, 1, 0],
           [3, 2, 1],
           [4, 3, 2],
           [5, 4, 3],
           [6, 5, 4],
           [7, 6, 5],
           [8, 7, 6],
           [9, 8, 7]])
Here I create the slices of `data` row by row and stack them on top of one another. The problem is that this is very inefficient: with thousands of data points, the operation takes on the order of seconds, which is not acceptable. The contents of the array are not modified; they are passed on to `sklearn`, which copies them anyway. So I suspect that indexing all these short slices and then copying the whole thing with `np.array` creates a lot of unnecessary overhead.

This seems like the kind of operation that is already implemented in NumPy, or at least easily constructed with better primitives, but searching for `embed` turns up nothing. How could this be achieved?
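One way to avoid the per-row slicing entirely (a sketch; `embed_idx` is a name I made up, not part of any library) is to build a 2D index matrix by broadcasting and gather every window in a single fancy-indexing call:

```python
import numpy as np

def embed_idx(data, n):
    # Broadcast a reversed in-window offset (n-1, ..., 0) against a per-row
    # start offset to get a (size - n + 1, n) index matrix, then gather all
    # windows with one fancy-indexing operation.
    rows = data.size - n + 1
    idx = np.arange(n - 1, -1, -1) + np.arange(rows)[:, None]
    return data[idx]
```

For `embed_idx(np.arange(10), 3)` this produces the same 8×3 array as above. Fancy indexing does copy the data, but it does so in one vectorized pass instead of one Python-level slice per row.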
Most of the performance hit was eliminated by producing the `n` columns first and transposing the result:

    def embed(data, n):
        # Build the n (long) columns instead of the size - n + 1 (short)
        # rows, then transpose to get the row layout shown above.
        return np.array([
            data[i:k+1] for i, k in zip(range(0, n), range(data.size - n, data.size))
        ][::-1]).T
Still, I wonder whether there is a more convenient way of constructing such an array in `numpy`.
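For what it's worth, if NumPy 1.20 or later is available, `numpy.lib.stride_tricks.sliding_window_view` returns all length-`n` windows as a zero-copy view; reversing the last axis matches the column order of `embed` above (a sketch assuming NumPy >= 1.20):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def embed_view(data, n):
    # All length-n windows of data as a read-only view (no data copied);
    # [:, ::-1] reverses each window to match R's embed column order.
    return sliding_window_view(data, n)[:, ::-1]
```

The result is a non-contiguous view, so a consumer that needs its own contiguous copy (as `sklearn` does) will still copy once, but the construction itself is O(1) in memory and time.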