I want to process the following dataset so that I can fit an RNN in Keras, which expects input of shape (batch_size, timesteps, features). This is a simplified example of the dataset:

import numpy as np
import pandas as pd
X = np.array([[1,2,3,4,5],[7,8,9,10,11],[12,13,14,15,16]]).T
data = pd.DataFrame(X,columns=['feature1','feature2','outcome'])

feature1    feature2    outcome
   1           7          12
   2           8          13
   3           9          14
   4          10          15
   5          11          16

I now want to create a numpy array that reflects a lag of 2 for outcome. My goal is to predict the outcome, given the values of the previous two timesteps.

That is, I want an array that looks like this.

batch_size = 3 # for this particular dataset
timesteps = 2
features = 2
out = np.empty(shape=(batch_size,timesteps,features))
out[0] = np.array([[1,7],[2,8]])
out[1] = np.array([[2,8],[3,9]])
out[2] = np.array([[3,9],[4,10]])
y = np.array([14,15,16])
print(out)

[[[ 1.  7.]
  [ 2.  8.]]

 [[ 2.  8.]
  [ 3.  9.]]

 [[ 3.  9.]
  [ 4. 10.]]]

With the outcome represented as:

print(y)
[14 15 16]

As you can see, there are a total of 3 possible combinations (shape[0]), where each combination has 2 lags (shape[1]) and two features (shape[2]).

cs95
imarevic

1 Answer

You may use numpy's stride_tricks here:

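from numpy.lib.stride_tricks import as_strided

# Feature columns only, as a 2-D array of shape (n_rows, n_features)
v = data.iloc[:, :2].values

# The window axis advances one row at a time, so reuse the row stride
# (v.strides[0]) rather than hard-coding 8 bytes.
X = as_strided(
    v, shape=(v.shape[0] - 2, 2, v.shape[1]), strides=(v.strides[0], ) + v.strides
)
# Targets start at row 2, the first row with two preceding rows
y = data.iloc[2:, -1].values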

X 
array([[[ 1,  7],
        [ 2,  8]],

       [[ 2,  8],
        [ 3,  9]],

       [[ 3,  9],
        [ 4, 10]]])

y
array([14, 15, 16])
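
If your NumPy is 1.20 or newer, sliding_window_view may be a safer way to get the same windows, because it works out the strides itself and returns a read-only view. The following is only a sketch under that version assumption; the names X_raw, X_win and y_win are mine, not from the question.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view  # requires NumPy >= 1.20

# Rebuild the example dataset from the question
X_raw = np.array([[1, 2, 3, 4, 5], [7, 8, 9, 10, 11], [12, 13, 14, 15, 16]]).T
data = pd.DataFrame(X_raw, columns=['feature1', 'feature2', 'outcome'])

timesteps = 2
v = data[['feature1', 'feature2']].to_numpy()

# Slide a window of `timesteps` rows along axis 0. The window axis is
# appended last, so the result has shape (n - timesteps + 1, features, timesteps).
windows = sliding_window_view(v, window_shape=timesteps, axis=0)

# Reorder to (batch, timesteps, features) and drop the final window,
# which has no following row to serve as its target.
X_win = windows.transpose(0, 2, 1)[:-1]
y_win = data['outcome'].to_numpy()[timesteps:]

print(X_win.shape)  # (3, 2, 2)
print(y_win)        # [14 15 16]
```

Because the result is a view, call X_win.copy() before passing it to code that might write to it.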
cs95
  • Thanks. Can you tell me what exactly the `strides=(8,)` is for? – imarevic Mar 20 '18 at 14:54
  • @Ivan It's a little difficult to explain, so I'd recommend reading the discussion here: https://stackoverflow.com/questions/47483579/numpy-stride-tricks-returns-junk-values?noredirect=1&lq=1 – cs95 Mar 20 '18 at 14:55
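
On the strides question, here is a rough sketch of what that first number means. A stride is the number of bytes to skip in memory to move one step along an axis, and the first entry passed to as_strided is the step from the start of one window to the start of the next, i.e. one row. The array below is my own stand-in for the feature columns, with an explicit int64 dtype; that a literal 8 only fits when one row step costs 8 bytes (for example int64 data stored column-major) is my inference, not something stated in the answer.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

# Stand-in for the two feature columns (explicit int64 so itemsize is 8 bytes)
v = np.array([[1, 7], [2, 8], [3, 9], [4, 10], [5, 11]], dtype=np.int64)

# Bytes to skip to move one row down / one column right
row_step, col_step = v.strides
print(v.strides)  # (16, 8) for this row-major (C-contiguous) array

# The same values stored column-major have a different byte layout
v_f = np.asfortranarray(v)
print(v_f.strides)  # (8, 40): here one row step really is 8 bytes

# Window axis: advance one row per window. Inside a window: the usual
# row and column steps. Taking the strides from the array keeps this
# correct for either memory layout.
windows_c = as_strided(v, shape=(3, 2, 2), strides=(row_step, row_step, col_step))
windows_f = as_strided(v_f, shape=(3, 2, 2), strides=(v_f.strides[0],) + v_f.strides)

print(np.array_equal(windows_c, windows_f))  # True
```

Passing a stride that does not match the actual memory layout makes as_strided read from the wrong offsets, which is the junk-values problem discussed in the linked question.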