I have a multivariate time series with 5 features (5 independent time series). With these 5 features, I want to predict a target y, which is also a time series. I have ~50,000 observations in the data.
data  # pandas DataFrame
>>>
             x1     x2     x3     x4     x5      y
time
18:20:00  0.462 -0.248 -0.873  0.892  0.012  0.938
18:21:00  0.621 -0.399 -0.772  0.891  0.008  0.922
18:22:00  0.726 -0.401 -0.771  0.899  0.009  0.910
...  # 50,000 rows
I have seen multiple sources, blogs, and papers that all use slightly different setups and architectures when constructing an LSTM [1][2][3][4].
I understand that the input has to be of shape [n_samples, n_timesteps, n_features]. I believe two of the three dimensions are already determined by my data: n_samples=50000 and n_features=5, since I have 50k samples and 5 features. The dimension n_timesteps is the confusing one.
If I reshape my data so that each "sample" input into the LSTM has overlapping time observations, as such:
import numpy as np

def reshape_data(data, n_steps):
    # window i holds rows i .. i + n_steps - 1 of the input array
    out = np.empty((data.shape[0] - n_steps + 1, n_steps, data.shape[1]))
    for i in range(data.shape[0] - n_steps + 1):
        out[i] = data[i: i + n_steps, :]
    return out

n_step = 10
reshaped_data = reshape_data(data[["x1", "x2", "x3", "x4", "x5"]].values, n_step)
reshaped_data.shape
>>> (49991, 10, 5)
# the target also needs to be truncated to match the shape and
# time index of `reshaped_data`: each window's label is the value
# of y at the window's last timestep
target = data["y"].values
target = target[n_step - 1:]
target.shape
>>> (49991,)
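(As an aside, the same windowing can be done without a Python loop. A minimal sketch, assuming NumPy >= 1.20 for sliding_window_view; the transpose is needed because the window axis comes last:)
# sliding_window_view yields shape (49991, 5, 10) with the window axis last,
# so swap it into the middle to get [n_samples, n_timesteps, n_features]
windows = np.lib.stride_tricks.sliding_window_view(
    data[["x1", "x2", "x3", "x4", "x5"]].values, n_step, axis=0
)
reshaped_data_fast = windows.transpose(0, 2, 1)
reshaped_data_fast.shape
>>> (49991, 10, 5)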
The data is now essentially a "3D tensor", or an array of matrices where each matrix contains 10 observations in time. Matrix one contains observations from time t0 to t9 inclusive, matrix two contains observations from t1 to t10 inclusive, and so on, so each sample shares 9 overlapping observations with the previous one. Each of these matrices is one sample given to the LSTM, and there are n_samples of them (49991 in our example). Now [n_samples, n_timesteps, n_features] = (49991, 10, 5).
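A quick sanity check of that overlap (just comparing the first two windows against the corresponding raw rows):
X = data[["x1", "x2", "x3", "x4", "x5"]].values
assert np.array_equal(reshaped_data[0], X[0:10])   # rows t0..t9
assert np.array_equal(reshaped_data[1], X[1:11])   # rows t1..t10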
I can now input the above data into an LSTM:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

model = Sequential()
model.add(LSTM(50, return_sequences=True, input_shape=(reshaped_data.shape[1], reshaped_data.shape[2])))
model.add(Dropout(0.4))
model.add(LSTM(50, return_sequences=False))  # only the final timestep's output feeds the Dense layer
model.add(Dropout(0.4))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
model.fit(reshaped_data, target, epochs=50, batch_size=32, verbose=1, shuffle=False)
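(Since shuffle=False is set to preserve time order, any held-out evaluation should be chronological too. A minimal sketch replacing the fit call above; the 80/20 split point is an arbitrary choice:)
# chronological hold-out: train on the first 80% of windows, validate on the rest
split = int(len(reshaped_data) * 0.8)
X_train, X_val = reshaped_data[:split], reshaped_data[split:]
y_train, y_val = target[:split], target[split:]
model.fit(X_train, y_train, epochs=50, batch_size=32, verbose=1,
          shuffle=False, validation_data=(X_val, y_val))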
Alternatively, I could just take the original data and shape it as follows, where there are no overlapping observations in time between samples:
reshaped_data = data[["x1", "x2", "x3", "x4", "x5"]].values
reshaped_data = reshaped_data.reshape((reshaped_data.shape[0], 1, reshaped_data.shape[1]))
reshaped_data.shape
>>> (50000, 1, 5)
Now I can also give this data to an LSTM. In this case, [n_samples, n_timesteps, n_features] = (50000, 1, 5), so each sample only has one observation in time.
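(One variant I have seen for the n_timesteps=1 case is a stateful LSTM, which carries the hidden state across batches instead of resetting it after every sample. A rough sketch of what that might look like; the layer size is arbitrary, batch_size=1 is assumed, and this is not from the sources above:)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# stateful=True carries the hidden state from one batch to the next;
# batch_input_shape pins the batch size, here one sample per batch
stateful_model = Sequential()
stateful_model.add(LSTM(50, stateful=True, batch_input_shape=(1, 1, 5)))
stateful_model.add(Dense(1))
stateful_model.compile(optimizer='adam', loss='mse')

y_full = data["y"].values  # full (50000,) target, no truncation needed here
for epoch in range(5):  # one pass per loop so the state can be reset by hand
    stateful_model.fit(reshaped_data, y_full, epochs=1,
                       batch_size=1, verbose=1, shuffle=False)
    stateful_model.reset_states()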
Here is my confusion:
- Of the two methods of reshaping the data above, which one is correct? Which one should lead to better results (higher accuracy)? What is the difference when training an LSTM with either reshaping method?
- How does batch_size affect training with either data reshaping method above?