I have a multivariate time series with 5 features (5 independent time series). With these 5 features, I want to predict a target y, which is also a time series. I have ~50,000 observations in the data.
data  # pandas DataFrame
>>>
             x1     x2     x3     x4     x5      y
time
18:20:00  0.462 -0.248 -0.873  0.892  0.012  0.938
18:21:00  0.621 -0.399 -0.772  0.891  0.008  0.922
18:22:00  0.726 -0.401 -0.771  0.899  0.009  0.910
...  # 50,000 rows
I have seen multiple sources, blogs, and papers that all use slightly different setups and architectures when constructing an LSTM [1][2][3][4].
I understand that the input has to be of shape [n_samples, n_timesteps, n_features]. I believe two of the three dimensions are already determined by my data: n_samples=50000 and n_features=5, since I have 50k samples and 5 features. The dimension n_timesteps is the confusing one.
If I reshape my data so that each "sample" input into the LSTM has overlapping time observations, as such:
import numpy as np

def reshape_data(data, n_steps):
    # window i holds rows i .. i + n_steps - 1 of the input array
    out = np.empty((data.shape[0] - n_steps + 1, n_steps, data.shape[1]))
    for i in range(data.shape[0] - n_steps + 1):
        out[i] = data[i: i + n_steps, :]
    return out

n_step = 10
reshaped_data = reshape_data(data[["x1", "x2", "x3", "x4", "x5"]].values, n_step)
reshaped_data.shape
>>> (49991, 10, 5)
# the target also needs to be truncated to match the shape and
# time index of `reshaped_data`: each window's label is the value
# of y at the window's last timestep
target = data["y"].values
target = target[n_step - 1:]
target.shape
>>> (49991,)
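(As an aside, the same windowing can be done without a Python loop. A minimal sketch, assuming NumPy >= 1.20 for sliding_window_view; the transpose is needed because the window axis comes last:)
# sliding_window_view yields shape (49991, 5, 10) with the window axis last,
# so swap it into the middle to get [n_samples, n_timesteps, n_features]
windows = np.lib.stride_tricks.sliding_window_view(
    data[["x1", "x2", "x3", "x4", "x5"]].values, n_step, axis=0
)
reshaped_data_fast = windows.transpose(0, 2, 1)
reshaped_data_fast.shape
>>> (49991, 10, 5)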
The data is now essentially a "3D tensor", or an array of matrices where each matrix contains 10 observations in time. Matrix one contains observations from time t0 to t9 inclusive, matrix two contains observations from t1 to t10 inclusive, and so on, so each sample shares 9 overlapping observations with the previous one. Each of these matrices is one sample given to the LSTM, and there are n_samples of them (49991 in our example). Now [n_samples, n_timesteps, n_features] = (49991, 10, 5).
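A quick sanity check of that overlap (just comparing the first two windows against the corresponding raw rows):
X = data[["x1", "x2", "x3", "x4", "x5"]].values
assert np.array_equal(reshaped_data[0], X[0:10])   # rows t0..t9
assert np.array_equal(reshaped_data[1], X[1:11])   # rows t1..t10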
I can now input the above data into an LSTM:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

model = Sequential()
model.add(LSTM(50, return_sequences=True, input_shape=(reshaped_data.shape[1], reshaped_data.shape[2])))
model.add(Dropout(0.4))
model.add(LSTM(50, return_sequences=False))  # only the final timestep's output feeds the Dense layer
model.add(Dropout(0.4))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
model.fit(reshaped_data, target, epochs=50, batch_size=32, verbose=1, shuffle=False)
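(Since shuffle=False is set to preserve time order, any held-out evaluation should be chronological too. A minimal sketch replacing the fit call above; the 80/20 split point is an arbitrary choice:)
# chronological hold-out: train on the first 80% of windows, validate on the rest
split = int(len(reshaped_data) * 0.8)
X_train, X_val = reshaped_data[:split], reshaped_data[split:]
y_train, y_val = target[:split], target[split:]
model.fit(X_train, y_train, epochs=50, batch_size=32, verbose=1,
          shuffle=False, validation_data=(X_val, y_val))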
Alternatively, I could just take the original data and shape it as follows, where there are no overlapping observations in time between samples:
reshaped_data = data[["x1", "x2", "x3", "x4", "x5"]].values
reshaped_data = reshaped_data.reshape((reshaped_data.shape[0], 1, reshaped_data.shape[1]))
reshaped_data.shape
>>> (50000, 1, 5)
Now I can also give this data to an LSTM. In this case, [n_samples, n_timesteps, n_features] = (50000, 1, 5), so each sample only has one observation in time.
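(One variant I have seen for the n_timesteps=1 case is a stateful LSTM, which carries the hidden state across batches instead of resetting it after every sample. A rough sketch of what that might look like; the layer size is arbitrary, batch_size=1 is assumed, and this is not from the sources above:)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# stateful=True carries the hidden state from one batch to the next;
# batch_input_shape pins the batch size, here one sample per batch
stateful_model = Sequential()
stateful_model.add(LSTM(50, stateful=True, batch_input_shape=(1, 1, 5)))
stateful_model.add(Dense(1))
stateful_model.compile(optimizer='adam', loss='mse')

y_full = data["y"].values  # full (50000,) target, no truncation needed here
for epoch in range(5):  # one pass per loop so the state can be reset by hand
    stateful_model.fit(reshaped_data, y_full, epochs=1,
                       batch_size=1, verbose=1, shuffle=False)
    stateful_model.reset_states()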
Here is my confusion:
- Of the two methods of reshaping the data above, which one is correct? Which one should lead to better results (higher accuracy)? What is the difference when training an LSTM with either reshaping method?
- How does batch_size affect training with either data reshaping method above?