
I would like to use an RNN for time series prediction, using the previous 96 time steps to predict the next 96 time steps. For this I have the following code:

#Import modules
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.preprocessing import StandardScaler
from tensorflow import keras

# Define the parameters of the RNN and the training
epochs = 1
batch_size = 50
steps_backwards = 96
steps_forward = 96
split_fraction_trainingData = 0.70
split_fraction_validatinData = 0.90
randomSeedNumber = 50
helpValueStrides =  int(steps_backwards /steps_forward)

#Read dataset
df = pd.read_csv('C:/Users1/Desktop/TestValues.csv', sep=';', header=0, low_memory=False, infer_datetime_format=True, parse_dates={'datetime':[0]}, index_col=['datetime'])

# standardize data

data = df.values
indexWithYLabelsInData = 0
data_X = data[:, 0:3]
data_Y = data[:, indexWithYLabelsInData].reshape(-1, 1)


scaler_standardized_X = StandardScaler()
data_X = scaler_standardized_X.fit_transform(data_X)
data_X = pd.DataFrame(data_X)
scaler_standardized_Y = StandardScaler()
data_Y = scaler_standardized_Y.fit_transform(data_Y)
data_Y = pd.DataFrame(data_Y)


# Prepare the input data for the RNN

series_reshaped_X =  np.array([data_X[i:i + (steps_backwards+steps_forward)].copy() for i in range(len(data) - (steps_backwards+steps_forward))])
series_reshaped_Y =  np.array([data_Y[i:i + (steps_backwards+steps_forward)].copy() for i in range(len(data) - (steps_backwards+steps_forward))])


timeslot_x_train_end = int(len(series_reshaped_X)* split_fraction_trainingData)
timeslot_x_valid_end = int(len(series_reshaped_X)* split_fraction_validatinData)

X_train = series_reshaped_X[:timeslot_x_train_end, :steps_backwards] 
X_valid = series_reshaped_X[timeslot_x_train_end:timeslot_x_valid_end, :steps_backwards] 
X_test = series_reshaped_X[timeslot_x_valid_end:, :steps_backwards] 

   
Y_train = series_reshaped_Y[:timeslot_x_train_end, steps_backwards:] 
Y_valid = series_reshaped_Y[timeslot_x_train_end:timeslot_x_valid_end, steps_backwards:] 
Y_test = series_reshaped_Y[timeslot_x_valid_end:, steps_backwards:]                                
   
   
# Build the model and train it

np.random.seed(randomSeedNumber)
tf.random.set_seed(randomSeedNumber)

model = keras.models.Sequential([
keras.layers.SimpleRNN(10, return_sequences=True, input_shape=[None, 3]),
keras.layers.SimpleRNN(10, return_sequences=True),
keras.layers.Conv1D(16, helpValueStrides, strides=helpValueStrides), 
keras.layers.TimeDistributed(keras.layers.Dense(1))
])

model.compile(loss="mean_squared_error", optimizer="adam", metrics=['mean_absolute_percentage_error'])
history = model.fit(X_train, Y_train, epochs=epochs, batch_size=batch_size, validation_data=(X_valid, Y_valid))

#Predict the test data
Y_pred = model.predict(X_test)

prediction_lastValues_list=[]

for i in range (0, len(Y_pred)):
  prediction_lastValues_list.append((Y_pred[i][0][1 - 1]))

# Create the dataframe for the whole data
wholeDataFrameWithPrediciton = pd.DataFrame((X_test[:,1]))
wholeDataFrameWithPrediciton.rename(columns = {indexWithYLabelsInData:'actual'}, inplace = True)
wholeDataFrameWithPrediciton.rename(columns = {1:'Feature 1'}, inplace = True)
wholeDataFrameWithPrediciton.rename(columns = {2:'Feature 2'}, inplace = True)
wholeDataFrameWithPrediciton['predictions'] = prediction_lastValues_list
wholeDataFrameWithPrediciton['difference'] = (wholeDataFrameWithPrediciton['predictions'] - wholeDataFrameWithPrediciton['actual']).abs()
wholeDataFrameWithPrediciton['difference_percentage'] = ((wholeDataFrameWithPrediciton['difference'])/(wholeDataFrameWithPrediciton['actual']))*100


# Inverse the scaling (traInv: transformation inversed)

data_X_traInv = scaler_standardized_X.inverse_transform(data_X)
data_Y_traInv = scaler_standardized_Y.inverse_transform(data_Y)
series_reshaped_X_notTransformed =  np.array([data_X_traInv[i:i + (steps_backwards+steps_forward)].copy() for i in range(len(data) - (steps_backwards+steps_forward))])
X_test_notTranformed = series_reshaped_X_notTransformed[timeslot_x_valid_end:, :steps_backwards] 
predictions_traInv = scaler_standardized_Y.inverse_transform(wholeDataFrameWithPrediciton['predictions'].values.reshape(-1, 1))

edictions_traInv = wholeDataFrameWithPrediciton['predictions'].values.reshape(-1, 1)

# Create the dataframe for the inverse-transformed data
wholeDataFrameWithPrediciton_traInv = pd.DataFrame((X_test_notTranformed[:,0]))
wholeDataFrameWithPrediciton_traInv.rename(columns = {indexWithYLabelsInData:'actual'}, inplace = True)
wholeDataFrameWithPrediciton_traInv.rename(columns = {1:'Feature 1'}, inplace = True)
wholeDataFrameWithPrediciton_traInv['predictions'] = predictions_traInv
wholeDataFrameWithPrediciton_traInv['difference_absolute'] = (wholeDataFrameWithPrediciton_traInv['predictions'] - wholeDataFrameWithPrediciton_traInv['actual']).abs()
wholeDataFrameWithPrediciton_traInv['difference_percentage'] = ((wholeDataFrameWithPrediciton_traInv['difference_absolute'])/(wholeDataFrameWithPrediciton_traInv['actual']))*100
wholeDataFrameWithPrediciton_traInv['difference'] = (wholeDataFrameWithPrediciton_traInv['predictions'] - wholeDataFrameWithPrediciton_traInv['actual'])

Here is some test data (ignore the actual values, as I made them up; only the shape matters): Download test data

How can the Y_pred output be interpreted? Which of its values are the predictions 96 steps into the future? I have attached a screenshot of the Y_pred data, once with 5 output neurons in the last layer and once with only 1. Can anyone tell me what exactly the RNN is predicting? I can set any number of neurons in the output (last) layer of the RNN model, and Y_pred always has the shape (batch size of X_test, time sequence, number of output neurons). My question is about the last dimension. I thought it might hold the features, but that cannot be true in my case, as I only have 1 output feature (you can see that in the shape of Y_train, Y_test and Y_valid).

[Screenshot: Y_pred output with 5 output neurons and with 1 output neuron]

**Reminder**: The bounty is expiring soon and unfortunately I still have not received an answer. So I would like to remind you of the question and the bounty. I'd highly appreciate every comment.

PeterBe
  • Could you show your model and how you train it? – The Guy with The Hat Nov 25 '21 at 10:48
  • @TheGuywithTheHat: Thanks Guy with the Hat for your comment. What exactly do you mean by "model" and "how you train it"? I have posted my code above. This is the entire code and an (almost) minimal reproducible example. I also included test data. So as far as I see it, the model is defined in the posted code beginning at the line `model = keras.models.Sequential([...` and also the command for training is defined in the lines `model.compile(loss="mean_squared_error", optimizer="adam", metrics=['mean_absolute_percentage_error'])` and `history = model.fit(X_train, Y_train, ...` – PeterBe Nov 26 '21 at 07:37
  • My bad, I somehow missed the scrollbar on that code block. – The Guy with The Hat Nov 26 '21 at 10:53
  • @TheGuywithTheHat: No problem. Could you have a look into my code and maybe tell me something about my question? I'll highly appreciate every comment. – PeterBe Nov 26 '21 at 13:32
  • I might be able to take a more thorough look later, but is it possible that you should have `:-steps_backwards` instead of `:steps_backwards` when making `X_train`, `X_valid`, and `X_test`? – The Guy with The Hat Nov 27 '21 at 02:48
  • @TheGuywithTheHat: Thanks guy with the hat for your comment and sorry for my late reply (I was quite busy during the weekend with other things). I tried what you suggested and the outcome is exactly the same. So there is no difference if I use `:-steps_backwards` instead of `:steps_backwards`. Anyway, the more important question (and the reason why I asked this question and will award a bounty) is what the RNN is really predicting. How can I interpret the outcome of the RNN? Would you mind giving me further insights on that topic? – PeterBe Nov 29 '21 at 08:04

2 Answers


It may be useful to step through the model inputs/outputs in detail.

When using the keras.layers.SimpleRNN layer with return_sequences=True, the layer returns a 3-D tensor where the 0th axis is the batch size, the 1st axis is the timestep, and the 2nd axis is the number of hidden units (10 for both SimpleRNN layers in your model).

The Conv1D layer produces an output tensor whose last dimension is the number of filters (16 in your model), since each filter is convolved with the input along the time axis.

With keras.layers.TimeDistributed, the supplied layer (Dense(1) in the example provided) is applied to each timestep of each sample independently. So with 96 timesteps, we have 96 outputs for each record in the batch.

So stepping through your model:

model = keras.models.Sequential([
    keras.layers.SimpleRNN(10, return_sequences=True, input_shape=[None, 3]), # output size is (BATCH_SIZE, NUMBER_OF_TIMESTEPS, 10)
    keras.layers.SimpleRNN(10, return_sequences=True), # output size is (BATCH_SIZE, NUMBER_OF_TIMESTEPS, 10)
    keras.layers.Conv1D(16, helpValueStrides, strides=helpValueStrides), # output size is (BATCH_SIZE, NUMBER_OF_TIMESTEPS, 16)
    keras.layers.TimeDistributed(keras.layers.Dense(1)) # output size is (BATCH_SIZE, NUMBER_OF_TIMESTEPS, 1)
])
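The length of the time axis after the Conv1D layer can be checked by hand. Below is a minimal sketch, assuming Keras' default `padding="valid"` for `Conv1D`. The helper name `conv1d_output_length` and the 192-step variation are made up for illustration and are not part of Keras or of the question's code.

```python
def conv1d_output_length(input_length, kernel_size, strides):
    """Length of the time axis after a Conv1D with 'valid' padding (assumed default)."""
    return (input_length - kernel_size) // strides + 1


steps_backwards = 96
steps_forward = 96
help_value_strides = steps_backwards // steps_forward  # 1, as in the question

# With kernel size 1 and stride 1 the time axis keeps its length.
same_length = conv1d_output_length(steps_backwards, help_value_strides, help_value_strides)

# Hypothetical variation: 192 input steps give kernel size 2 and stride 2,
# which shrinks the time axis back to 96 output steps.
halved_length = conv1d_output_length(192, 192 // 96, 192 // 96)

print(same_length, halved_length)
```

With these settings the Conv1D layer is what keeps the number of output timesteps equal to steps_forward, so that the output lines up with the label tensor.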

To answer your question, the output tensor from your model contains the predicted values for 96 steps into the future, for each sample. If it's easier to conceptualize, for the case of 1 output, you can apply np.squeeze to the result of model.predict, which will make the output 2-D:

Y_pred = model.predict(X_test) # output size is (BATCH_SIZE, NUMBER_OF_TIMESTEPS, 1)
Y_pred_squeezed = np.squeeze(Y_pred) # output size is (BATCH_SIZE, NUMBER_OF_TIMESTEPS)

In that way, you have a rectangular matrix where each row corresponds to a sample in the batch, and each column i corresponds to the prediction for the timestep i.
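To make the shapes concrete, here is a small NumPy-only sketch. The array is a random stand-in for the result of `model.predict(X_test)` (4 samples is an arbitrary choice), so the values themselves mean nothing.

```python
import numpy as np

# Stand-in for model.predict(X_test): 4 samples, 96 timesteps, 1 output neuron.
# The values are random placeholders, not real predictions.
rng = np.random.default_rng(0)
Y_pred = rng.normal(size=(4, 96, 1))

# Drop only the trailing size-1 axis.
Y_pred_squeezed = np.squeeze(Y_pred, axis=-1)

print(Y_pred.shape)           # (4, 96, 1)
print(Y_pred_squeezed.shape)  # (4, 96)

# Row 2, column 10 is the prediction for sample 2 at forecast step 10.
same_value = Y_pred_squeezed[2, 10] == Y_pred[2, 10, 0]
```

Passing `axis=-1` is a small precaution: a bare `np.squeeze` would also remove the batch axis if X_test happened to contain a single sample.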

In the loop after the prediction step, all the timestep predictions are being discarded except for the first one:

for i in range(0, len(Y_pred)):
    prediction_lastValues_list.append((Y_pred[i][0][1 - 1]))

which means the end result is just a list of predictions for the first timestep for each sample in the batch. If you wanted the prediction for the 96th timestep, you could do:

for i in range(0, len(Y_pred)):
    prediction_lastValues_list.append((Y_pred[i][-1][1 - 1]))

Notice the -1 instead of 0 for the second bracket, to ensure we grab the last predicted timestep instead of the first.
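The same selection can also be written as a slice instead of a loop. This is only a sketch on placeholder data (an `np.arange` array standing in for the real predictions), but the indexing is the same for the real Y_pred.

```python
import numpy as np

# Placeholder predictions with shape (samples, timesteps, 1).
Y_pred = np.arange(3 * 96, dtype=float).reshape(3, 96, 1)

# Loop version from above: last timestep of every sample.
last_step_loop = []
for i in range(len(Y_pred)):
    last_step_loop.append(Y_pred[i][-1][0])

# Equivalent slices over the whole batch.
first_step = Y_pred[:, 0, 0]   # first predicted timestep, per sample
last_step = Y_pred[:, -1, 0]   # 96th predicted timestep, per sample

print(first_step.tolist())  # [0.0, 96.0, 192.0]
print(last_step.tolist())   # [95.0, 191.0, 287.0]
```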

As a side note, to replicate the results, I had to make one change to your code, specifically when creating series_reshaped_X and series_reshaped_Y. I hit an exception when using np.array to create the array from the list (ValueError: cannot copy sequence with size 192 to array axis with dimension 3). Looking at what you were doing (joining tensors along a new axis), I changed it to np.stack, which accomplishes the same goal (https://numpy.org/doc/stable/reference/generated/numpy.stack.html):

series_reshaped_X = np.stack([data_X[i:i + (steps_backwards + steps_forward)].copy() for i in
                              range(len(data) - (steps_backwards + steps_forward))])
series_reshaped_Y = np.stack([data_Y[i:i + (steps_backwards + steps_forward)].copy() for i in
                              range(len(data) - (steps_backwards + steps_forward))])
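For reference, this is what the windowing does on a toy array. The sizes (10 rows, window of 4) are made up to keep the output readable, and a plain NumPy array is used instead of the DataFrame from the question.

```python
import numpy as np

# Toy data: 10 rows, 3 feature columns (placeholder values).
data_X = np.arange(10 * 3, dtype=float).reshape(10, 3)
window = 4  # stands in for steps_backwards + steps_forward

# One window per start index; np.stack joins them along a new leading axis.
windows = np.stack([data_X[i:i + window] for i in range(len(data_X) - window)])

print(windows.shape)  # (6, 4, 3)
```

The result has the layout (number of windows, window length, number of features). As in the question's code, the range stops one start index before the last possible full window.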

Update

"What are those 5 values representing when I only have 1 target feature?"

That's actually just the broadcasting feature of the TensorFlow API (which is also a feature of NumPy). If you perform an arithmetic operation on two tensors with differing shapes, it will try to make them compatible. If you change the output layer size to 5 instead of 1 (keras.layers.Dense(5)), the output size is (BATCH_SIZE, NUMBER_OF_TIMESTEPS, 5) instead of (BATCH_SIZE, NUMBER_OF_TIMESTEPS, 1), which just means the output from the convolutional layer goes into 5 neurons instead of 1. When the loss (mean squared error) is computed between the two, the label tensor of size (BATCH_SIZE, NUMBER_OF_TIMESTEPS, 1) is broadcast to the size of the prediction tensor, (BATCH_SIZE, NUMBER_OF_TIMESTEPS, 5), by replicating the column. For example, if Y_train had [-1.69862224] in the first row for the first timestep, and Y_pred had [-0.6132075 , -0.6621697 , -0.7712653 , -0.60011995, -0.48753992] in the first row for the first timestep, then to perform the subtraction the entry in Y_train is expanded to [-1.69862224, -1.69862224, -1.69862224, -1.69862224, -1.69862224].
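The broadcasting step can be reproduced with the example numbers in plain NumPy. This is a stand-in for what happens inside the loss, not the Keras implementation itself.

```python
import numpy as np

# Values from the example: one label, five output neurons.
y_true = np.array([[[-1.69862224]]])                      # shape (1, 1, 1)
y_pred = np.array([[[-0.6132075, -0.6621697, -0.7712653,
                     -0.60011995, -0.48753992]]])         # shape (1, 1, 5)

# Subtraction broadcasts the size-1 last axis of y_true across all 5 outputs.
diff = y_pred - y_true

# Same result when the label is replicated explicitly.
explicit = y_pred - np.repeat(y_true, 5, axis=-1)

print(diff.shape)                   # (1, 1, 5)
print(np.allclose(diff, explicit))  # True

# Mean squared error over every element, as a plain NumPy stand-in for the loss.
mse = np.mean(diff ** 2)
```

Every one of the 5 outputs is penalized against the same label, which is why none of them is singled out as "the" prediction.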

"And which of those 5 values is the "correct" value to choose for the 96 time step ahead prediction?"

There is no real "correct" value when trained this way; as detailed above, this is just a feature of the API. All outputs should converge to the single target value for the timestep, since they are all being compared to that value. So you could technically train that way, but it just adds parameters and complexity to the model (and you would have to choose one output to be the "real" prediction). The correct approach for getting the prediction for 96 timesteps ahead is detailed in the original answer. To reiterate: the output of the model contains future timestep predictions for each sample in the batch, and the output tensor can be iterated over to retrieve the predictions for each timestep, for each sample. Furthermore, ensure the number of neurons in the final dense layer matches the number of target values you are trying to predict; otherwise you'll hit the broadcasting issue (and the "correct" output will be unclear).

Just to be exhaustive (and I am not recommending this), if you really wanted to incorporate several neurons in the output despite only having one target value, you could do something like averaging the results:

for i in range(0, len(Y_pred)):
    prediction_lastValues_list.append(np.mean(Y_pred[i][0]))

But there is absolutely no benefit to this approach, so I would recommend just sticking with the previous suggestion.

Update 2

"Is my model only predicting one time slot which is 96 time steps into the future or is it also predicting everything in between?"

The model is predicting everything in between. So for a sample at timestep t, the outputs of the model are the predictions for [t + 1, t + 2, ..., t + NUMBER_OF_TIMESTEPS]. Per my original answer, "the output tensor from your model contains the predicted values for 96 steps into the future, for each sample". To specify that in your evaluation code, you can do something like:

Y_pred = np.squeeze(Y_pred)
predictions_for_all_samples_and_timesteps = Y_pred.tolist()

This results in a list of length BATCH_SIZE, and each element in the list is a list of length NUMBER_OF_TIMESTEPS (to be clear, predictions_for_all_samples_and_timesteps is a list of lists). The element at index i in predictions_for_all_samples_and_timesteps contains the predictions for each timestep from 1-96 for the i^th sample (row) in X_test.

As a side note, you could omit np.squeeze, but then you will have a list of lists of lists, where each element in the inner list is a list of one item (instead of [[1, 2, 3, ...], ], the output would look like [[[1], [2], [3], ...], ]).

Update 3

Y_test and Y_pred are both 3-D numpy arrays of size (BATCH_SIZE, NUMBER_OF_TIMESTEPS, 1). To compare them, you can take the absolute (or squared) difference between the two:

abs_diff = np.abs(Y_pred - Y_test)

This results in an array of the same dimensions, (BATCH_SIZE, NUMBER_OF_TIMESTEPS, 1). You can then iterate over the rows and generate a plot of the timestep error for each row.

import matplotlib.pyplot as plt

for diff in abs_diff:
    print(diff.shape)
    plt.plot(range(len(diff)), diff)  # x axis: timestep index, y axis: absolute error

[Image: per-timestep absolute error plotted for every row]

It may get a bit unwieldy with a large batch size (as you can see in the image), so you may want to plot only a subset of the rows. You can also transform the absolute difference to an error percentage if you would prefer to plot that:

percentage_diff = abs_diff / Y_test * 100

which would be the absolute difference over the actual value, as I see you were originally doing in Pandas. This numpy array will have the same dimensions, so you can iterate over it and generate plots in the same fashion.
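If a single error figure is preferred over plots, the same arrays can be reduced to an RMSE. The sketch below uses random placeholder arrays with the shapes of Y_test and Y_pred. The variable names and the 5-sample batch are assumptions for illustration, not part of the original code.

```python
import numpy as np

# Placeholder arrays with the shapes of Y_test and model.predict(X_test).
rng = np.random.default_rng(42)
Y_test = rng.normal(size=(5, 96, 1))
Y_pred = Y_test + rng.normal(scale=0.1, size=(5, 96, 1))

squared_error = (Y_pred - Y_test) ** 2

# One number over all samples and all timesteps.
rmse_overall = np.sqrt(squared_error.mean())
# One number per sample (averaged over its 96 timesteps), shape (5,).
rmse_per_sample = np.sqrt(squared_error.mean(axis=(1, 2)))
# One number per forecast step (averaged over samples), shape (96,).
rmse_per_step = np.sqrt(squared_error.mean(axis=(0, 2)))

print(rmse_overall, rmse_per_sample.shape, rmse_per_step.shape)
```

These errors are in standardized units as long as the inputs are scaled. To report them in the original units, both arrays would first need to go through the inverse transform of the label scaler from the question, for example `scaler_standardized_Y.inverse_transform(Y_pred.reshape(-1, 1)).reshape(Y_pred.shape)`.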

For future inquiries, instead of posting the comments, please open a new question and just provide the link - I would be happy to continue helping, but I would like to continue gaining reputation from it.

danielcahall
  • Thanks a lot danielcahall. I have 2 questions. 1) While I strongly appreciate your advice with the -1 in the line `prediction_lastValues_list.append((Y_pred[i][-1][1 - 1]))`, my main question has still not been answered. Maybe you can have a look at the screenshot that I posted. When I use `keras.layers.TimeDistributed(keras.layers.Dense(5))` I get 5 values for every item in the batch and every item in the time sequence. What are those 5 values representing when I only have 1 target feature? And which of those 5 values is the "correct" value to choose for the 96 time step ahead prediction? – PeterBe Nov 30 '21 at 17:20
  • 2) Regarding your side note, I do not really understand what the problem is. For me this problem has not occurred. So why is it occurring when you run it? Maybe we use different versions of numpy? – PeterBe Nov 30 '21 at 17:20
  • Thanks for your great update. I really appreciate your effort. Maybe one last question (before I will accept your answer). Is my model only predicting one time slot which is 96 time steps into the future or is it also predicting everything in between? Meaning for example if the prediction is at 00:00 p.m., is the model only predicting 00:00 p.m. of the next day or also every time slot in between meaning {00:15, 00:30, ...,23:45, 00:00 (next day)}. Because that is actually what I want. And how can I specify that here `prediction_lastValues_list.append((Y_pred[i][-1][1 - 1]))` – PeterBe Dec 01 '21 at 10:32
  • Thanks for your great update and tremendous effort. I upvoted and accepted your answer and awarded the bounty to you. However, I still have some follow-up questions (but I could understand if you don't want to answer them anymore). 1) I don't understand what `Y_pred = np.squeeze(Y_pred)` actually does. And does Y_pred now contain all the predictions for every timeslot (and without the squeeze it does not)? – PeterBe Dec 01 '21 at 17:06
  • 2) I would like to compare the predictions to the actual values. For that I use in the code the 2 columns of the dataframe `wholeDataFrameWithPrediciton_traInv.rename(columns = {indexWithYLabelsInData:'actual'}, inplace = True)` and `wholeDataFrameWithPrediciton_traInv['predictions'] = predictions_traInv`. I then calculate e.g. the RMSE between the 2 columns. Does this also consider all the predictions in between, or only the predicted time slot that is 96 steps away from the current time slot? – PeterBe Dec 01 '21 at 17:06
  • Actually what I want is to evaluate the predictions for all items in `BATCH_SIZE`. So each item in `BATCH_SIZE` should have a prediction of `NUMBER_OF_TIMESTEPS` and these predictions should be compared to the actual values. And then the RMSE or the percentage deviation should be calculated for the whole set of predictions. – PeterBe Dec 01 '21 at 17:09
  • Thank you for the bounty reward! `Y_pred` contains the prediction values for every timestep regardless - `np.squeeze` is just for convenience to remove an axis in the tensor (which translates to the list of list of lists -> list of lists I mentioned). The output size of `Y_pred` is `(BATCH_SIZE, NUMBER_OF_TIMESTEPS, 1)` initially, and with `np.squeeze`, it becomes `(BATCH_SIZE, NUMBER_OF_TIMESTEPS)`. You can experiment with removing that function just to see what happens, if my description isn't clear enough. – danielcahall Dec 01 '21 at 19:33
  • Regarding the second question, the way the code was originally written, it will just compare to the prediction from the first timestep, since the column in the dataframe was just assigned to the prediction list extracted from the first column. By making that change to `-1` as detailed in the original answer, you can compare it to the last (96th) timestep. However, I think you would actually want to compare the output prediction tensor to `Y_test`, since that contains the ground truth labels for each timestep, for each sample in `X_test`. Does that make sense? – danielcahall Dec 01 '21 at 19:40
  • Thanks for your answer and effort. I really appreciate it. You are absolutely right that I want to compare the "output prediction tensor to `Y_test`". The question is how can I do that? And basically I also would like to plot the results in a diagram. Currently I am doing this (using matplotlib) but only for the comparison of two time series `wholeDataFrameWithPrediciton_traInv['predictions']` and `wholeDataFrameWithPrediciton_traInv ['actual']`. Actually there should be BATCH_SIZE number of plots. – PeterBe Dec 02 '21 at 08:18
  • Thanks for your great answer danielcahall. Any comments on my last comment? I'll highly appreciate every further comment from you. – PeterBe Dec 03 '21 at 08:11
  • Any comments to my last comments? Especially about how to compare the "output prediction tensor to `Y_test`" as you correctly assumed my intention. – PeterBe Dec 06 '21 at 10:35
  • Thanks for your answer and comments. Just for your information: I asked a new (somehow related) question to this post that you can see here https://stackoverflow.com/questions/70361179/how-to-include-feature-values-in-a-time-series-prediction-of-a-rnn-in-keras – PeterBe Dec 15 '21 at 09:27
  • I carefully read through your answer again and tested different things, and there is one crucial point that is striking. It is actually my core question and it is related to your Update on "What are those 5 values representing when I only have 1 target feature?". You wrote "All output should converge to the single target value for the timestep" --> This is not true at all in my example. The output values are extremely different (see also the screenshot). So for me it is a valid question which of those 5 outputs I should choose. If you want I can ask this as a new question. – PeterBe Dec 16 '21 at 13:53
  • The keyword is "should" - that is what should happen in the optimization process, as they are all being compared to the same value (the single target value) through the broadcasting operation. They won't be exactly the same, but they're being driven to the same target value. Also, to restate something I brought up before, if you have one target value per timestep, having 5 output values does not make sense, and while you can combine them in some way (i.e; averaging), it's just adding unnecessary complexity to the problem which can be mitigated by making the model consistent with your objective – danielcahall Dec 16 '21 at 14:52
  • Thanks a lot for your answer and effort. I really appreciate it. You wrote "They won't be exactly the same, but they're being driven to the same target value." --> Actually this is my point. I tested many datasets and configurations and in most cases the values are extremely different which brings my question up which one of those to choose. I understand that I should only have 1 output neuron (then there is no question which one to choose) but I am just curious about what the RNN is really predicting in case of many outputs because the differences are huge between them – PeterBe Dec 17 '21 at 08:18
  • Thanks for your answers and effort. Any comments to my last comment? I made a prediction and chose 90 output neurons for every timeslot. So the RNN predicts 90 values for each sequence. I just randomly picked 3 out of them and compared the minimum and maximum values. Here are the results : Timeslot 1: (min: -2810.8, max: 42750) --> Difference: 45561 (1520 %), Timeslot 2: (min: -2281, max: 53061) --> Difference: 55343 (2325%), Timeslot 3: (min: -6450.36 , max: 74780) --> Difference: 81230 (1159 %). So on average the difference is 60712 (1669%) which is ridiculously high. – PeterBe Dec 21 '21 at 09:37
  • Hi Daniel it's me again. I am still not really sure about my initial question. Do you have any comments to my last 2 comments? The bottom line is that the results of the different neurons are extremely different (between 1.500 % and 2.500 %). So what is the RNN really predicting? It can't be the same timeslot when the differences are so ridiculously high. – PeterBe Jan 11 '22 at 13:42
  • Any comment to my last comments. I'll highly appreciate every further comment from you. – PeterBe Jan 17 '22 at 17:33

I disagree with @danielcahall on just one point:

The output tensor from your model contains the predicted values for 96 steps into the future, for each sample

The output does contain 96 time steps, one for each input time step, and you can take an output to mean whatever you want. But this is just not a good model for what you're trying to do. The main reason is that the RNNs you're using are single direction.

x   x   x   x   x   x    # input
|   |   |   |   |   | 
x-->x-->x-->x-->x-->x    # SimpleRNN
|   |   |   |   |   | 
x-->x-->x-->x-->x-->x    # SimpleRNN
|  /|\ /|\ /|\ /|\  | 
| / | \ | \ | \ | \ |
x   x   x   x   x   x    # Conv
|   |   |   |   |   | 
x   x   x   x   x   x    # Dense -> output

So the first time index of the output only sees the first 2 input times (thanks to the Conv); it can't see the later times. The first prediction is based only on old data. Only the last few outputs can see all the inputs.

use 96 backwards steps to predict 96 steps into the future

Most of the outputs just can't see all the data.
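The drawing can be turned into a rough calculation. The sketch below assumes forward-only RNN layers followed by a Conv1D with "valid" padding, and `visible_inputs` is a made-up helper, not a Keras function. The exact count per output depends on the kernel size and padding (the drawing shows a wider kernel than the kernel size of 1 that the question's settings produce), but the pattern is the same: early outputs see few inputs.

```python
def visible_inputs(output_index, kernel_size=1, strides=1):
    """Input timesteps that can influence one output timestep.

    Assumes forward-only RNN layers followed by a Conv1D with 'valid'
    padding, so RNN step i only depends on inputs 0..i.
    """
    last_rnn_step = output_index * strides + kernel_size - 1
    return list(range(last_rnn_step + 1))


first = visible_inputs(0)   # only input 0
last = visible_inputs(95)   # inputs 0 through 95

print(len(first), len(last))  # 1 96
```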

This model would be appropriate if you were trying to predict 1 step into the future from each of the input times.

To predict 96 steps into the future, it would be much more reasonable to drop return_sequences=True from the last SimpleRNN layer and remove the Conv layer. Then expand the Dense layer to make the whole prediction at once:

model = keras.models.Sequential([
    keras.layers.SimpleRNN(10, return_sequences=True, input_shape=[None, 3]), # output size is (BATCH_SIZE, NUMBER_OF_TIMESTEPS, 10)
    keras.layers.SimpleRNN(10), # output size is (BATCH_SIZE, 10)
    keras.layers.Dense(96) # output size is (BATCH_SIZE, 96)
])

That way all 96 predictions see all 96 inputs.
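One consequence is that the labels have to match the new output shape. A minimal sketch, assuming the labels are still shaped (samples, 96, 1) as in the question; the zero-filled array is only a placeholder.

```python
import numpy as np

# Placeholder labels shaped like Y_train in the question: (samples, 96, 1).
Y_train = np.zeros((8, 96, 1))

# A final Dense(96) layer outputs (samples, 96), so drop the trailing axis.
Y_train_2d = Y_train[..., 0]

print(Y_train.shape, Y_train_2d.shape)  # (8, 96, 1) (8, 96)
```

The 2-D labels would then be passed to `model.fit` in place of the 3-D ones. The alternative is to keep the labels as they are and reshape the model output back to three dimensions.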

See https://www.tensorflow.org/tutorials/structured_data/time_series for more details.

Also SimpleRNN is terrible. Never use it over more than a couple of steps.

mdaoust
  • Thanks a lot mdaoust for your answer. I have some question/remarks on that. 1) You wrote "The first prediction is based only on old data. It's only the last few outputs that can see all the inputs." --> Actually all the predictions are based on old data. I want to use the past 96 time steps to predict the future 96 time steps. So it is totally intended that all the predictions only see old data. 2)Why shall I get rid of return_sequence=true in the second layer but not in the first? In machine learning books it is mentioned that return_sequence =true leads to better convergence in the training. – PeterBe Nov 30 '21 at 17:29
  • Any comments to my last comments or remarks? – PeterBe Dec 01 '21 at 17:09
  • Hi mdaoust. I have a further remark to your suggested approach: When I remove the Convolutional layer as you suggested and try to alter the past data that is used for the prediction by setting `steps_backwards = 192`, I get the error message " ValueError: Dimensions must be equal, but are 192 and 96 for '{{node mean_squared_error/SquaredDifference}} = SquaredDifference[T=DT_FLOAT](sequential_11/time_distributed_11/Reshape_1, IteratorGetNext:1)' with input shapes: [?,192,1], [?,96,1]." thrown by the line `history = model.fit(...`. So without the Conv layer I can't change this. – PeterBe Dec 02 '21 at 09:00
  • I have another 4th question to your answer. 4) When I use, as you suggested, in the last output layer `keras.layers.Dense(96) )` the output is, as you stated `(BATCH_SIZE, 96)` as opposed to `(BATCH_SIZE, NUMBER_OF_TIMESTEPS, 1)` when using `keras.layers.TimeDistributed(keras.layers.Dense(1)`. What I don't understand is that the label vector `Y_train` still has the size `(BATCH_SIZE, NUMBER_OF_TIMESTEPS, 1)` and the feature vector X_train `(BATCH_SIZE, NUMBER_OF_TIMESTEPS, NUMBER_OF_FEATURES)` – PeterBe Dec 03 '21 at 08:07
  • How is actually the training done with your code? A 3 dimensional vector X_train would be mapped to a 2 dimensional vector (that is the output of your code) instead of a 3 dimensional vector when using the time distributed layer. This seems very strange. Further, when using another output size in the last layer with your approach e.g. `keras.layers.Dense(5)` this becomes even more unclear. Would you mind elaborating a little bit more on that? I'll highly appreciate every further comment from you. – PeterBe Dec 03 '21 at 08:11
  • Thanks mdaoust for your answer. Any comments to my last comments? Actually I have 4 questions and remarks to your answer. I'll appreciate it if you try to answer them. – PeterBe Dec 06 '21 at 10:37
  • "Actually all the predictions are based on old data", yes. But look at how each piece of data moves through the network. Only the first couple of inputs are visible to the first output. Look at the arrows in the ASCII drawing. – mdaoust Dec 28 '21 at 01:08
  • "return_sequence =true leads to better convergence in the training". It depends on the situation. – mdaoust Dec 28 '21 at 01:08
  • "are 192 and 96 for '{{node mean_squared_error/SquaredDifference}}". The problem is exactly what it said. You need to be sure that the network's output is the same size as the dataset's label. – mdaoust Dec 28 '21 at 01:09
  • "A 3 dimensional vector X_train would be mapped to a 2 dimensional vector (that is the output of your code) instead of a 3 dimensional vector,... This seems very strange. ... when using another output size in the last layer with your approach e.g. keras.layers.Dense(5)". Yeah, more complete code would have been: `..., layers.Dense(TIMESTEPS*OUT_FEATURES), layers.Reshape((TIMESTEPS, OUT_FEATURES)),...`. The dense layer produces all the features and times in one shot. – mdaoust Dec 28 '21 at 01:14