
I read this blog here to understand the theoretical background of this, but after reading it I am a bit confused about what **1) timesteps, 2) unrolling, 3) number of hidden units and 4) batch size** are. Maybe someone could explain this on a code basis as well, because when I look into the model config, the code below does not unroll, but what is a timestep doing in this case? Let's say I have data of length 2000 points, split into 40 time steps and one feature. E.g. hidden units are 100, batch size is not defined. What is happening in the model?

model = Sequential()
model.add(LSTM(100, input_shape=(n_timesteps_in, n_features)))
model.add(RepeatVector(n_timesteps_in))
model.add(LSTM(100, return_sequences=True))
model.add(TimeDistributed(Dense(n_features, activation='tanh')))
model.compile(loss='mse', optimizer='adam', metrics=['mae'])

history=model.fit(train, train, epochs=epochs, verbose=2, shuffle=False)

Is the code below still an encoder-decoder model without a RepeatVector?

model = Sequential()
model.add(LSTM(100, return_sequences=True, input_shape=(n_timesteps_in, n_features)))
model.add(LSTM(100, return_sequences=True))
model.add(TimeDistributed(Dense(n_features, activation='tanh')))
model.compile(loss='mse', optimizer='adam', metrics=['mae'])

history=model.fit(train, train, epochs=epochs, verbose=2, shuffle=False)
annstudent93
1 Answer


"Unroll" is just a mechanism to process LSTMs in a way that makes them faster by using more memory. (The details are unknown to me... but it certainly has no influence on steps, shapes, etc.)

When you say "2000 points split into 40 time steps", I have absolutely no idea what is going on.

The data must be meaningfully structured, and saying "2000 data points" really lacks a lot of information.

Data structured for LSTMs is:

  • I have a certain number of individual sequences (data evolving with time)
  • Each sequence has a number of time steps (measures in time)
  • In each step we measured a number of different vars with different meanings (features)

Example:

  • 2000 users in a website
  • They used the site for 40 days
  • In each day I measured the number of times they clicked a button

I can plot how this data evolves with time daily (each day is a step)
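
As a sketch of that example (the Poisson click counts here are invented purely for illustration), such data would be shaped like this in NumPy:

```python
import numpy as np

# Hypothetical data for the example above: 2000 users, 40 days,
# 1 feature per day (number of button clicks)
n_users, n_days, n_features = 2000, 40, 1
clicks = np.random.poisson(lam=3.0, size=(n_users, n_days, n_features))

# Keras LSTMs expect exactly this 3D layout: (samples, timesteps, features)
print(clicks.shape)  # (2000, 40, 1)
```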

So, if you have 2000 sequences (also called "samples" in Keras), each sequence with length of 40 steps, and one single feature per step, this will happen:

Dimensions

  • Batch size is defined as 32 by default in the fit method. The model will process batches containing 32 sequences/users until it reaches 2000 sequences/users.
  • input_shape will be required to be (40, 1) (the batch size is free to choose in fit)
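
A minimal sketch of how that default batching plays out (plain Python arithmetic, no Keras needed):

```python
import math

n_sequences = 2000   # samples in the training data
batch_size = 32      # Keras' default batch_size in fit()

# fit() walks through the data in chunks of 32 sequences per batch;
# the last batch just holds whatever is left over
n_batches = math.ceil(n_sequences / batch_size)
last_batch = n_sequences - (n_batches - 1) * batch_size

print(n_batches)   # 63 batches per epoch
print(last_batch)  # the final batch holds only 16 sequences
```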

Steps

Your LSTMs will try to understand how clicks vary in time, step by step. That's why they're recurrent, they calculate things for a step and feed these things into the next step, until all 40 steps are processed. (You won't see this processing, though, it's internal)

  • With return_sequences=True, you will get the output for all steps.
  • Without it, you will get only the output for the last step.
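
That shape rule can be written down as a tiny helper (a sketch of the rule itself, not a Keras API):

```python
# Sketch of the LSTM output-shape rule; `units` is the layer's size argument
def lstm_output_shape(batch, timesteps, units, return_sequences):
    """Shape of LSTM(units)'s output for a (batch, timesteps, features) input."""
    if return_sequences:
        return (batch, timesteps, units)  # one output vector per time step
    return (batch, units)                 # only the last step's output

print(lstm_output_shape(32, 40, 100, return_sequences=True))   # (32, 40, 100)
print(lstm_output_shape(32, 40, 100, return_sequences=False))  # (32, 100)
```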

The model

The model will process 32 parallel (and independent) sequences/users together in each batch.

  • The first LSTM layer will process the entire sequence in recurrent steps and return a final result. (The sequence is killed, there are no steps left because you didn't use return_sequences=True)
    • Output shape = (batch, 100)
  • You create a new sequence with RepeatVector, but this sequence is constant in time.
    • Output shape = (batch, 40, 100)
  • The next LSTM layer processes this constant sequence and produces an output sequence, with all 40 steps
    • Output shape = (batch, 40, 100)
  • The TimeDistributed(Dense) will process each of these steps, but independently (in parallel), not recursively as the LSTMs would do.
    • Output shape = (batch, 40, n_features)
  • The output will be the total group of 2000 sequences (that were processed in groups of 32), each with 40 steps and n_features output features.
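
You can check those shapes directly by rebuilding the question's model with the concrete numbers discussed (40 steps, 1 feature, 100 units); this sketch assumes a TensorFlow/Keras installation:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, LSTM, RepeatVector, TimeDistributed, Dense

model = Sequential([
    Input(shape=(40, 1)),                          # input: (batch, 40, 1)
    LSTM(100),                                     # sequence collapsed: (batch, 100)
    RepeatVector(40),                              # constant sequence: (batch, 40, 100)
    LSTM(100, return_sequences=True),              # (batch, 40, 100)
    TimeDistributed(Dense(1, activation='tanh')),  # (batch, 40, 1)
])

print(model.output_shape)  # (None, 40, 1) -- None is the free batch dimension
```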

Cells, features, units

Everything is independent.

Input features is one thing, output features is another. There is no requirement for Dense to use the same number of features used in input_shape, unless that's what you want.

When you use 100 units in the LSTM layer, it will produce an output sequence of 100 features, shape (batch, 40, 100). If you use 200 units, it will produce an output sequence with 200 features, shape (batch, 40, 200). This is computing power. More neurons = more intelligence in the model.

Something strange in the model:

You should replace:

model.add(LSTM(100, input_shape=(n_timesteps_in, n_features)))
model.add(RepeatVector(n_timesteps_in))

With only:

model.add(LSTM(100, return_sequences=True, input_shape=(n_timesteps_in, n_features)))

Not returning sequences in the first layer and then creating a constant sequence with RepeatVector is sort of destroying the work of your first LSTM.

Daniel Möller
  • great explanation. I just did not understand two things yet: 1) so the classical picture of an LSTM with unrolling has nothing to do with the timesteps in the data shape? 2) when I remove the RepeatVector, is my model still an encoder-decoder model? I try to predict noisy data without doing feature extraction manually – annstudent93 May 23 '18 at 10:45
  • Is there a reference paper for an LSTM encoder-decoder without the RepeatVector part? I found only: Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation as a reference paper for the RepeatVector case – annstudent93 May 23 '18 at 11:01
  • The classical picture of an LSTM shows how the LSTM works. It is "made of" time steps, whether it has the attribute "unroll" or not. When I said that, I meant: using "unroll" in keras won't change this idea at all. It will use timesteps and 3D data either way. – Daniel Möller May 23 '18 at 11:05
  • Technically, anything you do that outputs the same dimensions as the input and you use for training with output data = input data is an autoencoder (encoder-decoder). – Daniel Möller May 23 '18 at 11:06
  • Now choosing between "kill sequence + repeat vector" or just "keep sequence" is actually a free option. Which is better? That belongs to the unanswerable questions and depends on your purposes, your data, etc. – Daniel Möller May 23 '18 at 11:07
  • Very very probably the model that keeps the sequences can learn more easily and bring more precise results (but maybe that is not what you need). I'd say that is a good option... generalize the features but don't touch the steps. – Daniel Möller May 23 '18 at 11:09
  • now I am understanding it a bit better, I will try both keeping and killing the sequence. But I must admit, with keeping the sequence my model trained better and the loss decreased more rapidly than with RepeatVector. – annstudent93 May 23 '18 at 11:23
  • how can I look into the output of the model? I mean I give it [train, train], but can I see what the model actually learned? – annstudent93 May 23 '18 at 13:07
  • What do you mean? Do you want to see the output? `model.predict(train)`. – Daniel Möller May 23 '18 at 13:46
  • no I mean the encoding output, so after the first layer. Another question: the data has many fluctuations and I trained my model with a very small part of the whole data and then tried to predict the whole data. Usually I would expect the model not to predict it completely right, but it did. My aim was to see the error in the prediction, so the model should only predict the seen part right, not the unseen part. What am I doing wrong? – annstudent93 May 23 '18 at 16:28
  • At the end of my answer [here](https://stackoverflow.com/questions/38714959/understanding-keras-lstms/50235563#50235563), there is how you create an autoencoder so you can use each part of the model independently. (You will need `encoder.predict(train)`.) – Daniel Möller May 23 '18 at 16:31
  • Unfortunately this kind of detailed analysis is not possible without your entire code. This is probably subject for a new question. – Daniel Möller May 23 '18 at 16:32
  • I tried to explain my question here with a new post: https://stackoverflow.com/questions/50493609/lstm-prediction-neural-network-too-good-expected-error – annstudent93 May 23 '18 at 16:51