
I have multiple time series in input and I want to properly build an LSTM model.

I'm really confused about how to choose the parameters. My code:

model.add(keras.layers.LSTM(hidden_nodes, input_shape=(window, num_features), consume_less="mem"))
model.add(Dropout(0.2))
model.add(keras.layers.Dense(num_features, activation='sigmoid'))

optimizer = keras.optimizers.SGD(lr=learning_rate, decay=1e-6, momentum=0.9, nesterov=True)

I want to understand, for each line, the meaning of the input parameters and how they should be chosen.

Actually I don't have any problems with the code, but I need to understand the parameters clearly in order to obtain better results.

Thanks a lot!

petezurich
Alessandro
    That's a very broad question that does not directly relate to programming. Can you be more specific? What have you tried yourself to find out, and where exactly do you struggle in understanding? You may want to look into this too: https://stackoverflow.com/questions/38714959/understanding-keras-lstms?rq=1 – petezurich Jul 24 '17 at 10:52
  • I read the articles and I know that it's a very broad question, but I'm searching for a general explanation of those parameters. I hope to collect experience from those who have used them. – Alessandro Jul 24 '17 at 11:21

1 Answer


This part of the keras.io documentation is quite helpful:

LSTM Input Shape: 3D tensor with shape (batch_size, timesteps, input_dim)

Here is also a picture that illustrates this: [image: diagram of the 3D LSTM input tensor (batch_size, timesteps, input_dim)]
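To make that 3D shape concrete, here is a small sketch (using NumPy, with made-up sizes) that slices one multivariate time series into overlapping windows of exactly that shape:

```python
import numpy as np

# Hypothetical raw data: 100 time steps of 3 parallel series (input_dim = 3)
series = np.random.rand(100, 3)

window = 10  # timesteps per sample

# Slice into overlapping windows -> (batch_size, timesteps, input_dim)
X = np.stack([series[i:i + window] for i in range(len(series) - window)])

print(X.shape)  # (90, 10, 3)
```

Each of the 90 samples is one window of 10 consecutive time steps, each step holding 3 feature values.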

I will also explain the parameters in your example:

model.add(LSTM(hidden_nodes, input_shape=(timesteps, input_dim)))
model.add(Dropout(dropout_value))

hidden_nodes = This is the number of neurons of the LSTM. A higher number makes the network more expressive. However, the number of parameters to learn also rises. This means it needs more time to train the network.

timesteps = the number of timesteps you want to consider. E.g. if you want to classify a sentence, this would be the number of words in a sentence.

input_dim = the dimension of your features/embeddings, i.e. the size of the vector that describes each timestep. E.g. the length of the vector representation of each word in the sentence.

dropout_value = To reduce overfitting, the dropout layer randomly drops a fraction of the unit activations during training. This value is the fraction of activations that is set to zero per batch.
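Here is a minimal sketch of what happens inside the dropout layer during training (this is "inverted" dropout, the variant Keras uses; the sizes and seed are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
dropout_value = 0.2

activations = np.ones(10)
# A random ~20% of the units are zeroed out; the survivors are scaled up
# by 1 / (1 - dropout_value) so the expected total activation is unchanged.
mask = rng.random(activations.shape) >= dropout_value
dropped = activations * mask / (1.0 - dropout_value)
```

At prediction time nothing is dropped; the layer simply passes its input through.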

As you can see, there is no need to specify the batch_size. Keras will automatically take care of it.

optimizer = keras.optimizers.SGD(lr=learning_rate, decay=1e-6, momentum=0.9, nesterov=True)

learning_rate = Indicates how much the weights are updated per batch.

decay = How much the learning_rate decreases over time.

momentum = The rate of momentum. A higher value helps to overcome local minima and thus speed up the learning process. Further explanation.

nesterov = If nesterov momentum should be used. Here is a good explanation.
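To see how the four settings interact, here is a minimal pure-Python sketch of a single SGD update step in the style of the (older) Keras SGD optimizer: time-based decay shrinks the learning rate, and a velocity term carries momentum (with an optional Nesterov look-ahead):

```python
def sgd_step(w, grad, velocity, lr0, iteration,
             decay=1e-6, momentum=0.9, nesterov=True):
    # decay: the effective learning rate shrinks as training progresses
    lr = lr0 / (1.0 + decay * iteration)
    # momentum: a running velocity lets updates roll through small local minima
    velocity = momentum * velocity - lr * grad
    if nesterov:
        # Nesterov momentum: apply an extra look-ahead correction
        w = w + momentum * velocity - lr * grad
    else:
        w = w + velocity
    return w, velocity

w, v = 1.0, 0.0
for t in range(3):
    w, v = sgd_step(w, grad=0.5, velocity=v, lr0=0.1, iteration=t)
```

With a constant positive gradient the weight keeps decreasing and the velocity builds up, which is exactly the "overcoming local minima" effect described above.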

Alex
  • What about SGD parameters? – Alessandro Jul 24 '17 at 15:46
  • Does SGD parameters influence classification results? – Alessandro Jul 24 '17 at 16:50
  • 1
    For sure, like every other hyperparameter. I typically prefer other optimizers, because they have improved SGD, like e.g. [adam](https://stats.stackexchange.com/questions/184448/difference-between-gradientdescentoptimizer-and-adamoptimizer-tensorflow) – Alex Jul 24 '17 at 17:01