
Context:

  • I have a recurrent neural network with LSTM cells
  • The input to the network is a batch of size (batch_size, number_of_timesteps, one_hot_encoded_class) in my case (128, 300, 38)
  • The different rows of the batch (1-128) are not necessarily related to each other
  • The target for one time step is given by the value of the next time step.

My questions: When I train the network using an input batch of (128,300,38) and a target batch of the same size,

  1. does the network always consider only the last time-step t to predict the value of the next timestep t+1?

  2. or does it consider all time steps from the beginning of the sequence up to time step t?

  3. or does the LSTM cell internally remember all previous states?

I am confused about how this works, because the network is trained on multiple time steps simultaneously, so I am not sure how the LSTM cell can still have knowledge of the previous states.

I hope somebody can help. Thanks in advance!

Code for discussion:

    cells = []
    for i in range(self.n_layers):
        cells.append(tf.contrib.rnn.LSTMCell(self.n_hidden))

    # Stack the layers and start from an all-zero state for each batch.
    cell = tf.contrib.rnn.MultiRNNCell(cells)
    init_state = cell.zero_state(self.batch_size, tf.float32)

    # dynamic_rnn unrolls the cell over all time steps of the input;
    # final_state is the cell state after the last step.
    outputs, final_state = tf.nn.dynamic_rnn(
        cell, inputs=self.inputs, initial_state=init_state)

    # Per-time-step projection onto the output classes.
    self.logits = tf.contrib.layers.linear(outputs, self.num_classes)

    softmax_ce = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=labels, logits=self.logits)

    self.loss = tf.reduce_mean(softmax_ce)
    self.train_step = tf.train.AdamOptimizer(self.lr).minimize(self.loss)
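For context on the setup described above (the target for one time step is the value of the next time step), the target batch is just the input batch shifted one step ahead in time. A minimal numpy sketch, using the shapes from the question (the variable names are illustrative, not part of the model above):

```python
import numpy as np

batch_size, n_steps, n_classes = 128, 300, 38

# A hypothetical one-hot input batch, as described in the question.
rng = np.random.default_rng(0)
classes = rng.integers(0, n_classes, size=(batch_size, n_steps))
inputs = np.zeros((batch_size, n_steps, n_classes))
inputs[np.arange(batch_size)[:, None], np.arange(n_steps), classes] = 1.0

# The target at step t is the class observed at step t+1, so the target
# batch is the class sequence shifted left by one step; the last step
# has no successor, so it is dropped (inputs are trimmed to match).
x = inputs[:, :-1, :]   # (128, 299, 38) one-hot inputs
y = classes[:, 1:]      # (128, 299) sparse class ids for the loss

print(x.shape, y.shape)  # (128, 299, 38) (128, 299)
```

The sparse targets `y` match the `sparse_softmax_cross_entropy_with_logits` loss in the code, which expects class ids rather than one-hot vectors.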
Lemon

2 Answers


[image: a simple RNN unrolled over three time steps]

The above is a simple RNN unrolled to the neuron level with 3 time steps.

As you can see, the output at time step t depends on all time steps from the beginning. The network is trained using back-propagation through time, where the weights are updated by the contributions of the error gradients across all time steps. The weights are shared across time, so there is no such thing as a separate, simultaneous update per time step.

The knowledge of the previous states is transferred through the state variable s_t, which is a function of the previous inputs. So at any time step, the prediction is based on the current input as well as on (a function of) the previous inputs, captured by the state variable.

NOTE: A basic RNN is shown instead of an LSTM for simplicity.
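The recurrence in the figure can be sketched in a few lines of numpy: s_t = tanh(U x_t + W s_{t-1}), with the same U and W reused at every step. The weights and sizes below are hypothetical, purely to illustrate that the final state is a function of every input in the sequence:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_steps = 38, 16, 3

U = rng.normal(scale=0.1, size=(n_hidden, n_in))      # input-to-state weights
W = rng.normal(scale=0.1, size=(n_hidden, n_hidden))  # state-to-state weights

x = rng.normal(size=(n_steps, n_in))  # one sequence, 3 time steps
s = np.zeros(n_hidden)                # s_0: the initial (zero) state

states = []
for t in range(n_steps):
    # The same U and W are applied at every step (weights shared in time);
    # s depends on x_t and, through the previous s, on all earlier inputs.
    s = np.tanh(U @ x[t] + W @ s)
    states.append(s)

print(len(states), states[-1].shape)  # 3 (16,)
```

Changing x[0] changes states[-1], which is exactly the "knowledge of previous states" the answer describes; an LSTM replaces the tanh update with its gated update but propagates state in the same way.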

Vijay Mariappan
  • As a comment: when you train a basic RNN over time, the first t instants are forgotten after some iterations, something that doesn't happen with an LSTM. – Shirkam Jul 11 '17 at 12:56
  • To make sure that I understood correctly: there is no simultaneous training on all time steps; the dynamic RNN starts with the initial_state at $t=0$ and predicts a value for $t+1$. The information about this transition is saved in the state of the LSTM cell. At $t+1$ the network receives the next input and tries to predict the value of time step $t+2$. Again, the information about the new input will be stored in the state of the LSTM cell. This process is repeated for all 300 time steps, so that the network has knowledge of all previous steps when predicting the value of step $t+n$. – Lemon Jul 11 '17 at 13:26
  • In the end, the predictions of the network are compared to the target batch. Is this correct? – Lemon Jul 11 '17 at 13:27
  • Yes, you are right. This is a basic RNN (not an LSTM), but the principles are the same. And in the end, the predictions at each step are compared with the target batch. – Vijay Mariappan Jul 11 '17 at 17:02

Here's what would be helpful to keep in mind for your case specifically:

Given the input shape of [128, 300, 38]

  • One call to dynamic_rnn will propagate through all 300 steps, and if you are using something like LSTM, the state will also be carried through those 300 steps
  • However, each SUBSEQUENT call to dynamic_rnn will not automatically remember the state from the previous call. By the second call, the weights etc. will have been updated thanks to the first call, but you still need to pass the state that resulted from the first call into the second call. That's why dynamic_rnn has an initial_state parameter and why one of its outputs is final_state (i.e. the state after processing all 300 steps in ONE call). So you are meant to take the final state from call N and pass it back to dynamic_rnn as the initial state for call N+1. All of this relates specifically to LSTM, since that is what you asked about
  • You are right to note that elements in one batch don't necessarily relate to each other within the same batch. This is something you need to consider carefully. Because with successive calls to dynamic_rnn, batch elements in your input sequences have to relate to their respective counterparts in the previous/following sequence, but not to each other. I.e. element 3 in the first call may have nothing to do with the other 127 elements within the same batch, but element 3 in the NEXT call has to be the temporal/logical continuation of element 3 in the PREVIOUS call, and so forth. This way, the state that you keep passing forward makes sense continuously
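Conceptually, the state hand-off between successive calls looks like this. This is a numpy stand-in for the cell, not the TensorFlow API: run_chunk is a hypothetical helper playing the role of one dynamic_rnn call, with illustrative weights U and W:

```python
import numpy as np

n_in, n_hidden = 38, 16
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(n_hidden, n_in))
W = rng.normal(scale=0.1, size=(n_hidden, n_hidden))

def run_chunk(x_chunk, initial_state):
    """Stand-in for one dynamic_rnn call: propagate the state through
    every step of the chunk and return (outputs, final_state)."""
    s = initial_state
    outputs = []
    for t in range(x_chunk.shape[0]):
        s = np.tanh(U @ x_chunk[t] + W @ s)
        outputs.append(s)
    return np.stack(outputs), s

x = rng.normal(size=(600, n_in))  # one long sequence, split into two chunks
state = np.zeros(n_hidden)        # the analogue of zero_state for call 1

# Call N's final_state becomes call N+1's initial_state.
out1, state = run_chunk(x[:300], state)
out2, state = run_chunk(x[300:], state)

# Processing the whole sequence in ONE call gives the same final state,
# which is exactly why the hand-off makes the chunks behave as one sequence.
out_full, state_full = run_chunk(x, np.zeros(n_hidden))
print(np.allclose(state, state_full))  # True
```

In TensorFlow itself, this means fetching final_state from one session run and feeding it back as the initial state of the next; if you instead reset to the zero state on every call, the two chunks are treated as unrelated sequences.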
VS_FF
  • To make sure that I understood correctly: there is no simultaneous training on all time steps; the dynamic RNN starts with the initial_state at $t=0$ and predicts a value for $t+1$. The information about this transition is saved in the state of the LSTM cell. At $t+1$ the network receives the next input and tries to predict the value of time step $t+2$. Again, the information about the new input will be stored in the state of the LSTM cell. This process is repeated for all 300 time steps, so that the network has knowledge of all previous steps when predicting the value of step $t+n$. – Lemon Jul 11 '17 at 13:11
  • In the end, the predictions of the network are compared to the target batch. Is this correct? – Lemon Jul 11 '17 at 13:12
  • Also, I have a question left regarding your second bullet point. I have added some code in the original question. My model is defined as a class but the attributes should become clear by their name. As you can see I am providing an initial state to the network. However, I am not sure how I should pass the final state back into the network when training it. At the moment, in the training function I simply call self.train_step multiple times in a loop, always passing a new input batch to the network. – Lemon Jul 11 '17 at 13:13
  • Although you outlined that the next batch has to be related to the previous batch, I don't think that's the case with my data: row 1 in the first batch is not necessarily related to row 1 in the second batch. Also, the second dimension of the input batches, i.e. number_of_timesteps, varies between batches; it depends on the length of the longest sequence in the batch. Do I still have to pass the final state back into the network? – Lemon Jul 11 '17 at 13:13
  • I think in your case, the target for each step in the batch should be the same exact batch but shifted by 1. So at each step dynamic_rnn will compare the actual and the target values and adjust the weights. Also at each step, the state will be updated and propagated to the next step. – VS_FF Jul 11 '17 at 13:19
  • Regarding your subsequent questions: since you are saying that sequences in successive batches don't relate to each other in a logical/temporal way, I'm not sure how much sense it would make to pass the final output state from one call as the input state to the following call to dynamic_rnn. You need to think about this one carefully. But if you INSIST, then you have to re-feed that state as the initial state of the next call. It's a painful procedure, but I'll paste a good link that describes how to do it in the next comment. – VS_FF Jul 11 '17 at 13:21
  • Some answers here cover it well, but once again, think first whether this makes sense for you -- if data is unrelated then no point in passing on the state: https://stackoverflow.com/questions/39112622/how-do-i-set-tensorflow-rnn-state-when-state-is-tuple-true – VS_FF Jul 11 '17 at 13:23
  • Finally, regarding your last question: it's OK for the time-lengths of the members of one batch to be unequal, BUT make sure to specify them in the sequence_length parameter of dynamic_rnn. Otherwise the function will pad everything with zeros and run each sequence to the maximum length in the batch, not only wasting time but, more worryingly, updating the state incorrectly. – VS_FF Jul 11 '17 at 13:24
  • Thanks, that helped a lot. I will try out how it works best! – Lemon Jul 11 '17 at 13:27