Understanding stateful LSTM

Question

I'm going through this tutorial on RNNs/LSTMs and I'm having quite a hard time understanding stateful LSTMs. My questions are as follows :

1. Training batching size

In the Keras docs on RNNs, I found out that the hidden state of the sample in i-th position within the batch will be fed as input hidden state for the sample in i-th position in the next batch. Does that mean that if we want to pass the hidden state from sample to sample we have to use batches of size 1 and therefore perform online gradient descent? Is there a way to pass the hidden state within a batch of size >1 and perform gradient descent on that batch ?

2. One-Char Mapping Problems

In the tutorial's paragraph 'Stateful LSTM for a One-Char to One-Char Mapping' were given a code that uses batch_size = 1 and stateful = True to learn to predict the next letter of the alphabet given a letter of the alphabet. In the last part of the code (line 53 to the end of the complete code), the model is tested starting with a random letter ('K') and predicts 'B' then given 'B' it predicts 'C', etc. It seems to work well except for 'K'. However, I tried the following tweak to the code (last part too, I kept lines 52 and above):

    # demonstrate a random starting point
    letter1 = "M"
    seed1 = [char_to_int[letter1]]
    x = numpy.reshape(seed, (1, len(seed), 1))
    x = x / float(len(alphabet))
    prediction = model.predict(x, verbose=0)
    index = numpy.argmax(prediction)
    print(int_to_char[seed1[0]], "->", int_to_char[index])
    letter2 = "E"
    seed2 = [char_to_int[letter2]]
    seed = seed2
    print("New start: ", letter1, letter2)
    for i in range(0, 5):
        x = numpy.reshape(seed, (1, len(seed), 1))
        x = x / float(len(alphabet))
        prediction = model.predict(x, verbose=0)
        index = numpy.argmax(prediction)
        print(int_to_char[seed[0]], "->", int_to_char[index])
        seed = [index]
    model.reset_states()

and these outputs:

    M -> B
    New start: M E
    E -> C
    C -> D
    D -> E
    E -> F

It looks like the LSTM did not learn the alphabet but just the positions of the letters, and that regardless of the first letter we feed in, the LSTM will always predict B since it's the second letter, then C and so on.

Therefore, how does keeping the previous hidden state as initial hidden state for the current hidden state help us with the learning given that during test if we start with the letter 'K' for example, letters A to J will not have been fed in before and the initial hidden state won't be the same as during training ?

3. Training an LSTM on a book for sentence generation

I want to train my LSTM on a whole book to learn how to generate sentences and perhaps learn the authors style too, how can I naturally train my LSTM on that text (input the whole text and let the LSTM figure out the dependencies between the words) instead of having to 'artificially' create batches of sentences from that book myself to train my LSTM on? I believe I should use stateful LSTMs could help but I'm not sure how.

For future reference, this could have been split up into three separate questions. Additionally, the last question would have been more appropriate for stats.stackexchange.com. Finally, you shouldn't put the tag in the question title. — Seanny123, May 30 '17 at 02:46

jdehesa · Answer 1 · 2020-08-17T15:13:42.307

Having a stateful LSTM in Keras means that a Keras variable will be used to store and update the state, and in fact you could check the value of the state vector(s) at any time (that is, until you call reset_states()). A non-stateful model, on the other hand, will use an initial zero state every time it processes a batch, so it is as if you always called reset_states() after train_on_batch, test_on_batch and predict_on_batch. The explanation about the state being reused for the next batch on stateful models is just about that difference with non-stateful; of course the state will always flow within each sequence in the batch and you do not need to have batches of size 1 for that to happen. I see two scenarios where stateful models are useful:

You want to train on split sequences of data because these are very long and it would not be practical to train on their whole length.
On prediction time, you want to retrieve the output for each time point in the sequence, not just at the end (either because you want to feed it back into the network or because your application needs it). I personally do that in the models that I export for later integration (which are "copies" of the training model with batch size of 1).

I agree that the example of an RNN for the alphabet does not really seem very useful in practice; it will only work when you start with the letter A. If you want to learn to reproduce the alphabet starting at any letter, you would need to train the network with that kind of examples (subsequences or rotations of the alphabet). But I think a regular feed-forward network could learn to predict the next letter of the alphabet training on pairs like (A, B), (B, C), etc. I think the example is meant for demonstrative purposes more than anything else.
You may have probably already read it, but the popular post The Unreasonable Effectiveness of Recurrent Neural Networks shows some interesting results along the lines of what you want to do (although it does not really dive into implementation specifics). I don't have personal experience training RNN with textual data, but there is a number of approaches you can research. You can build character-based models (like the ones in the post), where your input and receive one character at a time. A more advanced approach is to do some preprocessing on the texts and transform them into sequences of numbers; Keras includes some text preprocessing functions to do that. Having one single number as feature space is probably not going to work all that well, so you could simply turn each word into a vector with one-hot encoding or, more interestingly, have the network learn the best vector representation for each for, which is what they call en embedding. You can go even further with the preprocessing and look into something like NLTK, specially if you want to remove stop words, punctuation and things like that. Finally, if you have sequences of different sizes (e.g. you are using full texts instead of excerpts of a fixed size, which may or may not be important for you) you will need to be a bit more careful and use masking and/or sample weighting. Depending on the exact problem, you can set up the training accordingly. If you want to learn to generate similar text, the "Y" would be the similar to the "X" (one-hot encoded), only shifted by one (or more) positions (in this case you may need to use return_sequences=True and TimeDistributed layers). If you want to determine the autor, your output could be a softmax Dense layer.

Hope that helps.

Yes, question 2 is only for the sake of learning, but I was wondering throughout that example how keeping the previous hidden state as initial hidden state for the next sample helps us, given that during test we won't necessarily have that context. It actually seems to lower the performance rather than improving it since the weights we learn are learnt with the wrong hidden state (especially for the first few elements of the sequences). — H.M., Jan 19 '17 at 18:43
"_of course the state will always flow within the batch_" Why would the state flow within a batch?! The samples tend to be independent, especially if shuffled. — Unknown, Aug 14 '20 at 19:58
@Unknown I think the assumption I was making here is that you have a batch with shape `(batch_size, sequence_length, num_features)`, and what I meant is that the state always flows throughout the second dimension, that is, within the same sequence, not between different sequences. Whether the recurrent layer is stateful or not, the state should always flow within a single batch (and with a stateful layer you can make it flow to the next one). — jdehesa, Aug 17 '20 at 09:12
"_the state should always flow within a single batch_" I think it would be best to replace "batch" with "sample" in this sentence :) Because as you clarified your assumption, a batch can (and often does) have multiple samples. — Unknown, Aug 17 '20 at 14:57

Understanding stateful LSTM

1. Training batching size

2. One-Char Mapping Problems

3. Training an LSTM on a book for sentence generation

1 Answers1

Linked