
I'm trying to do text prediction with a recurrent neural network (LSTM), using books as the dataset. No matter how much I change the layer sizes or other parameters, it always overfits.

I've tried changing the number of layers, the number of units in the LSTM layers, regularization, normalization, batch_size, shuffling the training/validation data, and switching to bigger datasets. At the moment I'm working with a ~140 kB txt book; I have also tried 200 kB, 1 MB, and 5 MB.

Creating training/validation data:
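The post doesn't show the imports or how text, char2idx, and vocab_length are built; here is a minimal sketch of the usual setup (the file name is a placeholder):

import numpy as np
from keras.utils import np_utils
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

# Hypothetical setup - not shown in the original post
with open('book.txt', encoding='utf-8') as f:  # placeholder file name
    text = f.read()

chars = sorted(set(text))                       # unique characters in the corpus
char2idx = {c: i for i, c in enumerate(chars)}  # character -> integer index
vocab_length = len(chars)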

sequence_length = 30

x_data = []
y_data = []

for i in range(0, len(text) - sequence_length, 1):
    x_sequence = text[i:i + sequence_length]
    y_label = text[i + sequence_length]

    x_data.append([char2idx[char] for char in x_sequence])
    y_data.append(char2idx[y_label])

data_length = len(x_data)  # number of training sequences
X = np.reshape(x_data, (data_length, sequence_length, 1))
X = X / float(vocab_length)  # scale character indices to [0, 1]
y = np_utils.to_categorical(y_data)

# Split into training and testing sets; shuffle=False keeps the split contiguous in the text
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

# Shuffle testing set
X_test, y_test = shuffle(X_test, y_test, random_state=0)

Creating model:

from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense

model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True, recurrent_initializer='glorot_uniform', recurrent_dropout=0.3))
model.add(LSTM(256, return_sequences=True, recurrent_initializer='glorot_uniform', recurrent_dropout=0.3))
model.add(LSTM(256, recurrent_initializer='glorot_uniform', recurrent_dropout=0.3))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))

[Image: model summary]

Compile model:

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

I get the following training curves: [Image: training vs. validation loss and accuracy]

I don't know what to do about this overfitting; I've been searching the internet and trying many things, but none of them seem to work.

How could I get better results? The predictions don't seem good right now.

  • I am not sure what the question is, **but** a model will always overfit if you train for enough epochs. That is why you do an early stop based on the validation set (a minimal early-stopping sketch follows these comments). You can try to delay the overfit, reduce the overfit, or analyse the overfit, but your model will always be somewhere between overfitting and underfitting (both happening with respect to subsets of your dataset). – Frayal Nov 08 '19 at 10:28
  • Early stopping at epoch 10? Also, did you try very extreme regularization or dropout? But also, your validation error isn't increasing, so your network isn't getting worse. You might not be overfitting; you might simply have extracted as much knowledge as is available in your dataset. Your validation error is never going to be as good as your training error. Your charts show that yes, it may be learning the noise in the training set, but this isn't coming at the expense of a worse model out of sample. A classic overfit would show that blue line starting to curve back up (or down for accuracy). – Dan Nov 08 '19 at 10:29
  • As @Dan correctly implies, you are **not** overfitting; this would require your val loss to start *increasing* (accuracy decreasing), which clearly doesn't happen here (see [this thread](https://stackoverflow.com/questions/54041867/are-my-training-and-validation-code-tensorflow-right-and-does-the-model-overfi/54042749)). Your model just saturates, unable to learn further (but w/o getting worse either). Also, dropout should not be used *by default* - it can lead to [performance degradation](https://stackoverflow.com/questions/57894274/reducing-versus-delaying-overfitting-in-neural-network). – desertnaut Nov 08 '19 at 10:50
  • I early stop at 5 if needed... if you really want to get a 30-epoch early stop, you can always reduce the learning rate or the ES epsilon, but meh, it doesn't change a thing in the overall end results. ES works for both overfit and saturation (I did not know the exact word, but this one is very clear), so it is good practice to use it no matter what. – Frayal Nov 08 '19 at 10:54
  • @michalovsky How many parameters do you have in your network? Please check model.summary(). It's possible you should reduce the number of stacked LSTMs from 3; maybe 1 or 2 will work for you. – pawols Nov 08 '19 at 11:01
  • I updated the post with the model summary. The problem is that when I add more layers/units it really overfits (the validation curves go in the other direction), but the generated text is a lot better. At the moment this model doesn't generate text in an acceptable way. I need to get rid of the saturation. – michalovsky Nov 08 '19 at 11:10
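For reference, a minimal sketch of the early stopping suggested in the comments, assuming the model and data defined above (the patience, epochs, and batch_size values are assumptions; the post doesn't show the fit call):

from keras.callbacks import EarlyStopping

model.summary()  # check the parameter count, as suggested in the comments

early_stop = EarlyStopping(monitor='val_loss', patience=5,  # patience is an assumed value
                           restore_best_weights=True)
model.fit(X_train, y_train,
          validation_data=(X_test, y_test),
          epochs=100, batch_size=128,  # assumed values
          callbacks=[early_stop])

With restore_best_weights=True the model keeps the weights from the epoch with the best validation loss, which covers both the overfitting and the saturation cases discussed above.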

1 Answer


Here are some of the things that I would try next. (I am also an amateur; please correct me if I am wrong.)

  1. Try to extract vector representations from the text. Try out word2vec, GloVe, FastText, or ELMo, and feed the vectors into the network. You could also create an embedding layer to help with that; a sketch follows this list. This blog has more information.
  2. 256 recurrent units might be too much. I think one should never start with a huge network: start small, see if you are underfitting, and if so, go larger.
  3. Switch out the optimizer. I find that Adam tends to overfit. I had better success with rmsprop and Adadelta.
  4. Perhaps attention is all you need? Transformers have recently made massive contributions to NLP. Perhaps you could try implementing a simple soft attention mechanism in your network. Here is a nice video series if you are not already familiar, and an interactive research paper on it.
  5. CNNs are also pretty dope in NLP applications, although they intuitively don't make any sense for text data (to most people). Perhaps you could try leveraging them, stacking them, etc. Play around (see the Conv1D sketch below). Here is a guide on how to use them for sentence classification. I know your domain is different, but I think the intuition carries over. :)
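Pulling points 1-3 together, here is a hedged sketch of a smaller, embedding-based variant of the model in the question, trained with rmsprop (the embedding size and unit counts are assumptions, not tested values). Note that with an Embedding layer the inputs stay as raw integer indices of shape (samples, sequence_length), i.e. np.array(x_data), rather than the scaled 3-D X from the question:

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dropout, Dense

small_model = Sequential()
# Learn a vector per character instead of feeding scaled indices (point 1);
# the embedding size of 32 is an assumed value
small_model.add(Embedding(input_dim=vocab_length, output_dim=32,
                          input_length=sequence_length))
# One smaller LSTM layer instead of three stacked 256-unit layers (point 2)
small_model.add(LSTM(128, recurrent_dropout=0.3))
small_model.add(Dropout(0.2))
small_model.add(Dense(y.shape[1], activation='softmax'))
# rmsprop instead of adam (point 3)
small_model.compile(loss='categorical_crossentropy', optimizer='rmsprop',
                    metrics=['accuracy'])

And for point 5, an equally hedged Conv1D variant for character sequences (filter count and kernel size are assumptions):

from keras.layers import Conv1D, GlobalMaxPooling1D

cnn_model = Sequential()
cnn_model.add(Embedding(input_dim=vocab_length, output_dim=32,
                        input_length=sequence_length))
cnn_model.add(Conv1D(filters=64, kernel_size=5, activation='relu'))
cnn_model.add(GlobalMaxPooling1D())  # pool each filter's response over the sequence
cnn_model.add(Dense(y.shape[1], activation='softmax'))
cnn_model.compile(loss='categorical_crossentropy', optimizer='rmsprop',
                  metrics=['accuracy'])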
– cozek
  • Thanks for the answer. My application is character-oriented, not word-oriented, so embeddings are unnecessary. It works like this: x = 50 characters in a sequence, y = a one-hot label for the next character; then move forward one character and take the next 50. I will try the other ideas you described. – michalovsky Nov 09 '19 at 14:04