
I'm having a problem with a model I want to train.

It's a typical sequence-to-sequence problem with an attention layer, where the input is a string and the output is a substring of the input string.

e.g.

Input            Ground Truth
-----------------------------
helloimchuck     chuck
johnismyname     john

(This is just dummy data, not part of the real dataset ^^)

And the model looks like this:

from keras.models import Sequential
from keras.layers import Bidirectional, GRU, RepeatVector, TimeDistributed, Dense

model = Sequential()
model.add(Bidirectional(GRU(hidden_size, return_sequences=True), merge_mode='concat',
                        input_shape=(None, input_size)))  # Encoder
model.add(Attention())  # custom layer, sketched below
model.add(RepeatVector(max_out_seq_len))
model.add(GRU(hidden_size * 2, return_sequences=True))  # Decoder
model.add(TimeDistributed(Dense(units=output_size, activation="softmax")))
model.compile(loss="categorical_crossentropy", optimizer="rmsprop", metrics=['accuracy'])
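
Attention() is a custom layer, not a built-in Keras one. A minimal sketch of what such a layer can look like, assuming it collapses the encoder's per-timestep outputs into a single context vector (the 2D input that RepeatVector expects); my actual implementation may differ:

from keras import backend as K
from keras.layers import Layer

class Attention(Layer):
    """Learned softmax weighting over timesteps: maps
    (batch, timesteps, features) to (batch, features)."""

    def build(self, input_shape):
        # one scoring weight per feature dimension
        self.w = self.add_weight(name='att_weight',
                                 shape=(input_shape[-1], 1),
                                 initializer='glorot_uniform',
                                 trainable=True)
        super(Attention, self).build(input_shape)

    def call(self, x):
        # (batch, timesteps, 1): one score per timestep, softmaxed over time
        scores = K.softmax(K.dot(x, self.w), axis=1)
        # weighted sum over the time axis -> (batch, features)
        return K.sum(x * scores, axis=1)

    def compute_output_shape(self, input_shape):
        return (input_shape[0], input_shape[-1])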

The problem is this here:

[Image: training and validation loss curves showing overfitting]

As you can see, there is overfitting.

I'm using an early-stopping criterion on the validation loss with patience=8.

self.Early_stop_criteria = keras.callbacks.EarlyStopping(monitor='val_loss', min_delta=0,
                                                         patience=8, verbose=0,
                                                         mode='auto')
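
The callback is then passed to fit() roughly like this (X_train, y_train and the epoch cap are placeholders, not my exact values):

model.fit(X_train, y_train,
          batch_size=BATCH_SIZE,
          epochs=100,  # upper bound; early stopping usually ends training earlier
          validation_split=0.2,
          callbacks=[self.Early_stop_criteria])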

And I'm feeding the model one-hot vectors.

BATCH_SIZE = 64
HIDDEN_DIM = 128
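
The one-hot encoding looks roughly like this, assuming a character-level model over a plain lowercase alphabet (the alphabet and zero-padding here are simplifications):

import numpy as np

ALPHABET = 'abcdefghijklmnopqrstuvwxyz'
CHAR_TO_IDX = {c: i for i, c in enumerate(ALPHABET)}

def one_hot_encode(s, max_len, vocab_size=len(ALPHABET)):
    # one row per character, zero-padded on the right
    x = np.zeros((max_len, vocab_size), dtype=np.float32)
    for t, c in enumerate(s):
        x[t, CHAR_TO_IDX[c]] = 1.0
    return x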

The thing is, I've tried other batch sizes, other hidden dimensions, and datasets of 10K, 15K, 25K, and now 50K rows. However, there is always overfitting, and I don't know why.

I'm using test_size=0.2 and validation_split=0.2; those are the only parameters I haven't changed.

I've also made sure that the dataset is properly built.

The only idea I have left is to try another validation split, maybe 0.33 instead of 0.2.

I don't know if cross-validation would help.
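
If I did try it, I suppose it would look something like this with scikit-learn's KFold, where X and y are the encoded dataset and build_model is a hypothetical helper that rebuilds the compiled model above from scratch for every fold:

from sklearn.model_selection import KFold

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in kfold.split(X):
    fold_model = build_model()  # hypothetical: recreates the model above
    fold_model.fit(X[train_idx], y[train_idx],
                   validation_data=(X[val_idx], y[val_idx]),
                   batch_size=BATCH_SIZE,
                   epochs=100,
                   callbacks=[self.Early_stop_criteria])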

Maybe someone has a better idea of what I could try. Thanks in advance.

Chuck Aguilar
  • have you tried dropout or batchnorm? – kvish Jan 08 '19 at 17:12
  • Not really. Would you use it just between the input and the first hidden layer? – Chuck Aguilar Jan 08 '19 at 17:23
  • 1
    There are 2 types of dropout you can use. Refer to [this answer](https://stackoverflow.com/a/44929759/10111931) and [this answer](https://stackoverflow.com/a/50721621/10111931) for more details on what the difference is and how to use them with Keras and where we can use them :) – kvish Jan 08 '19 at 17:26

1 Answer


As kvish proposed, dropout was a good solution.

I first tried a dropout of 0.2:

model = Sequential()
model.add(Bidirectional(GRU(hidden_size, return_sequences=True, dropout=0.2), merge_mode='concat',
                        input_shape=(None, input_size)))  # Encoder, now with input dropout
model.add(Attention())
model.add(RepeatVector(max_out_seq_len))
model.add(GRU(hidden_size * 2, return_sequences=True))  # Decoder
model.add(TimeDistributed(Dense(units=output_size, activation="softmax")))
model.compile(loss="categorical_crossentropy", optimizer="rmsprop", metrics=['accuracy'])

With 50K rows it helped, but there was still overfitting.

So I tried a dropout of 0.33, and it worked perfectly.
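
As a side note on the two dropout types from kvish's links: in Keras recurrent layers, dropout applies to the layer's inputs and recurrent_dropout to the recurrent state. I only needed the former, but both can be combined (the 0.33 for recurrent_dropout below is just illustrative, not a value I tuned):

model.add(Bidirectional(GRU(hidden_size, return_sequences=True,
                            dropout=0.33, recurrent_dropout=0.33),
                        merge_mode='concat',
                        input_shape=(None, input_size)))  # Encoder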

Chuck Aguilar