I'm trying to make a seq2seq chatbot with TensorFlow, but it converges to the same outputs regardless of the input. The model gives different outputs when first initialized, but quickly collapses to the same outputs after a few epochs. This remains an issue even after many epochs and at low cost values. However, the model seems to do fine when trained on small datasets (say, 20 examples) but fails with larger ones.
I'm training on the Cornell Movie Dialogs Corpus with pretrained 100-dimensional GloVe embeddings and a 50,000-word vocabulary.
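In case it's relevant, the embedding matrix is built by parsing the GloVe text file into a fixed lookup table, roughly like the sketch below (condensed for this question; the load_glove name, the file path, and the trainable=False choice are illustrative rather than the exact code from the repo):

import numpy as np
import tensorflow as tf

def load_glove(path, vocab_size=50000, dim=100):
    # Each line is "word v1 v2 ... v100"; build word -> index and the matrix.
    word2idx, vectors = {}, []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= vocab_size:
                break
            parts = line.rstrip().split(" ")
            assert len(parts) == dim + 1
            word2idx[parts[0]] = i
            vectors.append([float(v) for v in parts[1:]])
    return word2idx, np.asarray(vectors, dtype=np.float32)

word2idx, glove_matrix = load_glove("glove.6B.100d.txt")
word_embedding = tf.get_variable(
    "word_embedding", initializer=glove_matrix, trainable=False)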
The encoder seems to produce very close final states (within a range of around 0.01) for totally different inputs. I've tried a simple LSTM/GRU, a bidirectional LSTM/GRU, a multi-layer/stacked LSTM/GRU, and a multi-layer bidirectional LSTM/GRU. I've tested the RNN cells with anywhere from 16 to 2048 hidden units. The only difference is that the model tends to output only the start and end tokens (GO and EOS) when it has fewer hidden units.
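To clarify what I mean by "final states" in the bidirectional case: I concatenate the forward and backward GRU states, along the lines of this stripped-down single-layer sketch (variable names are illustrative):

import tensorflow as tf

n_hidden = 512  # swept from 16 to 2048

embedded_x = tf.placeholder(tf.float32, [None, None, 100])  # [batch, time, emb]
x_length = tf.placeholder(tf.int32, [None])

fw_cell = tf.contrib.rnn.GRUCell(n_hidden)
bw_cell = tf.contrib.rnn.GRUCell(n_hidden)
_, (state_fw, state_bw) = tf.nn.bidirectional_dynamic_rnn(
    fw_cell, bw_cell, embedded_x,
    sequence_length=x_length, dtype=tf.float32)

# This [batch, 2 * n_hidden] tensor is what I compare across inputs
# (and what a decoder GRU with 2 * n_hidden units would take as its initial state).
encoder_state = tf.concat([state_fw, state_bw], axis=1)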
For the multi-layer GRU, here's my code:
# Three stacked GRU layers for the encoder
cell_encode_0 = tf.contrib.rnn.GRUCell(self.n_hidden)
cell_encode_1 = tf.contrib.rnn.GRUCell(self.n_hidden)
cell_encode_2 = tf.contrib.rnn.GRUCell(self.n_hidden)
self.cell_encode = tf.contrib.rnn.MultiRNNCell([cell_encode_0, cell_encode_1, cell_encode_2])
# identical decoder
...
# Look up GloVe vectors for the encoder inputs (x) and decoder inputs (y)
embedded_x = tf.nn.embedding_lookup(self.word_embedding, self.x)
embedded_y = tf.nn.embedding_lookup(self.word_embedding, self.y)
_, self.encoder_state = tf.nn.dynamic_rnn(
    self.cell_encode,
    inputs=embedded_x,
    dtype=tf.float32,
    sequence_length=self.x_length
)
# decoder for training
helper = tf.contrib.seq2seq.TrainingHelper(
    inputs=embedded_y,
    sequence_length=self.y_length
)
decoder = tf.contrib.seq2seq.BasicDecoder(
    self.cell_decode,
    helper,
    self.encoder_state,
    output_layer=self.projection_layer
)
outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(decoder, maximum_iterations=self.max_sequence, swap_memory=True)
return outputs.rnn_output
...
# Optimization
dynamic_max_sequence = tf.reduce_max(self.y_length)
mask = tf.sequence_mask(self.y_length, maxlen=dynamic_max_sequence, dtype=tf.float32)
crossent = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=self.y[:, :dynamic_max_sequence], logits=self.network())
self.cost = (tf.reduce_sum(crossent * mask) / batch_size)
self.train_op = tf.train.AdamOptimizer(self.learning_rate).minimize(self.cost)
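At test time the replies come from greedy decoding; the inference graph is set up along these lines (condensed sketch, with go_id / eos_id standing in for the actual GO and EOS token ids; see the repo for the exact version):

# Greedy inference decoder (sketch): feed the previous argmax token back in
# until EOS or maximum_iterations is reached.
infer_helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(
    embedding=self.word_embedding,
    start_tokens=tf.fill([batch_size], go_id),
    end_token=eos_id)
infer_decoder = tf.contrib.seq2seq.BasicDecoder(
    self.cell_decode,
    infer_helper,
    self.encoder_state,
    output_layer=self.projection_layer)
infer_outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(
    infer_decoder, maximum_iterations=self.max_sequence)
predictions = infer_outputs.sample_id  # [batch, time] greedy token ids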
For the full code, please see the project on GitHub (if you want to test it out, run train.py).
As for hyperparameters, I've tried learning rates from 0.1 all the way down to 0.0001 and batch sizes from 1 to 32. Apart from the usual, expected effects, changing them does not help with the problem.
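For reference, the "around 0.01" figure above comes from a check like this: fetch self.encoder_state for one batch of clearly different inputs and measure how far apart the rows are (pure-NumPy helper; the state_spread name is just for this question):

import numpy as np

def state_spread(states):
    # states: [batch, n_hidden] array, e.g. one layer of
    # sess.run(self.encoder_state, feed_dict=...) for a batch of
    # *different* inputs. Returns the mean pairwise L2 distance;
    # values near zero mean the encoder has effectively collapsed.
    diffs = states[:, None, :] - states[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    batch = states.shape[0]
    return dists.sum() / (batch * (batch - 1))

# Shape-convention demo with random data (real states come from the session):
print(state_spread(np.random.randn(4, 512).astype(np.float32)))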