Using beam search with a seq2seq model gives better results, and there are several TensorFlow implementations of it. But because a softmax is applied in each cell, beam search can't be used during training. So is there a modified optimization (loss) function for training with beam search?
4 Answers
As Oliver mentioned, to use beam search in the training procedure we have to use beam search optimization, which is described in the paper Sequence-to-Sequence Learning as Beam-Search Optimization.
We can't use beam search during training with the current loss function, because that loss is a log loss taken at each time step, i.e. a greedy objective. This is also stated clearly in the paper Sequence to Sequence Learning with Neural Networks, section 3.2:
"where S is the training set. Once training is complete, we produce translations by finding the most likely translation according to the LSTM:"
So the original seq2seq architecture uses beam search only at test time. If we want to use beam search at training time, we have to use a different loss and optimization method, as in the paper above.
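To make that concrete, here is a minimal NumPy sketch of the per-time-step log loss used under teacher forcing (no search involved); the names `logits` and `targets` and the array shapes are illustrative, not taken from any particular implementation.

```python
import numpy as np

def per_step_log_loss(logits, targets):
    """Standard seq2seq training loss: the sum over time steps of the
    negative log-probability of the gold token (teacher forcing, no search).

    logits  -- (T, vocab_size) array of decoder outputs, one row per step
    targets -- (T,) array of gold token ids
    """
    loss = 0.0
    for t, gold in enumerate(targets):
        # Softmax over the vocabulary at step t (log-sum-exp for stability).
        z = logits[t] - logits[t].max()
        log_probs = z - np.log(np.exp(z).sum())
        # Negative log-probability of the gold token at this step.
        loss -= log_probs[gold]
    return loss
```

Because each step is scored against the gold token independently, nothing in this objective rewards or penalizes whole-sequence decisions, which is exactly why beam search only enters at decoding time.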

Sequence-to-Sequence Learning as Beam-Search Optimization is a paper that describes the steps necessary to use beam search in the training process: https://arxiv.org/abs/1606.02960
The following issue contains a script that can perform beam search, although it does not contain any of the training logic: https://github.com/tensorflow/tensorflow/issues/654
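For reference, here is a framework-agnostic sketch of beam search decoding (not the script from the linked issue); `score_next` is a hypothetical callback that returns log-probabilities over the vocabulary for a given prefix.

```python
import numpy as np

def beam_search(score_next, bos_id, eos_id, beam_width=4, max_len=50):
    """Sketch of beam search decoding.

    score_next(prefix) -- hypothetical callback returning a 1-D array of
    log-probabilities over the vocabulary, given the tokens decoded so far.
    """
    # Each hypothesis is a (token_list, cumulative_log_prob) pair.
    beams = [([bos_id], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            log_probs = score_next(tokens)
            # Keep only the top-k extensions of this hypothesis.
            for tok in np.argsort(log_probs)[-beam_width:]:
                candidates.append((tokens + [int(tok)],
                                   score + float(log_probs[tok])))
        # Prune to the globally best beam_width hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            if tokens[-1] == eos_id:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:          # every surviving hypothesis has ended
            break
    finished.extend(beams)     # hypotheses cut off at max_len
    return max(finished, key=lambda c: c[1])[0]
```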

I am asking about the optimization function. Do we use a different one, as the paper describes? – Shamane Siriwardhana May 30 '17 at 01:49
No, we do not need to use beam search in the training stage. When training modern seq2seq models such as Transformers, we use the teacher forcing mechanism, where the right-shifted target sequence is fed to the decoder. Beam search can improve generalization, but it is not practical to use in the training stage. There are alternatives, however, such as a label-smoothed cross-entropy loss.
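As a rough illustration of that alternative, here is a NumPy sketch of label-smoothed cross-entropy for a single decoding step; the function name and the `epsilon` default are assumptions made for the example, not taken from any library.

```python
import numpy as np

def label_smoothed_cross_entropy(logits, target, epsilon=0.1):
    """Label-smoothed cross-entropy for one decoding step.

    Instead of putting all probability mass on the gold token, the target
    distribution gives it (1 - epsilon) and spreads epsilon uniformly over
    the remaining vocabulary entries.
    """
    vocab = logits.shape[-1]
    # Log-softmax over the vocabulary (log-sum-exp for stability).
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    # Smoothed target distribution.
    smooth = np.full(vocab, epsilon / (vocab - 1))
    smooth[target] = 1.0 - epsilon
    # Cross-entropy between the smoothed target and the model distribution.
    return float(-(smooth * log_probs).sum())
```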

What I understand is that if the loss is calculated at the individual word level, there is no sense of sequence. A bad sequence (with mostly random words) can have a loss similar to a better sequence (with mostly connected words), since the loss can be spread in different ways over the vocabulary. A toy example is sketched below.
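A tiny numeric illustration of that point (the per-token losses are made-up numbers): two outputs whose losses are distributed very differently over the steps can still sum to the same total.

```python
# Illustrative per-token negative log-likelihoods for two 3-token outputs.
coherent   = [0.9, 1.0, 1.1]   # moderately confident at every step
random_ish = [0.1, 0.1, 2.8]   # two "easy" tokens, one badly wrong token

# Both sum to 3.0, so a per-word loss cannot tell them apart.
assert abs(sum(coherent) - sum(random_ish)) < 1e-9
```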