I'm trying to build a Speech Transformer model using TensorFlow 2.1.0, and there is a line in the TensorFlow tutorial that I cannot understand.
In the Transformer tutorial at https://www.tensorflow.org/tutorials/text/transformer, the embedded inputs of both the encoder and the decoder are multiplied by sqrt(d_model).
In the Encoder class,
    def call(self, x, training, mask):
        seq_len = tf.shape(x)[1]
        # Add the embedding and the positional encoding.
        x = self.embedding(x)  # (batch_size, input_seq_len, d_model)
        # The next line is the one I am asking about. The exact same line
        # also appears in the call method of the Decoder class.
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x += self.pos_encoding[:, :seq_len, :]
        x = self.dropout(x, training=training)
        for i in range(self.num_layers):
            x = self.enc_layers[i](x, training, mask)
        return x  # (batch_size, input_seq_len, d_model)
Could you explain why the embeddings are scaled this way? (I could not find any line that multiplies the embedded inputs by sqrt(d_model) in the official Transformer model at https://github.com/tensorflow/models/blob/master/official/nlp/transformer/transformer.py.)
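To make the question concrete, here is a minimal, framework-free sketch of the step in isolation. It is only an illustration under assumptions, not the tutorial's code: the helper names `positional_encoding` and `l2_norm` are hypothetical, the sinusoidal encoding is written with sine on even indices and cosine on odd indices, and the embedding vector is assumed to start as small uniform values in [-0.05, 0.05]. It prints the size of one embedding vector before and after the sqrt(d_model) scaling, next to the size of one positional-encoding vector.

```python
import math
import random


def positional_encoding(position, d_model):
    """Sinusoidal encoding for one position: sin on even indices, cos on odd."""
    values = []
    for i in range(d_model):
        angle = position / 10000 ** (2 * (i // 2) / d_model)
        values.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return values


def l2_norm(vec):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(v * v for v in vec))


d_model = 512
rng = random.Random(0)  # fixed seed so repeated runs give the same numbers

# Assumption: one freshly initialised embedding vector with small uniform values.
embedding = [rng.uniform(-0.05, 0.05) for _ in range(d_model)]

# The step in question: multiply the embedding by sqrt(d_model).
scaled = [v * math.sqrt(d_model) for v in embedding]

# The vector that is added right after the scaling.
pos_enc = positional_encoding(position=10, d_model=d_model)

print("embedding norm:          ", l2_norm(embedding))
print("scaled embedding norm:   ", l2_norm(scaled))
print("positional encoding norm:", l2_norm(pos_enc))
```

Each sine/cosine pair contributes exactly 1 to the squared norm, so the positional-encoding vector always has norm sqrt(d_model / 2), whatever the position. Whether matching that size is the actual motivation for the scaling is exactly what I am unsure about.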