Why is the embedded input multiplied by sqrt(dim) in the TensorFlow Transformer tutorial?

Question

I'm trying to build Speech Transformer models using TensorFlow 2.1.0. There is one line in the TensorFlow tutorial that I cannot understand.

In the Transformer tutorial at https://www.tensorflow.org/tutorials/text/transformer, the embedded inputs of both the encoder and the decoder are multiplied by sqrt(d_model).

In the Encoder class,

  def call(self, x, training, mask):

    seq_len = tf.shape(x)[1]

    # adding embedding and position encoding.
    x = self.embedding(x)  # (batch_size, input_seq_len, d_model)
    # This is the line I'm asking about. The exact same line also appears in the call function of the Decoder class.
    x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
    x += self.pos_encoding[:, :seq_len, :]

    x = self.dropout(x, training=training)

    for i in range(self.num_layers):
      x = self.enc_layers[i](x, training, mask)

    return x  # (batch_size, input_seq_len, d_model)

Could you explain why this is done? (I could not find any line that multiplies the embedded inputs by sqrt(dim) in the official Transformer model at https://github.com/tensorflow/models/blob/master/official/nlp/transformer/transformer.py.)

It was discussed here a bit: https://stackoverflow.com/questions/56930821/why-does-embedding-vector-multiplied-by-a-constant-in-transformer-model — Anastasiia Iurshina, Mar 22 '20 at 09:27
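
For reference, here is a rough numeric sketch of what that line does to the magnitudes involved. It is only an illustration and does not settle why the tutorial authors chose it. It assumes the embedding weights are still at their initial values, drawn uniformly from [-0.05, 0.05] (the Keras Embedding default), and it uses a pure-Python sinusoidal encoding rather than the tutorial's own code. The values d_model = 512 and seq_len = 50 and the helpers rms and positional_encoding are made up for the example.

```python
import math
import random

d_model = 512  # illustrative model width (assumption, not from the question)
seq_len = 50   # illustrative sequence length (assumption)

random.seed(0)


def rms(rows):
    """Root-mean-square of all values in a list of rows (hypothetical helper)."""
    values = [v for row in rows for v in row]
    return math.sqrt(sum(v * v for v in values) / len(values))


def positional_encoding(length, depth):
    """Sinusoidal encoding: sin on even indices, cos on odd indices."""
    rows = []
    for pos in range(length):
        row = []
        for i in range(depth):
            angle = pos / (10000 ** ((2 * (i // 2)) / depth))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        rows.append(row)
    return rows


# Assumption: freshly initialized embeddings, uniform in [-0.05, 0.05].
embedded = [[random.uniform(-0.05, 0.05) for _ in range(d_model)]
            for _ in range(seq_len)]

# The operation from the question: x *= sqrt(d_model)
scale = math.sqrt(d_model)
scaled = [[v * scale for v in row] for row in embedded]

pos_enc = positional_encoding(seq_len, d_model)

rms_embedded = rms(embedded)
rms_scaled = rms(scaled)
rms_pos = rms(pos_enc)

print("RMS of raw embeddings:     ", rms_embedded)
print("RMS of scaled embeddings:  ", rms_scaled)
print("RMS of positional encoding:", rms_pos)
```

Under these assumptions the raw embedding values are more than an order of magnitude smaller than the positional encoding that gets added on the next line, while the scaled ones are of comparable size. Trained embeddings will of course have a different distribution, so this only describes the starting point.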
