
To calculate self-attention, for each word we create a Query vector, a Key vector, and a Value vector. These vectors are created by multiplying the word's embedding by three matrices, WQ, WK, and WV, which are learned during training.
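For concreteness, the computation described above can be sketched as follows (a minimal sketch; the dimensions and random values are illustrative assumptions, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8   # assumed embedding size, for illustration only
n_words = 4   # assumed sentence length

# One set of learned projection matrices: WQ, WK, WV.
WQ = rng.normal(size=(d_model, d_model))
WK = rng.normal(size=(d_model, d_model))
WV = rng.normal(size=(d_model, d_model))

# Embeddings for the input words, stacked as rows of X.
X = rng.normal(size=(n_words, d_model))

# Each word's embedding (each row of X) is multiplied by the
# same three matrices to produce its Query, Key, and Value.
Q = X @ WQ
K = X @ WK
V = X @ WV

print(Q.shape, K.shape, V.shape)  # (4, 8) (4, 8) (4, 8)
```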

Question: are these matrices WQ, WK, WV the same for every input word (embedding), or are they different for different words?

Paper link

Vinay Sharma
