I am learning the Transformer. Here is the PyTorch documentation for MultiheadAttention. In their implementation, I saw there is a constraint:
assert self.head_dim * num_heads == self.embed_dim, "embed_dim must be divisible by num_heads"
Why do they require that embed_dim be divisible by num_heads?
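For example (assuming a PyTorch version that still contains the assert quoted above), the check already fires at construction time:

import torch.nn as nn

# embed_dim=10 is not divisible by num_heads=3, so this constructor call
# hits the assert above and fails with "embed_dim must be divisible by num_heads".
attn = nn.MultiheadAttention(embed_dim=10, num_heads=3)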
If we go back to the equations, assume Q, K, V are n x embed_dim matrices, and each per-head weight matrix (W_i^Q, W_i^K, W_i^V) is embed_dim x head_dim.

Then the concatenation [head_1, ..., head_h] will be an n x (num_heads * head_dim) matrix, W^O has size (num_heads * head_dim) x embed_dim, and [head_1, ..., head_h] * W^O will be an n x embed_dim output.
I don't see why embed_dim must be divisible by num_heads. Say we have num_heads=10000: the output still has shape n x embed_dim, since the matrix-matrix product with W^O absorbs the extra width.
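To make this concrete, here is a minimal sketch of the computation I have in mind (my own toy code with made-up names, not PyTorch's implementation), where head_dim is a free parameter that does not have to equal embed_dim // num_heads:

import torch

def toy_multihead_attention(x, num_heads, head_dim):
    """Toy multi-head self-attention where head_dim is chosen freely,
    so num_heads * head_dim does not have to equal embed_dim."""
    n, embed_dim = x.shape
    heads = []
    for _ in range(num_heads):
        # per-head projections: embed_dim -> head_dim
        w_q = torch.randn(embed_dim, head_dim)
        w_k = torch.randn(embed_dim, head_dim)
        w_v = torch.randn(embed_dim, head_dim)
        q, k, v = x @ w_q, x @ w_k, x @ w_v                      # each n x head_dim
        attn = torch.softmax(q @ k.T / head_dim ** 0.5, dim=-1)  # n x n
        heads.append(attn @ v)                                   # n x head_dim
    concat = torch.cat(heads, dim=-1)                   # n x (num_heads * head_dim)
    w_o = torch.randn(num_heads * head_dim, embed_dim)  # back to embed_dim
    return concat @ w_o                                 # n x embed_dim

x = torch.randn(5, 10)                                     # n=5, embed_dim=10
out = toy_multihead_attention(x, num_heads=3, head_dim=7)  # 3 * 7 != 10
print(out.shape)                                           # torch.Size([5, 10])

The same function with num_heads=10000 would still return an n x embed_dim output; only the width of the concatenation and of W^O grows.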