I'm converting a TensorFlow transformer model to its PyTorch equivalent.
In the TF multi-head attention part of the code I have:
att = layers.MultiHeadAttention(num_heads=6, key_dim=4)
and the input shape is [None, 136, 4], where None is the batch size, 136 is the sequence length and 4 is the embedding dimension. num_heads is the number of heads and key_dim is the dimension of each head. The layer is called as att(query=input, value=input).
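For reference, here is a minimal sketch of that call on the TF side (the batch size of 2 and the random tensor are just placeholders I made up for illustration):
import tensorflow as tf
from tensorflow.keras import layers

att = layers.MultiHeadAttention(num_heads=6, key_dim=4)
x = tf.random.normal((2, 136, 4))    # (batch, seq_len, embed_dim); batch of 2 is a placeholder
out = att(query=x, value=x)          # key defaults to value
print(out.shape)                     # (2, 136, 4) -- the output dim follows the query's last dim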
In PyTorch, the MHA is defined as att = nn.MultiheadAttention(embed_dim, num_heads), where embed_dim is the dimension of the model (not the dimension of each head) and it must be divisible by num_heads = 6. Since the input shape is [None, 136, 4] and 4 is not divisible by 6, PyTorch raises an error about divisibility. How should I change my input to be able to use PyTorch instead of TF?
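Here is a minimal reproduction of that error (the exact wording of the message may differ between PyTorch versions):
import torch.nn as nn

# embed_dim=4 matches my input's last dimension, but 4 is not divisible by num_heads=6,
# so construction already fails with something like
# "AssertionError: embed_dim must be divisible by num_heads"
att = nn.MultiheadAttention(embed_dim=4, num_heads=6)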
If key_dim in TF is 4 and there are 6 heads, the whole model's dimension must be 4 * 6 = 24. I defined embed_dim in PyTorch as 24, but because of the input size of [None, 136, 4], PyTorch raises a dimension error saying "expected input of size 24 but got 4".
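This is roughly what that attempt looks like (batch_first=True is my assumption to keep the [batch, seq, dim] layout, and the batch size of 2 is a placeholder):
import torch
import torch.nn as nn

att = nn.MultiheadAttention(embed_dim=24, num_heads=6, batch_first=True)  # 24 = 4 * 6
x = torch.randn(2, 136, 4)             # my real input still has embedding dim 4
out, _ = att(query=x, key=x, value=x)  # fails: the layer expects the last dim to equal embed_dim (24)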
Is it OK to repeat my input num_heads times to fix the problem?
Does TF feed the input directly to each head without splitting it into num_heads chunks?
How can I convert the TF MHA to a PyTorch MHA?