I'm converting a TensorFlow transformer model to its PyTorch equivalent.
In the TF multi-head attention part of the code I have:
att = layers.MultiHeadAttention(num_heads=6, key_dim=4)
and the input shape is [None, 136, 4], where None is the batch size, 136 is the sequence length and 4 is the embedding dimension. num_heads is the number of heads and key_dim is the dimension of each head. The layer is called as att(query=input, value=input).
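For reference, here is a minimal sketch of that call on the TF side (the batch size of 2 and the random tensor are just placeholders I made up for illustration):
import tensorflow as tf
from tensorflow.keras import layers

att = layers.MultiHeadAttention(num_heads=6, key_dim=4)
x = tf.random.normal((2, 136, 4))    # (batch, seq_len, embed_dim); batch of 2 is a placeholder
out = att(query=x, value=x)          # key defaults to value
print(out.shape)                     # (2, 136, 4) -- the output dim follows the query's last dim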
In PyTorch, the MHA is defined as att = nn.MultiheadAttention(embed_dim, num_heads), where embed_dim is the dimension of the model (not the dimension of each head) and it must be divisible by num_heads = 6. Since the input shape is [None, 136, 4] and 4 is not divisible by 6, PyTorch raises an error about divisibility. How should I change my input to be able to use PyTorch instead of TF?
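Here is a minimal reproduction of that error (the exact wording of the message may differ between PyTorch versions):
import torch.nn as nn

# embed_dim=4 matches my input's last dimension, but 4 is not divisible by num_heads=6,
# so construction already fails with something like
# "AssertionError: embed_dim must be divisible by num_heads"
att = nn.MultiheadAttention(embed_dim=4, num_heads=6)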
If key_dim in TF is 4 and there are 6 heads, the whole model's dimension must be 4 * 6 = 24. I defined embed_dim in PyTorch as 24, but because of the input size of [None, 136, 4], PyTorch raises a dimension error saying "expected input of size 24 but got 4".
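This is roughly what that attempt looks like (batch_first=True is my assumption to keep the [batch, seq, dim] layout, and the batch size of 2 is a placeholder):
import torch
import torch.nn as nn

att = nn.MultiheadAttention(embed_dim=24, num_heads=6, batch_first=True)  # 24 = 4 * 6
x = torch.randn(2, 136, 4)             # my real input still has embedding dim 4
out, _ = att(query=x, key=x, value=x)  # fails: the layer expects the last dim to equal embed_dim (24)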
Is it OK to repeat my input num_heads times to fix the problem?
Does TF feed the input directly to each head without splitting it into num_heads chunks?
How can I convert the TF MHA to a PyTorch MHA?