
I am working on one of the transformer models that has been proposed for video classification. My input tensor has the shape [batch=16, channels=3, frames=16, H=224, W=224], and the patch embedding is applied to it as follows:

patch_dim = in_channels * patch_size ** 2
self.to_patch_embedding = nn.Sequential(
        Rearrange('b t c (h p1) (w p2) -> b t (h w) (p1 p2 c)', p1 = patch_size, p2 = patch_size),
        nn.Linear(patch_dim, dim),     # <-- root of the error
    )

The parameters that I am using are as follows:

 patch_size =16 
 dim = 192
 in_channels = 3

Unfortunately I receive the following error, which corresponds to the line marked in the code:

Exception has occurred: RuntimeError
mat1 and mat2 shapes cannot be multiplied (9408x4096 and 768x192)

I have thought a lot about this error but cannot work out its cause. How can I solve the problem?

dtr43

1 Answer


The input tensor has shape [batch=16, channels=3, frames=16, H=224, W=224], while the Rearrange pattern expects dimensions in the order [b t c h w]. The pattern therefore reads frames where it expects channels: the last dimension becomes (p1 * p2 * c) = 16 * 16 * 16 = 4096 instead of the 768 that nn.Linear(patch_dim, dim) expects, and the collapsed leading dimensions give 16 * 3 * 196 = 9408 rows, which matches the "9408x4096 and 768x192" in the error message.

Please try to align positions of channels and frames:

import torch
from torch import nn
from einops.layers.torch import Rearrange

patch_size = 16
dim = 192

b, f, c, h, w = 16, 16, 3, 224, 224
input_tensor = torch.randn(b, f, c, h, w)

patch_dim = c * patch_size ** 2

m = nn.Sequential(
    Rearrange('b t c (h p1) (w p2) -> b t (h w) (p1 p2 c)', p1=patch_size, p2=patch_size),
    nn.Linear(patch_dim, dim)
)

print(m(input_tensor).size())

Output:

torch.Size([16, 16, 196, 192])
Markus