
Using the code below (in a Google Colab), I noticed that multiplying a 62x62 matrix with another 62x62 matrix is about 10% slower than multiplying a 64x64 matrix with another 64x64 matrix. Why is this?

import torch
import timeit

a, a2 = torch.randn((62, 62)), torch.randn((62, 62))
b, b2 = torch.randn((64, 64)), torch.randn((64, 64))

def matmuln(c, d):
    return c.matmul(d)

print(timeit.timeit(lambda: matmuln(a, a2), number=1000000)) # 13.864160071000015
print(timeit.timeit(lambda: matmuln(b, b2), number=1000000)) # 12.539578468999991
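
For what it's worth, timeit here spends a fair amount of each iteration on Python call overhead around a very small matmul; a lower-overhead way to repeat the same comparison is torch.utils.benchmark, as in this sketch (the iteration count is an arbitrary choice, not taken from the run above):

import torch
import torch.utils.benchmark as benchmark

a, a2 = torch.randn(62, 62), torch.randn(62, 62)
b, b2 = torch.randn(64, 64), torch.randn(64, 64)

# Timer runs the statement directly and reports per-invocation statistics,
# avoiding most of the lambda/function-call overhead of plain timeit.
t62 = benchmark.Timer(stmt="a.matmul(a2)", globals={"a": a, "a2": a2})
t64 = benchmark.Timer(stmt="b.matmul(b2)", globals={"b": b, "b2": b2})

print(t62.timeit(100000))  # 62x62 @ 62x62
print(t64.timeit(100000))  # 64x64 @ 64x64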
Robin van Hoorn
  • @DarkKnight usually naive multiplication of matrices that are a power of 2 will be significantly slower [Why is my program slow when looping over exactly 8192 elements?](https://stackoverflow.com/q/12264970/995714), [Why is there huge performance hit in 2048x2048 versus 2047x2047 array multiplication?](https://stackoverflow.com/q/6060985/995714), [Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?](https://stackoverflow.com/q/11413855/995714). For cache-aware algorithms it may be different – phuclv May 12 '23 at 11:21
  • @phuclv Interesting, except that the OP reports (and I concur with my own tests) that shapes that are powers of 2 are **faster** than similarly sized/shaped objects – DarkKnight May 12 '23 at 11:36
  • idea: torch switches between two different implementations of matmul, depending on the size of the input matrices, and boundary is at 64*64. – dankal444 May 12 '23 at 13:11
  • 1
    I checked other sizes, and it seems that it is faster if it is multiples of 4, not just powers of 2. I.e., 4n is always faster than 4n-1. Maybe this is due to SIMD. – ken May 12 '23 at 16:25
  • @DarkKnight that's why I said "usually naive multiplication" and "cache-aware algorithms" – phuclv May 15 '23 at 17:46
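
To probe the "multiples of 4" observation from the comments, a quick sweep over neighbouring sizes could look like this sketch (the size list, the single-thread setting, and the iteration count are arbitrary choices, not from the original post):

import torch
import timeit

torch.set_num_threads(1)  # keep the comparison single-threaded to reduce scheduling noise

# Time square matmuls for sizes just below and at multiples of 4 / powers of 2,
# to see whether 4n is consistently faster than 4n-1 as suggested above.
for n in (61, 62, 63, 64, 65, 96, 127, 128):
    x, y = torch.randn(n, n), torch.randn(n, n)
    t = timeit.timeit(lambda: x.matmul(y), number=100000)
    print(f"{n}x{n}: {t:.3f} s")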

0 Answers