
Using the code below (in a Google Colab), I noticed that multiplying a 62x62 matrix with another 62x62 matrix is about 10% slower than multiplying a 64x64 matrix with another 64x64 matrix. Why is this?

import torch
import timeit

a, a2 = torch.randn((62, 62)), torch.randn((62, 62))
b, b2 = torch.randn((64, 64)), torch.randn((64, 64))

def matmuln(c, d):
    return c.matmul(d)

print(timeit.timeit(lambda: matmuln(a, a2), number=1000000)) # 13.864160071000015
print(timeit.timeit(lambda: matmuln(b, b2), number=1000000)) # 12.539578468999991
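
For what it's worth, timeit here spends a fair amount of each iteration on Python call overhead around a very small matmul; a lower-overhead way to repeat the same comparison is torch.utils.benchmark, as in this sketch (the iteration count is an arbitrary choice, not taken from the run above):

import torch
import torch.utils.benchmark as benchmark

a, a2 = torch.randn(62, 62), torch.randn(62, 62)
b, b2 = torch.randn(64, 64), torch.randn(64, 64)

# Timer runs the statement directly and reports per-invocation statistics,
# avoiding most of the lambda/function-call overhead of plain timeit.
t62 = benchmark.Timer(stmt="a.matmul(a2)", globals={"a": a, "a2": a2})
t64 = benchmark.Timer(stmt="b.matmul(b2)", globals={"b": b, "b2": b2})

print(t62.timeit(100000))  # 62x62 @ 62x62
print(t64.timeit(100000))  # 64x64 @ 64x64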
Robin van Hoorn
  • @DarkKnight usually naive multiplication of matrices that are a power of 2 will be significantly slower [Why is my program slow when looping over exactly 8192 elements?](https://stackoverflow.com/q/12264970/995714), [Why is there huge performance hit in 2048x2048 versus 2047x2047 array multiplication?](https://stackoverflow.com/q/6060985/995714), [Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?](https://stackoverflow.com/q/11413855/995714). For cache-aware algorithms it may be different – phuclv May 12 '23 at 11:21
  • @phuclv Interesting, except that the OP reports (and I concur with my own tests) that shapes that are powers of 2 are **faster** than similarly sized/shaped objects – DarkKnight May 12 '23 at 11:36
  • idea: torch switches between two different implementations of matmul, depending on the size of the input matrices, and boundary is at 64*64. – dankal444 May 12 '23 at 13:11
  • 1
    I checked other sizes, and it seems that it is faster if it is multiples of 4, not just powers of 2. I.e., 4n is always faster than 4n-1. Maybe this is due to SIMD. – ken May 12 '23 at 16:25
  • @DarkKnight that's why I said "usually naive multiplication" and "cache-aware algorithms" – phuclv May 15 '23 at 17:46
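
To probe the "multiples of 4" observation from the comments, a quick sweep over neighbouring sizes could look like this sketch (the size list, the single-thread setting, and the iteration count are arbitrary choices, not from the original post):

import torch
import timeit

torch.set_num_threads(1)  # keep the comparison single-threaded to reduce scheduling noise

# Time square matmuls for sizes just below and at multiples of 4 / powers of 2,
# to see whether 4n is consistently faster than 4n-1 as suggested above.
for n in (61, 62, 63, 64, 65, 96, 127, 128):
    x, y = torch.randn(n, n), torch.randn(n, n)
    t = timeit.timeit(lambda: x.matmul(y), number=100000)
    print(f"{n}x{n}: {t:.3f} s")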

0 Answers