
I am exploring numba's CUDA extension, and looked at an example implementation of a matrix multiply on CUDA. The code is from numba's web site.

I then benchmarked it against what I thought was a less optimal implementation: numpy's dot function, multiplying two 1024x1024 matrices (generated with randn(1024,1024)).

Results:

  • CUDA 40 ms per multiply,
  • numpy 5 ms per multiply.

If numpy's algorithm is the naive matrix multiply, then it should need 1024^3 ~ 1e9 multiplies and adds. That's an average throughput of one operation per 5ms/1e9 = 5 picoseconds. My CPU runs at approx 3.4 GHz, so each cycle takes 300 picoseconds.

So here is my question: how can numpy's matrix multiply be 60 times faster than a naive one?

I have heard of Strassen's algorithm, which has a complexity of approximately N^2.8; that would bring the average down to about 20 picoseconds per multiply-add, still roughly 15 times faster than one operation per CPU cycle.
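
As a sanity check on that arithmetic, here is the same back-of-the-envelope calculation as a tiny Python snippet (the 5 ms timing and 3.4 GHz clock are simply the numbers quoted above):

# Back-of-the-envelope check of the numbers quoted above.
N = 1024
t_numpy = 5e-3                        # measured numpy time per multiply (s)
cpu_hz = 3.4e9                        # assumed CPU clock frequency

naive_ops = N**3                      # multiply-adds for the naive algorithm
strassen_ops = N**2.8                 # rough count for Strassen-like scaling

print(t_numpy / naive_ops * 1e12)     # ~4.7 ps per multiply-add (naive count)
print(t_numpy / strassen_ops * 1e12)  # ~18.6 ps per multiply-add (Strassen count)
print(1 / cpu_hz * 1e12)              # ~294 ps per clock cycle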

Edit:

  1. Definition of the CUDA kernel
from numba import cuda, float32

TPB = 16

@cuda.jit()
def fast_matmul(A, B, C):
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    if x >= C.shape[0] and y >= C.shape[1]:
        # Quit if (x, y) is outside of valid C boundary
        return

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = 0.
    for i in range(bpg):
        # Preload data into shared memory
        sA[tx, ty] = A[x, ty + i * TPB]
        sB[tx, ty] = B[tx + i * TPB, y]

        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Computes partial product on the shared memory
        for j in range(TPB):
            tmp += sA[tx, j] * sB[j, ty]

        # Wait until all threads finish computing
        cuda.syncthreads()

    C[x, y] = tmp
  2. CUDA call
import math
import numpy as np
from numpy import zeros
from numpy.random import randn

N=1024
a=randn(N,N).astype(np.float32)
b=a.T.copy()
c=zeros((N,N),dtype=np.float32)

threadsperblock = (16, 16)
blockspergrid_x = math.ceil(a.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(a.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)

fast_matmul[blockspergrid, threadsperblock](a,b,c) # takes 40ms

Meanwhile, with numpy:

c=a.dot(b) # takes 5ms

I did not think that memory throughput would be the bottleneck, since the matrices contain only millions of elements while the computation requires billions of operations.

As asked by one commenter, the arrays are 32-bit (single precision) floats.

EDIT 2:

I understand that a 3 GHz CPU cannot perform a single operation in 5 ps, so obviously I meant average throughput. Since that throughput is about 30 times better than an estimate assuming one multiply and one add per cycle, the implementation must be heavily optimised, using

  • vector operations (SSE or AVX I presume)
  • maybe also work spread over multiple cores/CPUs (a way to check these guesses is sketched just after this list).
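
For reference, this is the kind of diagnostic sketch one could use to check those guesses; np.show_config() is standard numpy, while threadpoolctl is an optional third-party package and was not part of the benchmark above:

# Diagnostic sketch: inspect numpy's BLAS backend and its thread pools.
import numpy as np

np.show_config()   # prints the BLAS/LAPACK libraries numpy was built against

# Optional: threadpoolctl (if installed) reports the thread pools BLAS actually uses.
try:
    from threadpoolctl import threadpool_info
    for pool in threadpool_info():
        print(pool["internal_api"], pool.get("num_threads"))
except ImportError:
    pass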

For CUDA, the base type I tested was single precision float, assuming beforehand that GPUs' sweet spot is single or half precision.

  • The compilation time of the CUDA function had no substantial impact: the benchmarking macro (%%timeit) was not sensitive to it. However,
  • I thought the transfer between main memory and GPU memory should not affect the performance figures substantially, since the amount of data transferred is several orders of magnitude smaller than the number of computations performed.
  • I tried to verify that by transferring the arrays to the CUDA device beforehand, and then ... whoosh, the compute time went from 40 ms down to 144 µs. So the host <-> device transfer is massively expensive. Thanks @talonmies for pointing that out. (A sketch of this pre-transfer setup follows the list.)
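
A minimal sketch of that pre-transfer setup, reusing the variable names from the code above (this is not the exact %timeit cell I ran, just the idea):

# Sketch: move the arrays to the GPU once, so that timing the kernel call
# no longer includes host <-> device copies or device allocations.
from numba import cuda

a_d = cuda.to_device(a)            # one-off host -> device copies
b_d = cuda.to_device(b)
c_d = cuda.device_array_like(c)    # output buffer allocated on the device

# Warm-up call so JIT compilation is not part of the measurement.
fast_matmul[blockspergrid, threadsperblock](a_d, b_d, c_d)
cuda.synchronize()

# In a notebook:
# %timeit fast_matmul[blockspergrid, threadsperblock](a_d, b_d, c_d); cuda.synchronize()

c = c_d.copy_to_host()             # single device -> host copy at the end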

The takeaway message for me, though, is that CPU code can now be compiled with extremely aggressive optimisations (vector ops, multi-threading), and it can in some cases be hard to beat that with a GPU, even for computations with a very regular pattern. Transfer to/from the device can be prohibitively expensive, even if you expect to perform 1000 or so computations per input sample.

Last but not least, I'd like to thank @user2640045 for their benchmark and curve fit, which seems to show that the computation scales as O(N^3), so numpy appears to use the straightforward cubic matrix product algorithm, just heavily optimised.

  • Your timings are undoubtedly wrong. Without seeing code, your Numba execution time almost certainly contains compilation and host<->device data transfers. And your limited description of the data suggests you are using double precision, which can put the GPU at an enormous throughput disadvantage, depending on what hardware you use. – talonmies Aug 12 '21 at 10:52
  • The reason we don't use the GPU for _everything_ is that there are cases where it's not fastest. talonmies mentioned host<->device data transfers; this tends to be quite a bottleneck in many GPU calls, as for small data, starting up the GPU, copying the data in and out, etc. can be hundreds of times slower than the maths. – Ander Biguri Aug 12 '21 at 10:57
  • numpy most certainly does not use a naive matmul implementation, and possibly something faster than Strassen (I read about this some time ago but can't remember in detail which algorithm is used; the source code should be available). In addition, it will use SIMD instruction sets, and probably multiple threads at the same time. – geebert Aug 12 '21 at 12:28
  • @geebert: The current answer shows that it scales with N^3, so it's presumably not using Strassen. So probably *algorithmically* simple, but implementation-wise obviously highly optimized in terms of SIMD and cache-blocking, and perhaps also parallelized to multiple cores. (Which would invalidate some of the "5 picosecond" calculation, which is already weird because the operations are not serialized; instead multiple operations are in flight in 2x 256-bit SIMD pipelined FMA units, assuming a modern x86.) Agreed that "naive" is not a good description of it. Perhaps "carefully-applied brute force". – Peter Cordes Aug 12 '21 at 15:45
  • I asked to reopen the question. I copied and pasted the code, which was actually from numba.pydata.org (the link provided), along with the numpy dot method call, plus the code showing that the base type actually was single precision float. – Chris Aug 12 '21 at 15:47
  • Related re: the nonsense of saying "each operation **takes** 5ps" [How many CPU cycles are needed for each assembly instruction?](https://stackoverflow.com/a/44980899) - that's not how CPU performance works. Your *average throughput* can be one per 5ps, but there's no way you could do one FP math operation in 5 picoseconds on a CPU where the cycle time is much longer, and achieving that throughput involves doing 8 FMAs per asm instruction, with 4 cycle latency and 2/clock throughput (e.g. Skylake), per core. – Peter Cordes Aug 12 '21 at 16:03
  • OK, now that we can see your code, my hypothesis is correct. As its name suggests, just-in-time compilation means that the function call includes compilation and all the API overhead associated with getting code on the device, and allocating memory for and copying data to and from the GPU. Run the code using pre-allocated GPU buffers and time the *second invocation*. You will observe significant differences. – talonmies Aug 13 '21 at 02:49
  • @talonmies: that sounds like something you could post as an answer to this [benchmarking] question, especially if we retitle to simply ask why GPU multiplies are slower, instead of why numpy achieves close to the max CPU FMA throughput. – Peter Cordes Aug 13 '21 at 12:55

2 Answers


Why are CUDA GPU matrix multiplies slower than numpy?

The really short answer is that they probably are not.

How is numpy so fast?

Because it generally uses a state-of-the-art, platform optimized (often multithreaded) BLAS implementation under the hood for the GEMM operation.

Your confusion about what you have measured stems, as best I can tell, from some misconceptions about how Numba works, how numpy works, and how GPUs and CPUs work. Let's take a mild modification of your code, instrumented to measure time and convert it to GFLOP/s for the GEMM problem you are studying. It runs four cases, 10 times each:

  1. Your original case of a Numba kernel with source and destination arrays on the host (so device memory allocation, host-to-device and device-to-host copy penalties are all incurred in the execution time).
  2. The same Numba kernel with source and destination arrays already on the device (so no allocation or copying in the timed region).
  3. The problem run on the host using numpy (which uses a state-of-the-art host BLAS SGEMM implementation under the hood).
  4. The problem run using cupy with source and destination on the device (which uses a state-of-the-art GPU BLAS SGEMM implementation under the hood).

The code looks like this:

from numba import cuda, float32
import numpy as np
import time
import math
import cupy

TPB = 16

@cuda.jit()
def fast_matmul(A, B, C):
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    if x >= C.shape[0] and y >= C.shape[1]:
        # Quit if (x, y) is outside of valid C boundary
        return

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = 0.
    for i in range(bpg):
        # Preload data into shared memory
        sA[tx, ty] = A[x, ty + i * TPB]
        sB[tx, ty] = B[tx + i * TPB, y]

        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Computes partial product on the shared memory
        for j in range(TPB):
            tmp += sA[tx, j] * sB[j, ty]

        # Wait until all threads finish computing
        cuda.syncthreads()

    C[x, y] = tmp

N=1024
a=np.random.randn(N,N).astype(np.float32)
b=a.T.copy()
c=np.zeros((N,N),dtype=np.float32)

threadsperblock = (16, 16)
blockspergrid_x = math.ceil(a.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(a.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)

a_d=cuda.to_device(a)
b_d=cuda.to_device(b)
c_d=cuda.to_device(c)

gfcalc = lambda t0,t1: 1e-9 * (N * N * (2*N)) / (t1-t0)

for i in range(0,10):
  t0 = time.time()
  fast_matmul[blockspergrid, threadsperblock](a, b, c)
  cuda.synchronize()
  t1 = time.time()
  print("Numba, host source & dest", gfcalc(t0,t1))

for i in range(0,10):
  t0 = time.time()
  fast_matmul[blockspergrid, threadsperblock](a_d, b_d, c_d)
  cuda.synchronize()
  t1 = time.time()
  print("Numba, device source & dest", gfcalc(t0,t1))

for i in range(0,10):
  t0 = time.time()
  c_h = a.dot(b)
  t1 = time.time()
  print("Numpy", gfcalc(t0,t1))

with cupy.cuda.Device() as dev:
  a_c = cupy.asarray(a)
  b_c = cupy.asarray(b)
  c_c = cupy.asarray(c)
  for i in range(0,10):
    t0 = time.time()
    a_c.dot(b_c, out=c_c)
    dev.synchronize()
    t1 = time.time()
    print("cupy device source & dest", gfcalc(t0,t1))

The results are printed in GFLOP/s for each run separately, so we can see warm-up effects. Run on a Google Colab session of unknown provenance:

Numba, host source & dest 7.157921582274139
Numba, host source & dest 35.41623777048565
Numba, host source & dest 41.53596793561073
Numba, host source & dest 40.067434107237034
Numba, host source & dest 41.853653713591996
Numba, host source & dest 49.797096688049365
Numba, host source & dest 73.13115945878288
Numba, host source & dest 70.30009174431994
Numba, host source & dest 74.86845532463607
Numba, host source & dest 76.87160119090733
Numba, device source & dest 105.97453061087833
Numba, device source & dest 105.56095086831826
Numba, device source & dest 105.78037879907214
Numba, device source & dest 106.06063296721804
Numba, device source & dest 106.87105343720403
Numba, device source & dest 104.89221337519061
Numba, device source & dest 106.34112058583715
Numba, device source & dest 103.33148924767107
Numba, device source & dest 106.94211047481143
Numba, device source & dest 104.29586223965393
Numpy 81.5965580615561
Numpy 70.34621140682276
Numpy 58.2601841797442
Numpy 79.16676998234227
Numpy 81.37246257365994
Numpy 83.9362524903643
Numpy 76.91820953485447
Numpy 68.72629315606706
Numpy 56.53278637482029
Numpy 76.50855577892254
cupy device source & dest 3298.132279290001
cupy device source & dest 2730.281677702635
cupy device source & dest 4824.423810787891
cupy device source & dest 4698.591160532599
cupy device source & dest 3971.4282428311253
cupy device source & dest 4943.578076147636
cupy device source & dest 3046.0599441126114
cupy device source & dest 2642.1822395837467
cupy device source & dest 3298.132279290001
cupy device source & dest 5144.0315561056495

What do we see:

  1. The first Numba kernel call is extremely slow because it includes JIT compilation and the GPU API overhead of getting code onto the GPU. After that we see a steady 40-odd GFLOP/s, which increases to about 75 GFLOP/s after a few iterations. This is likely due to GPU clock frequency spin-up as the driver detects the GPU is under load.
  2. The same Numba kernel is about 50% faster when the memory allocation, host-to-device transfers, and device-to-host transfers are removed from the kernel execution timing, hitting about 105 GFLOP/s.
  3. The numpy dot call (which uses a highly optimized host BLAS GEMM) hits about 75 GFLOP/s on average, which is slower than the Numba kernel on the GPU once the memory allocation and transfer overheads are excluded.
  4. The cupy dot call (which uses a highly optimized GPU BLAS GEMM) hits about 4000 GFLOP/s on average, i.e. about 50 times faster than numpy run on the host. This is a true reflection of the gap in peak floating point throughput between a compute GPU and a modern x86-64 CPU.

So in conclusion, your assumptions about what is going on are not correct, and that is leading you to some misleading conclusions. Numpy's dot is a high-performance, highly optimized implementation, but it is a lot slower than a dedicated compute GPU running a state-of-the-art BLAS implementation. Correct benchmarking is a science unto itself, and it is easy to come to erroneous conclusions unless you fully understand what you are doing.
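
As an aside on that benchmarking point: instead of wrapping time.time() around a kernel launch plus cuda.synchronize(), one can time with CUDA events, which report elapsed GPU time directly. The snippet below is only an illustrative sketch reusing a_d, b_d, c_d and the launch configuration from the code above; it is not part of the measurements reported here:

# Sketch: timing the kernel with CUDA events instead of time.time().
from numba import cuda

start = cuda.event(timing=True)
end = cuda.event(timing=True)

start.record()
fast_matmul[blockspergrid, threadsperblock](a_d, b_d, c_d)
end.record()
end.synchronize()

elapsed_ms = cuda.event_elapsed_time(start, end)   # elapsed GPU time in milliseconds
print(elapsed_ms * 1e-3, "seconds")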

talonmies
  • Thanks for this insight. As I said, the compile time for the CUDA code did not affect my benchmarks because I did several runs. The code I gave was simplified for legibility. My point was not to say bad things about CUDA or offend anyone (which I feel I did); it was to underline that the CPU version of numpy's multiply is heavily optimised. – Chris Aug 19 '21 at 07:42

I don't believe your calculations with big O notation are valid. Keep in mind that big O only gives an asymptotic complexity, meaning that eventually the time needed grows no faster than, e.g., n^2.8 times a constant. It does not say anything about how fast or slow the algorithm is for small inputs.
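
To illustrate why the constant factors matter, here is a toy comparison with made-up constants (they are purely hypothetical, chosen only to show that a better exponent does not imply a faster runtime at practical sizes):

# Toy example with made-up constants: an O(n**2.8) algorithm with a large
# constant factor is slower than a plain n**3 one until a huge crossover size.
c_naive = 1e-9       # hypothetical seconds per naive "operation"
c_strassen = 1e-6    # hypothetical (much larger) constant for the n**2.8 method

def t_naive(n):
    return c_naive * n**3

def t_strassen(n):
    return c_strassen * n**2.8

# Crossover is near n = (c_strassen / c_naive)**5 = 1e15, far beyond any
# practical matrix size, so the "worse" n**3 method wins at realistic sizes.
for n in (1024, 10**6, 10**16):
    print(n, "naive faster:", t_naive(n) < t_strassen(n))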

With this in mind I got interested in how fast the runtime actually grows, say for square matrices of size 1x1 up to 2^12 x 2^12. So I ran it and fit a curve to the resulting times. Note that I am using Jupyter line magic to measure the time, so this is not pure Python code.

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

results = []
n = 12
x = np.linspace(0,2**n,10)

for i in x:
    print(np.log2(i))
    size = np.round(i).astype('int')
    A = np.random.random((size, size))
    B = np.random.random((size, size))
    if i < 10:
        timeit = %timeit -n 100 -o A@B
    else:
        timeit = %timeit -n 1 -o A@B
    results.append(timeit.best)

results = np.array(results)
x2 = np.linspace(x.min(),x.max(), 10**4)

def func(x, a, b, n):
    return a*x**n+b

popt, pcov = curve_fit(func, x, results, p0=[results[0], 0, 3])

plt.figure()
plt.plot(x2, func(x2, *popt), 'r-')
plt.scatter(x,results)
plt.xlabel('nxn matrices')
plt.ylabel('seconds')
popt

[Plot: measured runtime (seconds) vs. matrix size n, with the fitted curve a*n^k + b]

array([ 1.55232100e-11, -8.11662366e-04,  3.00821421e+00])

Meaning our approximation is 1.55232100e-11 * x^3.00821421 - 8.11662366e-04. Though, as I said before, in principle this leaves all big O classes open, since we only looked at a finite set of runtimes, I think it is fair to say this suggests the implementation is in O(n^3).
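
As a cross-check on that exponent, one can also fit a straight line in log-log space, where results ≈ a * x^n becomes a line of slope n; this sketch assumes the x and results arrays from the code above:

# Cross-check sketch: a linear fit in log-log space recovers the exponent n.
import numpy as np

mask = x > 0                           # skip the 0x0 matrix entry
slope, intercept = np.polyfit(np.log(x[mask]), np.log(results[mask]), 1)
print("estimated exponent:", slope)    # expect a value close to 3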

Lukas S