
According to a table in the paper linked below, numpy's np.dot performance is comparable to a CUDA implementation of matrix multiplication in experiments with 320x320 matrices. I was able to reproduce the reported speedup for np.dot on my machine fairly closely. Their CUDA code using Numba, however, ran much slower for me, with a speedup of about 1200 instead of the 49258 reported.

Why is numpy's implementation so fast?

https://link.springer.com/article/10.1007/s11227-017-2213-5

Edit: here is the code taken from the paper; I only added the timeit calls. I ran it on the laptop described below.

CUDA

import numpy as np
from numba import cuda

# Naive matrix-multiplication kernel: each thread computes one element of C.
@cuda.jit('void(float64[:, :], float64[:, :], float64[:, :], int32)')
def cu_matmul(a, b, c, n):
    x, y = cuda.grid(2)
    if (x >= n) or (y >= n):
        return
    c[x, y] = 0
    for i in range(n):
        c[x, y] += a[x, i] * b[i, y]

device = cuda.get_current_device()
tpb = device.WARP_SIZE          # threads per block dimension (32)
n = 320
bpg = (n + tpb - 1) // tpb      # blocks per grid dimension
grid_dim = (bpg, bpg)
block_dim = (tpb, tpb)

A = np.random.random((n, n)).astype(np.float64)
B = np.random.random((n, n)).astype(np.float64)
C = np.empty((n, n), dtype=np.float64)

# Copy the inputs to the GPU; allocate the output there without copying.
dev_A = cuda.to_device(A)
dev_B = cuda.to_device(B)
dev_C = cuda.to_device(C, copy=False)

# Launch the kernel (asynchronous, returns None) and copy the result back.
cu_matmul[grid_dim, block_dim](dev_A, dev_B, dev_C, n)
dev_C.copy_to_host(C)
assert np.allclose(np.dot(A, B), C)

Numpy

np.dot(A, B)
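
The timeit calls are not reproduced above. The sketch below shows one way both versions can be timed; the repeat count is arbitrary, and the cuda.synchronize() call is included because kernel launches are asynchronous (these details are my assumptions, not taken from the paper):

import timeit

def run_numpy():
    np.dot(A, B)

def run_cuda():
    cu_matmul[grid_dim, block_dim](dev_A, dev_B, dev_C, n)
    cuda.synchronize()   # wait for the asynchronous kernel launch to finish

reps = 100
t_numpy = timeit.timeit(run_numpy, number=reps) / reps
t_cuda = timeit.timeit(run_cuda, number=reps) / reps
print(f"np.dot:      {t_numpy * 1e6:.1f} microseconds per call")
print(f"CUDA kernel: {t_cuda * 1e6:.1f} microseconds per call")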

System specs

[screenshot of laptop specifications]

bwdm
  • I'm voting to close this question as off-topic because there is no particular code problem, with example code, to solve – mtrw Oct 16 '19 at 00:39
  • So.... many... variables here.. What code was used? In what machine? What are the specifications? Does the person that wrote the code know what they're doing (i.e. are they fluent in python/numpy/CUDA)? What is the test objectively? It is so easy to fall in the trap of running and measuring _an implementation that you built yourself_ and then generalizing the results as a behavior for the whole programming language itself.. – rafaelc Oct 16 '19 at 01:26
  • Ok, I added the code, system info and changed the title slightly. That generalization was unintentional. – bwdm Oct 16 '19 at 10:47
  • Possible duplicate of [How does BLAS get such extreme performance?](https://stackoverflow.com/questions/1303182/how-does-blas-get-such-extreme-performance) – norok2 Oct 16 '19 at 12:10
  • I don't think this can be a duplicate of that question because the subjects are different, unless you happen to know the implementation details of Numpy. But a valid answer could state that Numpy uses BLAS and then point to that question. – bwdm Oct 16 '19 at 13:29

2 Answers


Aside from what @norok2 links to, there is the large overhead of transferring the data to the GPU. This becomes significant in several cases:

  • the computation done on the GPU is comparable in cost to the data-transfer overhead, i.e. you only perform a single operation on less than a MB of data;
  • your problem doesn't scale well, i.e. the data size or the structure of the underlying problem doesn't allow the GPU to use its parallel processing capacity sufficiently;
  • there are too many branches in your parallel code. This usually means a whole group of parallel processors has to wait on each branch (branching hardware is typically shared per group of X arithmetic processors on a GPU), slowing down the whole computation.

The first two points apply here: 320x320 is not extremely large, and a single multiplication is the only thing you're doing. CPUs are far from being made obsolete by GPUs, and cases like this prove exactly that; the rough estimate below illustrates it.
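
As a back-of-the-envelope illustration (the PCIe bandwidth figure here is an assumed ballpark, not a measurement of any particular system):

n = 320
flops = 2 * n**3                 # multiply-add count for a naive n x n matmul
bytes_moved = 3 * n * n * 8      # A, B and C as float64, moved across PCIe
pcie_bandwidth = 12e9            # assumed effective PCIe throughput, bytes/s
transfer_time = bytes_moved / pcie_bandwidth
print(f"{flops / 1e6:.0f} MFLOP of compute vs {bytes_moved / 1e6:.1f} MB of transfer")
print(f"data transfer alone takes roughly {transfer_time * 1e6:.0f} microseconds")

For a problem this small, moving the data can easily cost as much as, or more than, the arithmetic itself.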

rubenvb

NumPy is so fast because it uses a highly optimized BLAS library which is likely to be using the SIMD instructions of your CPU.
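
You can check which BLAS implementation your NumPy build is linked against (the exact output depends on how NumPy was installed):

import numpy as np

# Prints NumPy's build configuration, including the BLAS/LAPACK
# libraries it was linked against (e.g. OpenBLAS or MKL).
np.show_config()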

norok2