I am discovering numba's cuda extension, and looked at an example implementation of a matrix multiply on CUDA. The code is on numba's web site.
I then benchmarked it against what I thought was a less optimal implementation: numpy's dot function, multiplying two 1024x1024 matrices (generated with randn(1024, 1024)).
Results:
- CUDA 40 ms per multiply,
- numpy 5 ms per multiply.
If numpy's algorithm is the naive matrix multiply, then it should need 1024^3 ~ 1e9 multiplies and adds. That's an average throughput of one operation per 5ms/1e9 = 5 picoseconds. My CPU runs at approx 3.4 GHz, so each cycle takes 300 picoseconds.
So here is my question: how can numpy's matrix multiply be 60 times faster than a naive one?
I have heard of Strassen's algorithm, which has a complexity of roughly N^2.8; with that operation count, each multiply-and-add takes about 20 picoseconds on average, still roughly 15 times faster than one operation per CPU cycle.
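For concreteness, here is the back-of-the-envelope arithmetic above written out as a small script (the 3.4 GHz figure is the nominal clock rate quoted earlier):

```python
# Back-of-the-envelope estimate behind the numbers above.
N = 1024
t_numpy = 5e-3                  # measured numpy time, in seconds

naive_ops = N ** 3              # ~1e9 multiply-adds for the classical algorithm
strassen_ops = N ** 2.8         # ~2.7e8 for a Strassen-like algorithm

cycle = 1 / 3.4e9               # ~294 ps per CPU cycle at 3.4 GHz

print(f"naive:    {t_numpy / naive_ops * 1e12:.1f} ps per multiply-add")
print(f"strassen: {t_numpy / strassen_ops * 1e12:.1f} ps per multiply-add")
print(f"cycle:    {cycle * 1e12:.0f} ps per cycle")
```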
Edit:
- Definition of the CUDA kernel
```python
from numba import cuda, float32

TPB = 16

@cuda.jit()
def fast_matmul(A, B, C):
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    if x >= C.shape[0] or y >= C.shape[1]:
        # Quit if (x, y) is outside of valid C boundary
        return

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = 0.
    for i in range(bpg):
        # Preload data into shared memory
        sA[tx, ty] = A[x, ty + i * TPB]
        sB[tx, ty] = B[tx + i * TPB, y]

        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Computes partial product on the shared memory
        for j in range(TPB):
            tmp += sA[tx, j] * sB[j, ty]

        # Wait until all threads finish computing
        cuda.syncthreads()

    C[x, y] = tmp
```
- CUDA call

```python
import math
import numpy as np

N = 1024
a = np.random.randn(N, N).astype(np.float32)
b = a.T.copy()
c = np.zeros((N, N), dtype=np.float32)

threadsperblock = (16, 16)
blockspergrid_x = math.ceil(a.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(a.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)

fast_matmul[blockspergrid, threadsperblock](a, b, c)  # takes 40 ms
```
Meanwhile, with numpy:

```python
c = a.dot(b)  # takes 5 ms
```
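The timings were taken with the %%timeit magic (see the notes further down); a plain-Python equivalent, reusing the arrays defined above, would look roughly like this (`best_of` is just an illustrative helper, not part of the original benchmark):

```python
import time

def best_of(fn, repeats=10):
    # Best-of-N wall-clock timing, similar in spirit to %timeit.
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

# Note: passing host (numpy) arrays to the kernel makes numba copy them to and
# from the GPU on every call, so this measures the transfers plus the kernel.
print("cuda :", best_of(lambda: fast_matmul[blockspergrid, threadsperblock](a, b, c)))
print("numpy:", best_of(lambda: a.dot(b)))
```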
I do not think that data transfer throughput is the bottleneck, since each matrix holds about a million elements while the computation requires about a billion operations.
As asked by one commenter: the arrays are 32-bit floats.
EDIT 2:
I understand that a ~3 GHz CPU cannot perform a single operation in 5 ps; I obviously meant average throughput. Since that throughput is 15 to 60 times better than an estimate assuming one multiply-and-add per cycle, the implementation must be heavily optimised, using
- vector operations (SSE or AVX, I presume), and
- maybe also multiple cores/CPUs (a quick way to check numpy's BLAS backend is sketched after this list).
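As a side note, one way to see which BLAS library numpy was built against (OpenBLAS, MKL, ...), and therefore whether heavily vectorised, multi-threaded routines back a.dot(b), is a quick diagnostic like this (not part of the benchmark itself):

```python
import numpy as np

# Prints the BLAS/LAPACK libraries numpy was linked against
# (e.g. OpenBLAS or MKL), which use SIMD instructions and multiple threads.
np.show_config()

# The thread count used by the BLAS library can usually be capped by setting
# OMP_NUM_THREADS or MKL_NUM_THREADS before starting Python, which makes it
# easy to see how much multi-threading contributes to the 5 ms figure.
```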
For CUDA, the base type I tested was single-precision float, assuming beforehand that a GPU's sweet spot is single or half precision.
- The compilation time of the CUDA function had no substantial impact: the benchmarking magic (%%timeit) was not sensitive to the one-off JIT compilation. However,
- I thought the transfer between main memory and GPU memory would not affect the performance figures substantially, since the amount of data transferred is several orders of magnitude smaller than the number of computations.
- I tried to verify that by transferring the arrays to the CUDA device beforehand, and then... whoosh: the compute time went from 40 ms down to 144 µs. So the host <-> device memory transfer is massively expensive. Thanks @talonmies for pointing that out. (A sketch of the modified call is shown after this list.)
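For reference, the pre-transfer version can be set up roughly as follows (a sketch continuing the code above, not the exact cell I ran; `d_a`, `d_b`, `d_c` are just illustrative names):

```python
import time
import numpy as np
from numba import cuda

# Warm-up call: triggers the one-off JIT compilation so it is not timed below.
fast_matmul[blockspergrid, threadsperblock](a, b, c)
cuda.synchronize()

# Copy the inputs to the GPU once and allocate the output on the device,
# so the timed region contains only the kernel itself.
d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_c = cuda.device_array((N, N), dtype=np.float32)

start = time.perf_counter()
fast_matmul[blockspergrid, threadsperblock](d_a, d_b, d_c)
cuda.synchronize()              # kernel launches are asynchronous; wait for completion
print(f"kernel only: {(time.perf_counter() - start) * 1e6:.0f} µs")

c = d_c.copy_to_host()          # bring the result back only when it is needed
```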
The take-away message for me, though, is that compiled CPU code can now be extremely aggressively optimised (vector operations, multi-threading), and in some cases it can be hard to beat that with a GPU, even for computations with a very regular pattern. Transfers to and from the device can be prohibitively expensive, even when you expect to perform on the order of 1000 operations per transferred element.
Last but not least, I'd like to thank @user2640045 for their benchmark and log-log fit, which seems to show that the computation scales as O(N^3); numpy therefore appears to use the classical matrix product rather than an asymptotically faster algorithm such as Strassen's.
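For anyone who wants to reproduce that kind of scaling check, a minimal sketch (my own illustration, not @user2640045's exact code) could look like this:

```python
import time
import numpy as np

sizes = [256, 512, 1024, 2048]
times = []
for n in sizes:
    a = np.random.randn(n, n).astype(np.float32)
    b = np.random.randn(n, n).astype(np.float32)
    a.dot(b)                                    # warm-up
    start = time.perf_counter()
    a.dot(b)
    times.append(time.perf_counter() - start)

# Slope of log(time) vs log(N); a value close to 3 indicates O(N^3) scaling.
exponent = np.polyfit(np.log(sizes), np.log(times), 1)[0]
print(f"fitted exponent: {exponent:.2f}")
```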