How to use CUDA to optimize my matrix multiplication code

Question

I am trying to see how I can optimize my matrix multiplication code using CUDA. Currently, I am using jit, and the code takes 346 ms to run.

I appreciate any help you can give me!

Error:

TypeError: __init__() got an unexpected keyword argument 'py_func'

My code:

import numpy as np
import cupy as cp
from numba import cuda, float32, vectorize, jit, prange


matrix1 = cp.random.uniform(1,10,size=(1000,1000), dtype=np.float64)
matrix2 = cp.random.uniform(1,10, size=(1000,1000), dtype=np.float64)
rmatrix = cp.zeros(shape=(1000,1000), dtype=np.float64)

#multiplication function

@jit(target='cuda')
def gpu_matrix_multiplication(matrix1,matrix2,rmatrix):
  for i in prange(len(matrix1)):
    for j in prange(len(matrix2)):
      for k in prange(len(matrix2)):
        rmatrix[i][j] += matrix1[i][k] * matrix2[k][j]

# Calculate running time
%timeit gpu_matrix_multiplication(matrix1,matrix2,rmatrix)

How can you have a report on how long it takes to run, and also an error, unless you're timing how long it takes to throw the error? — roganjosh, Apr 30 '22 at 19:55
Please *do not write your own matrix multiplication* in CUDA. Use cuBLAS instead! The code you write will almost certainly be much slower than cuBLAS (especially the code provided, which is a very inefficient naive implementation). Moreover, besides the error mentioned, the current code has a *race condition* resulting in *wrong results*! On the CPU, please use BLAS rather than such an implementation. BLAS libraries have been optimized for decades by clever people to achieve nearly optimal performance. — Jérôme Richard, Apr 30 '22 at 20:19
@roganjosh Sorry for the vague explanation. I ran the same program with just the `@jit` decorator and it took 346 ms to execute. I was trying to see if I could speed it up by using CUDA. — Qi'ra, Apr 30 '22 at 20:40
Besides this, AFAIK the `target='cuda'` keyword has been deprecated for a long time and should do nothing here. This may be the source of the error. Additionally, CUDA code needs to be written differently from basic CPU code. For more information, please read the documentation: https://numba.readthedocs.io/en/stable/cuda/index.html — Jérôme Richard, Apr 30 '22 at 20:43
Did you try `rmatrix = matrix1 @ matrix2`? It should use a BLAS function internally. To use cuBLAS, consider using CuPy (with the same line). It should be much faster than your solution, simpler, and correct. — Jérôme Richard, Apr 30 '22 at 20:45
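As a minimal sketch of the `@` suggestion, assuming NumPy arrays on the CPU: the helper name `naive_matmul` and the 32x32 size are illustrative choices, not taken from the question. It only checks that the one-line product agrees with a sequential triple loop.

```python
import numpy as np

def naive_matmul(a, b):
    """Hypothetical reference: sequential triple loop with a local accumulator."""
    n, m = a.shape
    m2, p = b.shape
    if m != m2:
        raise ValueError("inner dimensions must match")
    out = np.zeros((n, p), dtype=a.dtype)
    for i in range(n):
        for j in range(p):
            acc = 0.0
            for k in range(m):
                acc += a[i, k] * b[k, j]
            out[i, j] = acc
    return out

rng = np.random.default_rng(0)
a = rng.uniform(1, 10, size=(32, 32))
b = rng.uniform(1, 10, size=(32, 32))

fast = a @ b               # NumPy typically hands this to its BLAS library
slow = naive_matmul(a, b)  # same arithmetic, one element at a time
print(np.allclose(fast, slow))
```

The small size keeps the pure-Python loop quick; the `@` line itself is unchanged for 1000x1000 inputs.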
@JérômeRichard For my assignment, I need to use CUDA to optimize my sequential Python code. Any advice on how to change my code so that it works with CUDA? Also, thank you: CuPy was faster, and I got 133 µs per loop! — Qi'ra, Apr 30 '22 at 20:56
[Here](https://numba.pydata.org/numba-doc/dev/cuda/examples.html) is a documented example of how to do matrix multiplication with Numba/CUDA. — Robert Crovella, Apr 30 '22 at 21:00
@Qi'ra Note that CuPy does lazy computation, so what you measure may not be the time to compute the result but only the time to prepare the work on the GPU. You need to force the computation to be done eagerly for measurements. 133 µs seems a bit too small to me to compute the input matrix (unless you use a new, very fast, expensive server GPU). — Jérôme Richard, Apr 30 '22 at 21:20
