
I'm trying to write a cuda.jit matrix multiplication with a bound on my number of thread blocks: it can be only one. I also know that my multiplication is of the form X*Xtranspose.

import os
import timeit
import numpy as np
from numba import cuda


def matmul_gpu(X, Y):
    # Allocate the output matrix in GPU memory using cuda.to_device
    #
    # invoke the dot kernel with 1 threadBlock with 1024 threads
    #
    # copy the output matrix from GPU to cpu using copy_to_host()
    gpu_mat1 = cuda.to_device(X)
    gpu_mat2 = cuda.to_device(Y)
    res = np.zeros(shape=(X.shape[0], Y.shape[1]), dtype=np.float32)
    gpu_mult_res = cuda.to_device(res)
    threads_per_block = 1024
    blocks_per_grid = 1
    matmul_kernel[blocks_per_grid, threads_per_block](
        gpu_mat1, gpu_mat2, gpu_mult_res)
    mult_res = gpu_mult_res.copy_to_host()
    return mult_res


@cuda.jit
def matmul_kernel(A, B, C):
    num_of_threads = cuda.gridsize(1)
    tid = cuda.grid(1)
    rows_num = A.shape[0]
    cols_num = A.shape[1]
    step = int(np.math.ceil(num_of_threads / cols_num))
    row = int(np.math.floor(tid / cols_num))
    col = int(tid % cols_num)
    for row_start_idx in range(0, rows_num, step):
        if row_start_idx + row < rows_num and col < cols_num:
            C[row_start_idx + row, col] += A[row_start_idx + row, tid] * B[tid, col] 

It crashes for matrices of dimensions 128x256 or 256x128, and it throws these errors, in this order, in the traceback:

...
 Call to cuMemcpyDtoH results in UNKNOWN_CUDA_ERROR
...
Call to cuMemFree results in UNKNOWN_CUDA_ERROR

It works for very large dimensions like 1024x2048 and 2048x1024, and it works great for inputs with the same dimensions, but for differing sizes it sometimes throws the errors mentioned above. It almost never throws errors for equal dimensions, except 256x256 which I just noticed, so it should be something related to those.

Code for debugging assistance:

# this is the comparison function - keep it as it is, don't change X or Y.
def matmul_comparison():
    X = np.random.randn(1000, 1024)
    Y = np.random.randn(1024, 1000)

    def timer(f):
        return min(timeit.Timer(lambda: f(X, Y)).repeat(3, 5))

    # print('Python:', timer(matmul_trivial)) we will not consider this since it takes infinite time :)
    #print('Numpy:', timer(np.matmul))
    #print('Numba:', timer(matmul_numba))
    print('CUDA:', timer(matmul_gpu))


if __name__ == '__main__':
    os.environ['NUMBAPRO_NVVM'] = '/usr/local/cuda-9.0/nvvm/lib64/libnvvm.so'
    os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/local/cuda-9.0/nvvm/libdevice/'
    matmul_comparison()
Ilya.K.
  • Run your code with `cuda-memcheck`. You should also verify the results numerically before asserting that things are running correctly. You should also provide complete code when asking for debugging assistance. Your code is doing illegal indexing. There are several conceptual problems with the code. – Robert Crovella Dec 24 '19 at 04:04
  • Thank you, I added the code for debugging. I tried some simple examples with at least dimension 2, for example 4x512 and 512x4 with 1024 threads; I did some calculations on paper and it looked fine to me, but maybe I got confused. Illegal indexing, yes, you are right; correct me if I'm wrong, but I should do tid < cols_num in the if statement. Which other conceptual problems did you notice? – Ilya.K. Dec 24 '19 at 04:18
  • For a very big input it gives me a pretty good speedup, 3 times faster than numpy and 13 times better than numba njit. I don't know, though, whether it can be a lot better or not for this kind of problem where I have only one thread block. For small-to-medium input cuda.jit gives the same performance as numpy, and both are much better than njit numba. What do you think about those results, are they expected, good? – Ilya.K. Dec 24 '19 at 04:27
  • None of the code you've shown verifies numerical correctness. – Robert Crovella Dec 24 '19 at 04:39
  • Numerically it isn't correct, and I'm not sure I can even fix it to be so, because I have 3 iterated indexes, and I tried to use each thread to calculate more than one sum. I'm not sure if it's even possible. Is it? I'm thinking about whether I need another inner loop. – Ilya.K. Dec 24 '19 at 05:12
  • @RobertCrovella, as I see now, it was pretty much a bad idea in the first place. The right and most parallelizable way to do this with only one block, for example with a 512x1 matrix and its 1x512 transpose to create a 512x512 matrix, is to first calculate all the multiplications and then reduce them with a Kogge-Stone adder algorithm into the result dimensions, isn't it? – Ilya.K. Dec 24 '19 at 05:45
  • The way I would do this is simply to do an ordinary matrix multiplication but using a grid-stride loop. – Robert Crovella Dec 24 '19 at 13:52

1 Answer


A few general comments:

  • I wouldn't declare that things are working unless I had verified numerical correctness.
  • I think it's good practice to do error checking. The cuda-memcheck tool can check for various errors, even with numba python CUDA codes. Your code throws errors, even for the sizes you suggest are working correctly.
  • A naive matrix multiplication has a fairly typical format and is covered in a variety of places such as in the CUDA programming guide. If I were working on this, I would use that as my starting point, if possible.
  • From a performance perspective, arbitrarily limiting a CUDA code to run on a single threadblock of 1024 threads is a really bad idea. I can't imagine why you would do that.
  • In spite of that, if I wanted to use an arbitrary grid arrangement to process a CUDA algorithm, a canonical technique would be a grid-stride loop (a minimal 1D sketch follows this list).
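
For reference, a 1D grid-stride loop in numba CUDA looks roughly like this. It is only a generic sketch (the scale_kernel name and the scaling operation are made up for illustration), separate from the matrix multiply example further below:

import numpy as np
from numba import cuda

@cuda.jit
def scale_kernel(x, out, a):
    start = cuda.grid(1)        # this thread's global index
    stride = cuda.gridsize(1)   # total number of threads in the grid
    # each thread strides through the data, so any grid size gives full coverage
    for i in range(start, x.shape[0], stride):
        out[i] = a * x[i]

x = np.arange(10000, dtype=np.float32)
d_x = cuda.to_device(x)
d_out = cuda.device_array_like(x)
scale_kernel[1, 1024](d_x, d_out, 2.0)    # works with a single block ...
scale_kernel[40, 256](d_x, d_out, 2.0)    # ... or with many blocks
print(d_out.copy_to_host()[:5])           # 0. 2. 4. 6. 8.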

With respect to your code, a few concerns presented themselves immediately:

  1. For a canonical matrix multiplication, I would normally expect to derive the computation range from the result (C) matrix, not the A matrix. If you restrict yourself to the X*Xt case, then I think your use of A should be OK. In the general case it is not.

  2. It's quite evident to me that you have indexing problems. I'm not going to try to sort them all out or even identify them all, but I have indicated one problem already. Your tid variable covers a range of 0..1023 due to your grid size choice, and that cannot possibly be correct for this indexing pattern: B[tid, col] (excepting the case where the number of rows of B is equal to 1024).

  3. It appears to me that you have the possibility for multiple threads to be writing to the same output location in the C matrix. CUDA does not sort this out for you. You should not expect having multiple threads write to the same output location to work correctly, unless you have taken steps to make it so, either through atomics or via a classical parallel reduction. And I wouldn't want to introduce either of those methods into this problem, so I consider the fundamental approach to be troublesome. (A sketch of what the atomic variant would look like follows this list, purely for illustration.)
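
Purely to illustrate what "atomics" refers to above, and not as a recommendation (it does not address the indexing problems, and the grid-stride version shown later in this answer is the approach I would actually use), here is a self-contained sketch with made-up sizes in which every thread adds one product term into C via cuda.atomic.add, so concurrent updates to the same output location are safe:

import numpy as np
from numba import cuda

@cuda.jit
def matmul_atomic_kernel(A, B, C):
    # one thread per (row, k, col) product term; many threads target the
    # same C[row, col], so the accumulation must be done atomically
    tid = cuda.grid(1)
    stride = cuda.gridsize(1)
    rows = A.shape[0]
    inner = A.shape[1]
    cols = B.shape[1]
    for t in range(tid, rows * inner * cols, stride):
        row = t // (inner * cols)
        rem = t - row * (inner * cols)
        k = rem // cols
        col = rem - k * cols
        cuda.atomic.add(C, (row, col), A[row, k] * B[k, col])

A = np.random.randn(64, 32).astype(np.float32)
B = np.random.randn(32, 48).astype(np.float32)
d_A = cuda.to_device(A)
d_B = cuda.to_device(B)
d_C = cuda.to_device(np.zeros((64, 48), dtype=np.float32))
matmul_atomic_kernel[1, 1024](d_A, d_B, d_C)
print(np.allclose(d_C.copy_to_host(), np.matmul(A, B), rtol=1e-3, atol=1e-3))  # should print True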

There may be other problems as well. But because of consideration 3 above, rather than try to fix your code, I would rather start with the canonical naive matrix multiply and use a grid-stride loop.

Here's an example incorporating those ideas:

$ cat t59.py
import numpy as np
from numba import cuda,jit


@cuda.jit
def matmul_kernel(A, B, C):
    num_of_threads = cuda.gridsize(1)
    tid = cuda.grid(1)
    rows_num = C.shape[0]
    cols_num = C.shape[1]
    idx_range = A.shape[1]
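    # grid-stride loop over the elements of C: each thread handles output
    # element tid, then tid + num_of_threads, and so on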
    for mid in range(tid, rows_num*cols_num, num_of_threads):
        row = mid // cols_num
        col = mid - (row*cols_num)
        my_sum = 0.0
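        # dot product of row 'row' of A with column 'col' of B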
        for idx in range(0, idx_range):
            my_sum += A[row, idx] * B[idx, col]
        C[row, col] = my_sum

def matmul_gpu(X, Y):
    # Allocate the output matrix in GPU memory using cuda.to_device
    #
    # invoke the dot kernel with 1 threadBlock with 1024 threads
    #
    # copy the output matrix from GPU to cpu using copy_to_host()
    gpu_mat1 = cuda.to_device(X)
    gpu_mat2 = cuda.to_device(Y)
    res = np.zeros(shape=(X.shape[0], Y.shape[1]), dtype=np.float32)
    gpu_mult_res = cuda.to_device(res)
    threads_per_block = 1024
    blocks_per_grid = 1
    matmul_kernel[blocks_per_grid, threads_per_block](
        gpu_mat1, gpu_mat2, gpu_mult_res)
    mult_res = gpu_mult_res.copy_to_host()
    return mult_res

wA = 256
hA = 128
wB = hA
hB = wA


mA = np.ones(shape=(hA,wA), dtype=np.float32)
mB = np.ones(shape=(hB,wB), dtype=np.float32)
mC = matmul_gpu(mA,mB)
print(mC)
$ cuda-memcheck python t59.py
========= CUDA-MEMCHECK
[[ 256.  256.  256. ...,  256.  256.  256.]
 [ 256.  256.  256. ...,  256.  256.  256.]
 [ 256.  256.  256. ...,  256.  256.  256.]
 ...,
 [ 256.  256.  256. ...,  256.  256.  256.]
 [ 256.  256.  256. ...,  256.  256.  256.]
 [ 256.  256.  256. ...,  256.  256.  256.]]
========= ERROR SUMMARY: 0 errors
$
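
As a follow-up to the first bullet above, a minimal numerical check could compare the GPU result against np.matmul. This is only a sketch; it assumes the matmul_gpu and matmul_kernel definitions from the listing above are in the same script:

import numpy as np

def check_matmul(rows, inner, cols):
    X = np.random.randn(rows, inner).astype(np.float32)
    Y = np.random.randn(inner, cols).astype(np.float32)
    ref = np.matmul(X, Y)     # CPU reference
    got = matmul_gpu(X, Y)    # GPU result under test
    ok = np.allclose(ref, got, rtol=1e-3, atol=1e-3)
    print((rows, inner, cols), "OK" if ok else "MISMATCH")

# include the shapes reported as failing in the question
for shape in [(128, 256, 128), (256, 128, 256), (256, 256, 256), (1000, 1024, 1000)]:
    check_matmul(*shape)

Also note that because the kernel uses a grid-stride loop, nothing in it depends on the 1-block/1024-thread launch configuration; raising blocks_per_grid in matmul_gpu would use more of the GPU without any change to the kernel.
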
Robert Crovella