I'm trying to write a cuda.jit matrix multiplication with a bound on my number of thread blocks: it can only be one. I also know that my multiplication is of the form X times X-transpose.
import os
import timeit
import numpy as np
from numba import cuda

def matmul_gpu(X, Y):
    # Allocate the output matrix in GPU memory using cuda.to_device,
    # invoke the dot kernel with 1 thread block of 1024 threads,
    # then copy the output matrix from GPU to CPU using copy_to_host()
    gpu_mat1 = cuda.to_device(X)
    gpu_mat2 = cuda.to_device(Y)
    res = np.zeros(shape=(X.shape[0], Y.shape[1]), dtype=np.float32)
    gpu_mult_res = cuda.to_device(res)
    threads_per_block = 1024
    blocks_per_grid = 1
    matmul_kernel[blocks_per_grid, threads_per_block](
        gpu_mat1, gpu_mat2, gpu_mult_res)
    mult_res = gpu_mult_res.copy_to_host()
    return mult_res
@cuda.jit
def matmul_kernel(A, B, C):
    num_of_threads = cuda.gridsize(1)
    tid = cuda.grid(1)
    rows_num = A.shape[0]
    cols_num = A.shape[1]
    step = int(np.math.ceil(num_of_threads / cols_num))
    row = int(np.math.floor(tid / cols_num))
    col = int(tid % cols_num)
    for row_start_idx in range(0, rows_num, step):
        if row_start_idx + row < rows_num and col < cols_num:
            C[row_start_idx + row, col] += A[row_start_idx + row, tid] * B[tid, col]
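For comparison, here is a pure-Python model (the function name and structure are my own, not from the code above) of the tiling scheme the kernel appears to intend: with one block of `threads` threads, each thread owns one (row-within-tile, col) pair, the tile of threads is stepped down the rows, and the inner product over the shared dimension is written out as an explicit loop rather than indexed by `tid`:

```python
import math
import numpy as np

def matmul_single_block_model(A, B, threads=1024):
    """CPU model of a single-block tiling: thread tid covers the entry
    (row, col) = (tid // cols_num, tid % cols_num) within each row tile."""
    rows_num, inner_dim = A.shape
    cols_num = B.shape[1]
    C = np.zeros((rows_num, cols_num), dtype=np.float64)
    step = math.ceil(threads / cols_num)   # rows covered per tile
    for tid in range(threads):             # on the GPU these run in parallel
        row = tid // cols_num
        col = tid % cols_num
        for row_start_idx in range(0, rows_num, step):
            r = row_start_idx + row
            if r < rows_num and col < cols_num:
                # full dot product over the shared dimension for this entry
                C[r, col] = sum(A[r, k] * B[k, col] for k in range(inner_dim))
    return C
```

This is only a sketch of the intended decomposition, not a drop-in kernel; note that each output entry here accumulates over all of `inner_dim`, whereas the kernel above uses `tid` itself as the reduction index.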
It crashes for matrices of dimensions 128×256 or 256×128, and it throws the following errors, in that order, with the traceback:
...
Call to cuMemcpyDtoH results in UNKNOWN_CUDA_ERROR
...
Call to cuMemFree results in UNKNOWN_CUDA_ERROR
It works for very large dimensions like 1024×2048 and 2048×1024, and it works great for inputs with equal dimensions, but with different sizes it sometimes throws the errors mentioned above. It almost never throws errors for equal dimensions, except for 256×256, which I just noticed, so the problem should be related to those.
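One way to probe the reported pattern: in the kernel, `tid` itself is used as the column index into `A` (and the row index into `B`), so any thread with `tid >= A.shape[1]` would read out of bounds, assuming that is the cause of the crash. A pure-Python model of that index arithmetic (the helper name is mine, not part of the code above) counts such threads for the shapes mentioned:

```python
def out_of_bounds_threads(rows_num, cols_num, threads=1024):
    """Count threads whose A[..., tid] read in the kernel would fall
    outside A's columns (the kernel's guard only checks row and col)."""
    oob = 0
    for tid in range(threads):
        col = tid % cols_num
        # mirrors the kernel's condition: col < cols_num is always true,
        # but tid is only a valid column index when tid < cols_num
        if col < cols_num and tid >= cols_num:
            oob += 1
    return oob

print(out_of_bounds_threads(128, 256))    # 768 threads read past A's columns
print(out_of_bounds_threads(256, 256))    # 768 -- the crashing 256x256 case
print(out_of_bounds_threads(1000, 1024))  # 0 -- matches the case that works
print(out_of_bounds_threads(1024, 2048))  # 0 -- matches the case that works
```

Under this model, the shapes that crash are exactly those whose inner dimension is smaller than the 1024 launched threads, which would explain the pattern described above.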
Code for debugging assistance:
# this is the comparison function - keep it as it is, don't change X or Y.
def matmul_comparison():
    X = np.random.randn(1000, 1024)
    Y = np.random.randn(1024, 1000)

    def timer(f):
        return min(timeit.Timer(lambda: f(X, Y)).repeat(3, 5))

    # print('Python:', timer(matmul_trivial)) we will not consider this since it takes infinite time :)
    # print('Numpy:', timer(np.matmul))
    # print('Numba:', timer(matmul_numba))
    print('CUDA:', timer(matmul_gpu))

if __name__ == '__main__':
    os.environ['NUMBAPRO_NVVM'] = '/usr/local/cuda-9.0/nvvm/lib64/libnvvm.so'
    os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/local/cuda-9.0/nvvm/libdevice/'
    matmul_comparison()