
I am experimenting with how to use CUDA inside Numba, but something happened that I do not understand. Here is my code:

import numba
from numba import cuda

@cuda.jit
def matmul(A, B, C):
    """Perform square matrix multiplication of C = A * B"""
    d = cuda.local.array((3, 3), dtype=numba.float64)
    i, j = cuda.grid(2)
    if i < C.shape[0] and j < C.shape[1]:
        tmp = 0.
        for k in range(A.shape[1]):
            tmp += A[i, k] * B[k, j]
        C[i, j] = tmp

This is the matrix multiplication function I defined myself for testing numba.cuda. Before running the tests, I also loaded the arrays with the following code:

import numpy as np

a = np.random.rand(2000, 2000)
b = np.random.rand(2000, 2000)
c = np.empty((2000, 2000))
a1 = cuda.to_device(a)
b1 = cuda.to_device(b)
c1 = cuda.to_device(c)

Then I used the following code for experiment:

from time import time

count = 0
start = time()
for i in range(2000):
    matmul[(256, 256), (16, 16)](a1, b1, c1)
    count += 1
    print(count)

The loop ran just fine for the first 1028 iterations, but the 1029th raised an error with the following message:

[screenshots of the error message]
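The failure surfacing only at launch 1029 is consistent with kernel launches being asynchronous: each call returns immediately after depositing the launch into a queue, so an error from an earlier kernel is only reported once the queue fills and the host finally blocks. A minimal pure-Python model of that behavior (the queue depth of 1024 and the class name are assumptions for illustration, not the real driver internals):

```python
from collections import deque

QUEUE_DEPTH = 1024  # assumed launch-queue depth, for illustration only

class FakeLaunchQueue:
    """Toy model of an asynchronous launch queue: enqueueing succeeds
    until the queue is full, and only then does the caller see the
    error produced by a kernel that failed much earlier."""

    def __init__(self):
        self.queue = deque()
        self.launched = 0

    def launch(self):
        self.launched += 1
        if len(self.queue) >= QUEUE_DEPTH:
            # queue full: the host blocks and the pending error surfaces
            raise RuntimeError("timeout reported at launch %d" % self.launched)
        self.queue.append(self.launched)

q = FakeLaunchQueue()
count = 0
try:
    for _ in range(2000):
        q.launch()
        count += 1
except RuntimeError:
    pass
print(count)  # 1024 successful enqueues before the error surfaces
```

The exact count at which the error appears (1028 here vs. 1024 in the toy model) depends on the driver's actual buffering, but the mechanism is the same: the reported launch number reflects queue depth, not the launch that actually failed.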

Here is my CUDA device information, queried via numba.cuda:

from numba import cuda
gpu = cuda.get_current_device()
print("name = %s" % gpu.name)
print("maxThreadsPerBlock = %s" % str(gpu.MAX_THREADS_PER_BLOCK))
print("maxBlockDimX = %s" % str(gpu.MAX_BLOCK_DIM_X))
print("maxBlockDimY = %s" % str(gpu.MAX_BLOCK_DIM_Y))
print("maxBlockDimZ = %s" % str(gpu.MAX_BLOCK_DIM_Z))
print("maxGridDimX = %s" % str(gpu.MAX_GRID_DIM_X))
print("maxGridDimY = %s" % str(gpu.MAX_GRID_DIM_Y))
print("maxGridDimZ = %s" % str(gpu.MAX_GRID_DIM_Z))
print("maxSharedMemoryPerBlock = %s" % str(gpu.MAX_SHARED_MEMORY_PER_BLOCK))
print("asyncEngineCount = %s" % str(gpu.ASYNC_ENGINE_COUNT))
print("canMapHostMemory = %s" % str(gpu.CAN_MAP_HOST_MEMORY))
print("multiProcessorCount = %s" % str(gpu.MULTIPROCESSOR_COUNT))
print("warpSize = %s" % str(gpu.WARP_SIZE))
print("unifiedAddressing = %s" % str(gpu.UNIFIED_ADDRESSING))
print("pciBusID = %s" % str(gpu.PCI_BUS_ID))
print("pciDeviceID = %s" % str(gpu.PCI_DEVICE_ID))

and the output is:

name = b'GeForce GTX 1050 Ti'
maxThreadsPerBlock = 1024
maxBlockDimX = 1024
maxBlockDimY = 1024
maxBlockDimZ = 64
maxGridDimX = 2147483647
maxGridDimY = 65535
maxGridDimZ = 65535
maxSharedMemoryPerBlock = 49152
asyncEngineCount = 2
canMapHostMemory = 1
multiProcessorCount = 6
warpSize = 32
unifiedAddressing = 1
pciBusID = 3
pciDeviceID = 0

  • Can you check the memory usage? I have a feeling that the problem may be the `tmp`. – not_a_bot_no_really_82353 Sep 08 '18 at 21:38
  • 1
    It usually means that you are running the GPU kernel on a GPU that is also hosting a display. Such kernels are limited to about 2 seconds of duration, or you will get a timeout. This kernel seems to take more than 2 seconds to run. The reason you get up to 1028 is that the kernel launches are asynchronous, deposited into a queue, so you're witnessing queue depth. If you are on windows, google WDDM TDR If you are on linux, see [here](https://nvidia.custhelp.com/app/answers/detail/a_id/3029/~/using-cuda-and-x). – Robert Crovella Sep 08 '18 at 21:38
  • I have checked the GPU memory; it does not change much, only by about 2%. – Peter Deng Sep 08 '18 at 21:41
  • @RobertCrovella Is there any way to fix that without changing the number of for-loop iterations? – Peter Deng Sep 08 '18 at 21:43
  • what is your OS? Did you do any of the things I suggested? – Robert Crovella Sep 08 '18 at 21:44
  • @RobertCrovella I am using windows. What exactly should I do? I am very new to cuda programming. – Peter Deng Sep 08 '18 at 21:46
  • you should google "cuda wddm tdr" and start reading. The very first link will be instructive, although you may not be using visual studio, it's a good overview. – Robert Crovella Sep 08 '18 at 21:50
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/179691/discussion-between-peter-deng-and-robert-crovella). – Peter Deng Sep 08 '18 at 22:01
  • OP is using a school computer and does not have admin privilege so is unable to change WDDM TDR timeout period. Another option is to reduce the kernel duration to less than 2 seconds, e.g. by reducing problem size in blocks from (256,256) to (16,16) – Robert Crovella Sep 08 '18 at 22:50
  • I have a bit out of scope question, what is the use of d=cuda.local.array((3,3),dtype=numba.float64) – hend Sep 13 '20 at 12:51
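As a side note on sizing the launch: with (16, 16) threads per block, a grid of (256, 256) blocks spans 4096×4096 threads, roughly 4× more blocks than a 2000×2000 matrix needs. A minimal sketch of the usual ceiling-division sizing (pure Python arithmetic, no GPU required; the variable names are illustrative):

```python
import math

threads_per_block = (16, 16)   # block shape used in the question
matrix_shape = (2000, 2000)    # shape of C

# ceiling division: smallest grid whose threads cover the whole matrix
blocks_per_grid = tuple(
    math.ceil(n / t) for n, t in zip(matrix_shape, threads_per_block)
)
print(blocks_per_grid)     # (125, 125)
print(256 * 16, 125 * 16)  # 4096 threads per axis vs. the 2000 needed
```

The kernel would then be launched as `matmul[blocks_per_grid, threads_per_block](a1, b1, c1)`, and the bounds check inside the kernel handles any overhang. Shrinking the grid this way reduces per-launch work, but on its own it does not remove the WDDM TDR limit discussed in the comments above.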

0 Answers