NB - see the 2nd comment below; the underlying problem is actually a basic synchronization error that deadlocks the CUDA kernel during memory access (the kernel never finishes executing, so the next one is never launched), as @RobertCrovella points out.
However, this may still be useful for googlers, because the error message I first got suggests something different.
I'm launching a cooperative grid using Numba CUDA, and am getting:
numba.cuda.cudadrv.driver.CudaAPIError: [720] Call to cuLaunchCooperativeKernel results in CUDA_ERROR_COOPERATIVE_LAUNCH_TOO_LARGE
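For context, asking for the grid group is what routes the launch through cuLaunchCooperativeKernel in the first place. A minimal sketch of the pattern (coop_kernel and arr are placeholders, not my real code):

from numba import cuda
import numpy as np

@cuda.jit
def coop_kernel(arr):
    # Requesting the grid group makes Numba launch this kernel
    # via cuLaunchCooperativeKernel.
    g = cuda.cg.this_grid()
    x, y = cuda.grid(2)
    if x < arr.shape[0] and y < arr.shape[1]:
        arr[x, y] += 1.0
    g.sync()  # grid-wide barrier, reached unconditionally by every thread

arr = cuda.to_device(np.zeros((32, 32), dtype=np.float32))
coop_kernel[(1, 1), (32, 32)](arr)  # a 1x1 grid of one 32x32 block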
I'm wondering how to debug this. The code is just a unit test, with a grid of one (1x1) 32x32 block (for which I get a warning that it is too small). I have gotten this error in the past by stepping outside of array bounds, but I have eliminated that possibility in this case. The error occurs during an array access: if I comment out the access, it goes away.
The code runs fine under CUDASIM, and everything is in bounds.
It is unfortunately too complex (and proprietary) for me to paste in full. First of all, I am looking for a good debugging strategy, and/or other things to check. (I will also try to concoct a simpler example.)
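In the meantime, two checks that seem worth doing. A cooperative launch requires every block of the grid to be resident on the device at once, and exceeding that limit is the usual meaning of CUDA_ERROR_COOPERATIVE_LAUNCH_TOO_LARGE; I believe recent Numba versions let you query the limit from a compiled kernel (a sketch, assuming the overloads / max_cooperative_grid_blocks API from the Numba docs; coop_kernel is the placeholder from above):

from numba import cuda

# After the kernel has compiled (e.g. after a first launch), each compiled
# overload can report the maximum number of blocks that a cooperative
# launch of this kernel may use on the current device.
overload = next(iter(coop_kernel.overloads.values()))
print(overload.max_cooperative_grid_blocks((32, 32)))

And since CUDASIM cannot catch device-side memory errors, running the real kernel under NVIDIA's compute-sanitizer (e.g. compute-sanitizer python -m pytest test_infer.py, where test_infer.py stands in for my test file) should confirm or rule out an out-of-bounds access.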
[What I think may be happening is that the cooperative-group code, which is relatively new in Numba, is catching an error it shouldn't be catching...]
UPDATE
In fact, since I only have one block in this trivial example, I don't need a cooperative group at all. I've commented out the cooperative-group code and am left with the following, which hangs:
@cuda.jit(device=True)
def infer(
    minibatch: MiniBatchState,
    config: "TrainingConfig",
    model: "ModelData",
    train: "TrainingState",
) -> None:
    i_off_sample: int
    i_off_edge: int
    i_off_sample, i_off_edge = cuda.grid(2)  # type: ignore
    i_sample = minibatch.i_start_sample + i_off_sample
    n_samples = train.data.features.shape[0]
    # g: cuda.cg.GridGroup = cuda.cg.this_grid()  # type: ignore
    if (
        i_sample >= n_samples
        or i_off_sample > config.minibatch_size
        or i_off_edge >= config.model_width
    ):
        for i_level in range(model.n_levels):
            # g.sync()
            # g.sync()
            cuda.syncthreads()
            cuda.syncthreads()
        return
    else:
        for i_level in range(model.n_levels):
            # g.sync()
            # g.sync()
            cuda.syncthreads()
            cuda.syncthreads()
        return
The arguments are all named tuples or arrays. This hangs. However, if I change the if condition to True, it doesn't hang! Am I misunderstanding synchronization? I thought each thread had to synchronize the same number of times, not that each thread had to synchronize at exactly the same line of code!?
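For posterity, given the NB at the top: the barriers in divergent code are the problem. As I understand it now, cuda.syncthreads() in conditional code is only valid if the condition evaluates identically across the entire block; it is not enough for each thread to hit some barrier the same number of times. A sketch of the shape I believe is safe, with the predicate hoisted so every thread reaches the same barriers (the per-level work is elided):

from numba import cuda

@cuda.jit(device=True)
def infer_safe(minibatch, config, model, train):
    # Hypothetical restructuring of infer() above: compute the in-bounds
    # predicate once, and keep every barrier in uniform (non-divergent)
    # code so all threads of the block reach the same syncthreads() calls.
    i_off_sample, i_off_edge = cuda.grid(2)
    i_sample = minibatch.i_start_sample + i_off_sample
    n_samples = train.data.features.shape[0]
    active = (  # negation of the original early-exit condition
        i_sample < n_samples
        and i_off_sample <= config.minibatch_size
        and i_off_edge < config.model_width
    )
    for i_level in range(model.n_levels):  # n_levels is uniform across threads
        if active:
            pass  # per-level work for in-bounds threads goes here
        cuda.syncthreads()  # reached by every thread of the block
        cuda.syncthreads()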