
NB: see the second comment below. The underlying problem is actually a basic synchronization error, as @RobertCrovella points out: with `cuda.syncthreads()` called in divergent code, one kernel never finishes executing and the next never gets launched.

However, this may still be useful for anyone searching for the error, because the message I first got suggests a different cause.


I'm launching a cooperative group grid using numba CUDA, and am getting

numba.cuda.cudadrv.driver.CudaAPIError: [720] Call to cuLaunchCooperativeKernel results in CUDA_ERROR_COOPERATIVE_LAUNCH_TOO_LARGE

I'm wondering how to debug this. The code is just a unit test, with a grid of one (1x1) 32x32 block (for which I get a warning that it is too small). I have gotten this error in the past by stepping outside of array bounds, but I have eliminated that possibility here. The error occurs during an array access: if I comment out the access, I don't get it.

The code runs fine under CUDASIM, and everything is in bounds.

It is unfortunately too complex (and proprietary) for me to paste as a whole. First of all, I am looking for a good debugging strategy, and/or other things to check. (I will try to concoct a simpler example as well.)
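One thing worth checking against the nominal meaning of the error: `CUDA_ERROR_COOPERATIVE_LAUNCH_TOO_LARGE` is normally raised when the grid has more blocks than the device can co-schedule for a cooperative launch. If I'm reading the numba docs right, each compiled kernel overload exposes a `max_cooperative_grid_blocks()` method to query that limit; a minimal sketch (`tiny_kernel` here is a placeholder, not my real code):

from numba import cuda, void, int32

@cuda.jit(void(int32[:, ::1]))
def tiny_kernel(arr):
    # Using this_grid() is what makes numba launch the kernel cooperatively.
    g = cuda.cg.this_grid()
    g.sync()

# With an eager signature there is exactly one overload to query.
overload = tiny_kernel.overloads[(int32[:, ::1],)]
print(overload.max_cooperative_grid_blocks(blockdim=(32, 32)))

In my case the grid is a single block, so this limit shouldn't be the issue.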

[What I think may be happening is that the cooperative group code, which is relatively new in numba, is catching an error it shouldn't be catching....]

UPDATE

In fact, since I have only one block in this trivial example, I don't need a cooperative group at all. I've commented out the cooperative-group code and am left with the following case, which hangs:

from numba import cuda

@cuda.jit(device=True)
def infer(
    minibatch: MiniBatchState,
    config: "TrainingConfig",
    model: "ModelData",
    train: "TrainingState",
) -> None:
    i_off_sample: int
    i_off_edge: int
    i_off_sample, i_off_edge = cuda.grid(2)  # type: ignore
    i_sample = minibatch.i_start_sample + i_off_sample
    n_samples = train.data.features.shape[0]
    # g: cuda.cg.GridGroup = cuda.cg.this_grid()  # type: ignore
    if (
        i_sample >= n_samples
        or i_off_sample > config.minibatch_size
        or i_off_edge >= config.model_width
    ):
        for i_level in range(model.n_levels):
            # g.sync()
            # g.sync()
            cuda.syncthreads()
            cuda.syncthreads()
        return
    else:
        for i_level in range(model.n_levels):
            # g.sync()
            # g.sync()
            cuda.syncthreads()
            cuda.syncthreads()
        return

The arguments are all named tuples or arrays. This hangs. However, if I make the `if` condition a constant `True`, then it doesn't hang! Am I misunderstanding synchronization? I thought each thread must synchronize the same number of times, not that each thread must synchronize using exactly the same line of code!?
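Given the rule Robert cites in the comments below, the fix is to hoist the barriers out of the divergent branch so that every thread in the block reaches the same `cuda.syncthreads()` calls unconditionally (and likewise for `g.sync()` across the whole grid in the cooperative case). A minimal sketch of how the device function above could be restructured, with the per-level work elided:

@cuda.jit(device=True)
def infer(minibatch, config, model, train):
    i_off_sample, i_off_edge = cuda.grid(2)  # type: ignore
    i_sample = minibatch.i_start_sample + i_off_sample
    n_samples = train.data.features.shape[0]
    active = (
        i_sample < n_samples
        and i_off_sample <= config.minibatch_size
        and i_off_edge < config.model_width
    )
    # model.n_levels is uniform across threads, so every thread makes the
    # same number of loop iterations.
    for i_level in range(model.n_levels):
        if active:
            pass  # ... per-level work for in-bounds threads only ...
        # Unconditional: every thread in the block reaches the same barriers.
        cuda.syncthreads()
        cuda.syncthreads()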

shaunc
  • I would probably start by running the code under `compute-sanitizer`. – Robert Crovella Jan 12 '23 at 04:42
  • a `__syncthreads()` in conditional code is illegal unless all threads in the threadblock evaluate the condition exactly the same. This has been true since day 1 in CUDA. It's covered in the programming guide, [here](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#synchronization-functions): "`__syncthreads()` is allowed in conditional code but only if the conditional evaluates identically across the entire thread block, otherwise the code execution is likely to hang or produce unintended side effects." – Robert Crovella Jan 12 '23 at 04:58
  • @RobertCrovella - thanks! My misunderstanding. If you write that up as an answer I will accept it. – shaunc Jan 12 '23 at 14:04
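(For the record, the `compute-sanitizer` suggestion is a good one for this class of bug: its synccheck tool, e.g. `compute-sanitizer --tool synccheck python my_test.py`, where `my_test.py` stands in for your own script, is designed to report divergent barrier usage like the above.)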
