Memory allocation is not permitted when running kernel with cudaLaunchCooperativeKernel and -rdc=true

Question

An error "operation not permitted" is generated when running the following code. Is there anything I am missing? I'm running it with compute capabilities 7.5 and the command nvcc test.cu -rdc=true. It works without RDC.

#include <cooperative_groups.h>
#include <iostream>

__global__ void kernel() {
  void* x;
  cudaMalloc(&x, sizeof(int));
}

int main() {
  int dev = 0;
  int supportsCoopLaunch = 0;
  cudaDeviceGetAttribute(&supportsCoopLaunch, cudaDevAttrCooperativeLaunch, dev);
  if(supportsCoopLaunch == 0) {
    std::cout << "Device does not support cooperative launch, required to synchronize globally on the grid." << std::endl;
    return 0;
  }

  void* args[] = {};
  dim3 dimBlock(1, 1, 1);
  dim3 dimGrid(2, 1, 1);
  cudaError_t e = cudaLaunchCooperativeKernel((void*)kernel, dimGrid, dimBlock, args);
  if (e != cudaSuccess) {
    printf("CUDA runtime error %s\n", cudaGetErrorString(e));
  }
  cudaDeviceSynchronize();
  return 0;
}
``

Also, the documentation for cudaLaunchCooperativeKernel says it doesn't return cudaErrorNotPermitted, meaning that the error is likely not what you think it is, nor is it being produced where you think it is. Why don't you add proper error checking to *all* the API calls in your code? — talonmies, Apr 06 '23 at 10:25
See [What is the canonical way to check for errors using the CUDA runtime API?](https://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api). — paleonix, Apr 06 '23 at 10:38
@talonmies actually the doc says "Note that as specified by cudaStreamAddCallback no CUDA function may be called from callback. cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case. " I don't think it is relevant to my code though. — Pierre T., Apr 06 '23 at 11:34
I'm going to update the post, but actually the error is only generated with the option `-rdc=true` — Pierre T., Apr 06 '23 at 11:34
CUDA has a documented limitation that cooperative groups [cannot launch a CDP kernel](https://stackoverflow.com/questions/70830289/can-i-use-cooperative-groupssyncgrid-in-child-kernel-cuda-dynamic-paralleli). You are using the runtime API within your kernel (ie. device API) and this is [formally documented in the CDP section](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#api-reference). My suggestion would be to use in-kernel `new` or `malloc`, which should have equivalent behavior. Both are possible in a CG kernel like yours. — Robert Crovella, Apr 06 '23 at 14:46
Thanks. Using `malloc` seems to solve the problem, however I don't understand what dynamic parallelism has to do with the code shown above. Is `cudaLaunchCooperativeKernel` considered to be dynamic parallelism, or use it behind the stage? — Pierre T., Apr 07 '23 at 08:55
No, coop. launch and CDP are mutually exclusive. And for some reason using the `cudaMalloc` (i.e. the runtime API) inside a kernel counts as CDP, it seems. — paleonix, Apr 07 '23 at 10:18
I'm not suggesting there is any connection between CDP and your code. I was advancing some guesswork to try and frame this odd behavior. If you feel badly about using the completely equivalent `malloc` or `new` in place of `cudaMalloc` in your example, I suggest [filing a bug](https://forums.developer.nvidia.com/t/how-to-report-a-bug/67911). — Robert Crovella, Apr 07 '23 at 13:44

score 0 · Accepted Answer · answered Apr 10 '23 at 08:24

0

Use malloc instead of cudaMalloc. This is probably due to a bug (see comments by Robert Crovella above).

answered Apr 10 '23 at 08:24

Pierre T.

380
1
13

Memory allocation is not permitted when running kernel with cudaLaunchCooperativeKernel and -rdc=true

1 Answers1