The CUDA runtime API allows us to launch kernels using the variable-number-of-arguments triple-chevron syntax:
my_kernel<<<grid_dims, block_dims, shared_mem_size>>>(
first_arg, second_arg, and_as_many, as_we, want_to, etc, etc);
but when it comes to "cooperative" kernels, the CUDA Programming Guide says (section C.3):
To enable grid synchronization, when launching the kernel it is necessary to use, instead of the <<<...>>> execution configuration syntax, the cudaLaunchCooperativeKernel CUDA runtime launch API:

cudaLaunchCooperativeKernel(
    const T *func, dim3 gridDim, dim3 blockDim,
    void **args, size_t sharedMem = 0, cudaStream_t stream = 0)

(or the CUDA driver equivalent).
I would rather not have to write my own wrapper code just to marshal each kernel's arguments into an array of pointers... is there really no facility in the runtime API that avoids this?
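For concreteness, this is roughly the per-kernel boilerplate I mean; a minimal sketch, where my_kernel, its parameters and launch_my_kernel are made up for illustration:

#include <cuda_runtime.h>

__global__ void my_kernel(int *data, float alpha)
{
    // (hypothetical kernel body omitted)
}

cudaError_t launch_my_kernel(
    dim3 grid_dims, dim3 block_dims, size_t shared_mem_size,
    cudaStream_t stream, int *data, float alpha)
{
    // cudaLaunchCooperativeKernel takes the kernel arguments as an array of
    // pointers to them, so each argument must be a named, addressable lvalue.
    void *args[] = { &data, &alpha };
    return cudaLaunchCooperativeKernel(
        (const void *) my_kernel, grid_dims, block_dims,
        args, shared_mem_size, stream);
}

...and a separate wrapper like this for every kernel signature, instead of just writing the arguments between the parentheses as with <<<...>>>.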