The CUDA runtime API allows us to launch kernels using the variable-number-of-arguments triple-chevron syntax:
my_kernel<<<grid_dims, block_dims, shared_mem_size>>>(
first_arg, second_arg, and_as_many, as_we, want_to, etc, etc);
but when it comes to "cooperative" kernels, the CUDA Programming Guide says (section C.3):
To enable grid synchronization, when launching the kernel it is necessary to use, instead of the <<<...>>> execution configuration syntax, the cudaLaunchCooperativeKernel CUDA runtime launch API:

cudaLaunchCooperativeKernel(
    const T *func, dim3 gridDim, dim3 blockDim,
    void **args, size_t sharedMem = 0, cudaStream_t stream = 0)

(or the CUDA driver equivalent).
I would rather not have to write my own wrapper code just to marshal each kernel's arguments into an array of pointers... is there really no facility in the runtime API that avoids this?
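For concreteness, this is roughly the per-kernel boilerplate I mean; a minimal sketch, where my_kernel, its parameters and launch_my_kernel are made up for illustration:

#include <cuda_runtime.h>

__global__ void my_kernel(int *data, float alpha)
{
    // (hypothetical kernel body omitted)
}

cudaError_t launch_my_kernel(
    dim3 grid_dims, dim3 block_dims, size_t shared_mem_size,
    cudaStream_t stream, int *data, float alpha)
{
    // cudaLaunchCooperativeKernel takes the kernel arguments as an array of
    // pointers to them, so each argument must be a named, addressable lvalue.
    void *args[] = { &data, &alpha };
    return cudaLaunchCooperativeKernel(
        (const void *) my_kernel, grid_dims, block_dims,
        args, shared_mem_size, stream);
}

...and a separate wrapper like this for every kernel signature, instead of just writing the arguments between the parentheses as with <<<...>>>.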