overload cuda kernel function

Question

I ran into a problem when using overloaded kernel functions in CUDA.

I understand that CUDA can pick the right overload at launch time from the arguments.

However, I would also like to use cudaOccupancyMaxPotentialBlockSize() to calculate the block size for maximum occupancy (see doc), and that function receives only the kernel name, without any arguments:

__global__ void foo_cuda_kernel(int a)
{
  /*implementation 1*/
}

//overloaded kernel function
__global__ void foo_cuda_kernel(int a, int b)
{
  /*implementation 2*/
}

void foo_cuda()
{
  int min_grid_size, grid_size, block_size;
  cudaOccupancyMaxPotentialBlockSize
  (
    &min_grid_size, &block_size, 
    foo_cuda_kernel, //how does it distinguish overloaded functions?
    0, thread_num
  );
  grid_size = (thread_num + block_size - 1) / block_size;
  
  //I can understand compiler can distinguish the launched function by its arguments
  foo_cuda_kernel<<<grid_size, block_size>>>((int)1);
  cudaDeviceSynchronize();
}

How can I make this work? How does cudaOccupancyMaxPotentialBlockSize() distinguish between overloaded functions?

Does this answer your question? [How do I specify a pointer to an overloaded function?](https://stackoverflow.com/questions/2942426/how-do-i-specify-a-pointer-to-an-overloaded-function) — apple apple, Jan 20 '21 at 17:34

talonmies · Answer 1 · 2021-01-22T00:16:16.810

As noted in comments, you can cast the function to a pointer to the correct overload:

auto foo_ii = static_cast<void (*)(int, int)>(&foo_cuda_kernel);
auto foo_i = static_cast<void (*)(int)>(&foo_cuda_kernel);

You then pass either foo_i or foo_ii to cudaOccupancyMaxPotentialBlockSize, depending on which version of the function you require.

This will work because the toolchain silently emits host boilerplate functions which wrap the underlying runtime API calls to run a kernel and enforce kernel argument type checking. The host compiler treats these wrappers like any other host function (because they are), and selects the matching version automagically.
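Putting the cast together with the code from the question, a minimal sketch could look like the following. It reuses the two overloads of foo_cuda_kernel. The error-returning signature of foo_cuda and the thread_num parameter are assumptions added for illustration. The last argument of cudaOccupancyMaxPotentialBlockSize is the block size limit, so this sketch passes 0 (no limit) instead of thread_num.

```cuda
#include <cuda_runtime.h>

__global__ void foo_cuda_kernel(int a)
{
  /* implementation 1 */
}

// overloaded kernel function
__global__ void foo_cuda_kernel(int a, int b)
{
  /* implementation 2 */
}

// Hypothetical helper (not from the question): returns the first CUDA
// error it encounters, or cudaSuccess. Assumes thread_num > 0.
cudaError_t foo_cuda(int thread_num)
{
  // Pick the one-argument overload explicitly. Without the cast, the
  // template parameter of cudaOccupancyMaxPotentialBlockSize cannot be
  // deduced from an overloaded name.
  auto foo_i = static_cast<void (*)(int)>(&foo_cuda_kernel);

  int min_grid_size = 0, block_size = 0;
  cudaError_t err = cudaOccupancyMaxPotentialBlockSize(
      &min_grid_size, &block_size, foo_i,
      0,   // dynamic shared memory per block, in bytes
      0);  // block size limit; 0 means no limit
  if (err != cudaSuccess) return err;

  int grid_size = (thread_num + block_size - 1) / block_size;

  // At launch, the overload is resolved from the argument list as usual.
  foo_cuda_kernel<<<grid_size, block_size>>>(1);
  err = cudaGetLastError();
  if (err != cudaSuccess) return err;

  return cudaDeviceSynchronize();
}
```

Calling foo_cuda(1024) from host code should then size the launch for the one-argument kernel. To target the two-argument version, change the cast to void (*)(int, int) and pass two arguments at launch.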
