
I am trying to design a CUDA framework that accepts user functions and forwards them to the kernel through device function pointers. CUDA can work with variadic templates (-std=c++11), and so far so good.

However, I hit a problem when the kernel calls the device function pointer. The kernel apparently runs without problems, but GPU usage is 0%. If I simply replace the callback pointer with the actual function, GPU usage is 99%. The code here is very simple, and the large loop range is just to make things measurable. I measured the GPU status with:

nvidia-smi --query-gpu=utilization.gpu,utilization.memory,memory.used --format=csv -lms 100 -f out.txt

IIRC, the user function needs to be in the same translation unit as the kernel (#included, perhaps) for nvcc to succeed. func_d is right there in the source, and it compiles and runs fine, apart from not working through the function pointer (which is the whole point of this design).

My question is: Why is the kernel with the callback device function pointer not working?

Note that when I print both the callback and func_d addresses, they are the same, as in this sample output:

size of Args = 1
callback() address = 4024b0
func_d()   address = 4024b0

Another weird thing: if one uncomments the callback() call in kernel(), GPU usage drops back to 0%, even with the func_d() call still in there... The func_d version takes about 4 seconds to run, whereas the callback version finishes almost immediately (~0.1 s).

System specs and compilation command are in the head of the code below.

Thanks!

// compiled with:
// nvcc -g -G -O0 -std=c++11 -arch=sm_20 -x cu sample.cpp
//
// Nvidia Quadro 6000 (compute capability 2.0)
// CUDA 6.5 (V6.5.12),
// Arch Linux, Nvidia driver 343.22-4, gcc 4.9.1
// Nov, 2014


#include <stdio.h>

__device__
void func_d(double* vol)
{
  *vol += 5.4321f;
}


// CUDA kernel function
template <typename... Types>
__global__ void kernel( void (*callback)(Types*...) )
{
  double val0 = 1.2345f;

//  // does not use gpu (0% gpu utilization)
//  for ( int i = 0; i < 1000000; i++ ) {
//    callback( &val0 );
//  }

  // uses gpu (99% gpu utilization)
  for ( int i = 0; i < 10000000; i++ ) {
    func_d( &val0 );
  }
}


// host function
template <typename... Types>
void host_func( void (*callback)(Types*...) )
{
  // get user kernel number of arguments.
  constexpr int I = sizeof...(Types);
  printf("size of Args = %d\n",I);

  printf("callback() address = %x\n",callback);
  printf("func_d()   address = %x\n",func_d);

  dim3 nblocks = 100;
  int nthread = 100;
  kernel<Types...><<<nblocks,nthread>>>( callback );
}


__host__
int main(int argc, char** argv)
{
  host_func(func_d);
}

1 Answer


My question is: Why is the kernel with the callback device function pointer not working?

There are probably several issues to address, but the simplest answer is that it is illegal to take the address of device entities in host code. This is true for device variables as well as device functions. You can take the address of those entities in host code, but the address is garbage: it is not usable on the host or on the device. If you attempt to use it anyway, you'll get undefined behavior on the device, which will usually bring your kernel to a halt.

Host addresses may be observed in host code. Device addresses may be observed in device code. Any other behavior requires API intervention.
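
For illustration, here is a minimal sketch of one such API route, reusing the func_d from the question (this snippet is an addition, not code from the question): take the address in device code, via the initializer of a __device__ variable, and fetch it to the host with cudaMemcpyFromSymbol:

// Sketch: the address is taken in device code and retrieved through the API.
__device__ void (*func_d_ptr)(double*) = func_d;  // initialized on the device

// host side:
void (*h_ptr)(double*);
cudaMemcpyFromSymbol(&h_ptr, func_d_ptr, sizeof(h_ptr));
// h_ptr now holds a device-valid address; it can be passed as a kernel
// argument, but it must not be dereferenced on the host.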

  1. You appear to be using the nvidia-smi utilization query as a measure of whether or not things are running correctly. I would suggest doing proper cuda error checking instead (a minimal checking macro is sketched after this list), and you may also wish to run your code with cuda-memcheck.

  2. "Why then does the address of func_d match the address of callback?" Because you are taking both addresses in host code, and both addresses are garbage. To convince yourself of this, add a line something like this at the very end of your kernel:

    if ((!threadIdx.x)&&(!blockIdx.x)) printf("in-kernel func_d()   address = %x\n",func_d);
    

    and you will see that it prints out something different from what is being printed on the host.

  3. "What about the device utilization?" As soon as the device encounters an error, the kernel terminates, and utilization goes to zero. Hopefully this will explain this statement for you: "Another weird thing is, if one uncomments the callback() call in kernel() then GPU usage is back to 0%, even with the func_d() call still in there... "

  4. "How can I fix this?" I don't know of a great way to fix this. If you have a limited number of CUDA functions, known at compile time, that you want the user to be able to select from, then the appropriate thing is probably to create an appropriate index and use it to select the function (see the dispatch sketch after this list). If you really want to, you can run a preliminary/setup kernel that takes the address of the functions you care about, pass those addresses back to host code, and use them as parameters in subsequent kernel calls; this should allow your mechanism to work. But I don't see how it avoids the need to index through a set of pre-defined functions known at compile time. If the direction you are headed in is that you want the user to be able to provide user-defined functions at runtime, I think you will find that quite difficult to do at the moment with the CUDA runtime API (I suspect this is likely to change in the future). I provided a rather contorted mechanism to try to do this here (read the whole question and answer; talonmies' answer there is informative as well). If, on the other hand, you are willing to use the CUDA driver API, then it should be possible, although somewhat involved, since this is exactly what is done in a very elegant fashion in PyCUDA, for example.

  5. In the future, please indent your code.
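
To make item 1 concrete, here is the kind of minimal error-checking macro I have in mind (a sketch, not code from the question; assumes <stdio.h> and <stdlib.h> are included):

#define cudaCheck(call) do {                                    \
    cudaError_t err = (call);                                   \
    if (err != cudaSuccess) {                                   \
      fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
              cudaGetErrorString(err), __FILE__, __LINE__);     \
      exit(EXIT_FAILURE);                                       \
    }                                                           \
  } while (0)

// typical usage around a kernel launch:
//   kernel<<<nblocks,nthread>>>( callback );
//   cudaCheck(cudaGetLastError());        // catches launch errors
//   cudaCheck(cudaDeviceSynchronize());   // catches errors during execution

And for item 4, a sketch of the compile-time index approach (f0 and f1 are hypothetical placeholders): the host passes only an integer, and the selection happens entirely in device code, so no function address ever crosses the host/device boundary:

__device__ void f0(double* v) { *v += 1.0; }
__device__ void f1(double* v) { *v *= 2.0; }

__global__ void dispatch_kernel(int which)
{
  double val = 1.2345;
  switch (which) {  // selection happens entirely in device code
    case 0: f0(&val); break;
    case 1: f1(&val); break;
  }
  if ((!threadIdx.x)&&(!blockIdx.x)) printf("result = %f\n", val);
}

// host side: dispatch_kernel<<<1,32>>>(1);  // selects f1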

Here's a fully worked example, demonstrating a few of the ideas above. In particular, I am showing, in rather crude fashion, that the func_d address can be taken in device code, passed back to the host, and then used as a kernel parameter to successfully select and call that device function.

$ cat t595.cu
// compiled with:
// nvcc -g -G -O0 -std=c++11 -arch=sm_20 -x cu sample.cpp
//
// Nvidia Quadro 6000 (compute capability 2.0)
// CUDA 6.5 (V6.5.12),
// Arch Linux, Nvidia driver 343.22-4, gcc 4.9.1
// Nov, 2014


#include <stdio.h>

__device__
void func_d(double* vol)
{
  if ((!threadIdx.x) && (!blockIdx.x)) printf("value = %f\n", *vol);
  *vol += 5.4321f;
}

template <typename... Types>
__global__ void setup_kernel(void (**my_callback)(Types*...)){
  *my_callback = func_d;
}

// CUDA kernel function
template <typename... Types>
__global__ void kernel( void (*callback)(Types*...) )
{
  double val0 = 1.2345f;

//  // does not use gpu (0% gpu utilization)
//  for ( int i = 0; i < 1000000; i++ ) {
  callback( &val0 );
//  }

  val0 = 0.0f;
// uses gpu (99% gpu utilization)
//  for ( int i = 0; i < 10000000; i++ ) {
    func_d( &val0 );
//  }
  if ((!threadIdx.x)&&(!blockIdx.x)) printf("in-kernel func_d()   address = %x\n",func_d);
}


// host function
template <typename... Types>
void host_func( void (*callback)(Types*...) )
{
// get user kernel number of arguments.
  constexpr int I = sizeof...(Types);
  printf("size of Args = %d\n",I);

  printf("callback() address = %x\n",callback);
  printf("func_d()   address = %x\n",func_d);

  dim3 nblocks = 100;
  int nthread = 100;
  unsigned long long *d_callback, h_callback;
  cudaMalloc(&d_callback, sizeof(unsigned long long));
  setup_kernel<<<1,1>>>((void (**)(Types*...))d_callback);
  cudaMemcpy(&h_callback, d_callback, sizeof(unsigned long long), cudaMemcpyDeviceToHost);
  kernel<Types...><<<nblocks,nthread>>>( (void (*)(Types*...))h_callback );
  cudaDeviceSynchronize();
}


__host__
int main(int argc, char** argv)
{
  host_func(func_d);
}
$ nvcc -std=c++11 -arch=sm_20 -o t595 t595.cu
$ cuda-memcheck ./t595
========= CUDA-MEMCHECK
size of Args = 1
callback() address = 4025dd
func_d()   address = 4025dd
value = 1.234500
value = 0.000000
in-kernel func_d()   address = 4
========= ERROR SUMMARY: 0 errors
$
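
As an aside, the unsigned long long round trip in host_func above is just crude shorthand (see the comments below); a sketch of the same function with the pointer kept properly typed, under the same setup_kernel/kernel definitions, would look like:

template <typename... Types>
void host_func( void (*callback)(Types*...) )
{
  typedef void (*cb_t)(Types*...);     // the callback's actual type
  cb_t *d_callback, h_callback;
  cudaMalloc(&d_callback, sizeof(cb_t));
  setup_kernel<<<1,1>>>(d_callback);   // writes func_d's device-side address
  cudaMemcpy(&h_callback, d_callback, sizeof(cb_t), cudaMemcpyDeviceToHost);
  kernel<Types...><<<100,100>>>(h_callback);  // no casts needed
  cudaDeviceSynchronize();
  cudaFree(d_callback);
}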
  • I appreciate your reply. I was not aware of the fact that if a kernel gets invalid input it will quietly terminate; that's why I was initially confused. Your reply is spot-on about the problem (namely device/host memory). I happened to look over the CUDA SDK "simpleSeparateCompilation" sample, which also uses function pointers. The thing is that, as you pointed out, there must be a device function pointer assignment at **compile time**. I am looking at tools like _cproto_ to get the user function prototype, explicitly instantiate the template, and later do the proper setup. Thanks! – Brevirt Nov 08 '14 at 22:29
  • quick ones: Why did you set `*d_callback` as `unsigned long long`? Is that also why you cast it to `(void(**)(Types*...))`, as well as `h_callback`? – Brevirt Nov 08 '14 at 22:33
  • No good reason. I was just being crude and lazy. You would not do this with good code, but then there would also be no reason to pass device function addresses to the host and then back to the device again. – Robert Crovella Nov 08 '14 at 22:37