I am trying to design a cuda framework which would accept user functions and forward them to the kernel, through device function pointers. CUDA can work with variadic templates (-stc=c++11) and so far so good.
However, I hit a problem when the kernel calls the device function pointer. Apparently the kernel runs with no problem, but the GPU usage is 0%. If I simply replace the callback pointer with the actual function then GPU usage is 99%. The code here is very simple and the large loop range is simply to make things measurable. I measured the gpu status with:
nvidia-smi --query-gpu=utilization.gpu,utilization.mory,memory.used --format=csv -lms 100 -f out.txt
IIRC, the user function needs to be in the same file unit as the kernel (#included perhaps) in order to nvcc succeed. The func_d is right there in the source and it compiles and runs fine, well besides not working with the function pointer (which is the whole point in this design).
My question is: Why the kernel with the callback device function pointer is not working?
Note that, when I printf noth the callback and func_d addresses, they are the same, as in this sample output:
size of Args = 1
callback() address = 4024b0
func_d() address = 4024b0
Another weird thing is, if one uncomments the callback()
call in kernel()
then GPU usage is back to 0%, even with the func_d()
call still in there... The func_d version takes about 4 seconds to run, whereas the callback version takes nothing (well, ~0.1sec).
System specs and compilation command are in the head of the code below.
Thanks!
// compiled with:
// nvcc -g -G -O0 -std=c++11 -arch=sm_20 -x cu sample.cpp
//
// Nvidia Quadro 6000 (compute capability 2.0)
// CUDA 6.5 (V6.5.12),
// Arch Linux, Nvidia driver 343.22-4, gcc 4.9.1
// Nov, 2014
#include <stdio.h>
__device__
void func_d(double* vol)
{
*vol += 5.4321f;
}
// CUDA kernel function
template <typename... Types>
__global__ void kernel( void (*callback)(Types*...) )
{
double val0 = 1.2345f;
// // does not use gpu (0% gpu utilization)
// for ( int i = 0; i < 1000000; i++ ) {
// callback( &val0 );
// }
// uses gpu (99% gpu utilization)
for ( int i = 0; i < 10000000; i++ ) {
func_d( &val0 );
}
}
// host function
template <typename... Types>
void host_func( void (*callback)(Types*...) )
{
// get user kernel number of arguments.
constexpr int I = sizeof...(Types);
printf("size of Args = %d\n",I);
printf("callback() address = %x\n",callback);
printf("func_d() address = %x\n",func_d);
dim3 nblocks = 100;
int nthread = 100;
kernel<Types...><<<nblocks,nthread>>>( callback );
}
__host__
int main(int argc, char** argv)
{
host_func(func_d);
}