I have the following minimal non-working example:
#include <cstdio>
#include <cuda_runtime_api.h>
/* this declaration would normally be in a header, but it doesn't matter */
template<class T_PREC> __global__ void testKernel( T_PREC );
template<class T_PREC> __global__ void testKernel( T_PREC const x )
{
printf( "%f", x );
}
int main()
{
printf("calling kernel...");
testKernel<<<1,1>>>( 3.0f );
cudaDeviceSynchronize();
printf("OK\n");
return 0;
}
which I compile and run with
nvcc simple.cu && ./a.out
The output is:
calling kernel...
meaning the program crashes before it can print neither the "OK" nor the floating point number. It isn't a segmentation fault, so I can't backtrace anything. I'm using CUDA 7.0.27. When running this in gdb
the message is:
[Inferior 1 (process 27899) exited with code 01]
There are four things which make the above example work:
don't use CUDA:
template<class T_PREC> void testKernel( T_PREC ); template<class T_PREC> void testKernel( T_PREC const x ) { printf( "%f", x ); } int main() { printf("calling kernel..."); testKernel( 3.0f ); cudaDeviceSynchronize(); printf("OK\n"); return 0; }
don't use templates:
__global__ void testKernel( float ); __global__ void testKernel( float const x ) { printf( "%f", x ); }
omit the declaration (not an option if I need to make it available from a library)
//template<class T_PREC> void testKernel( T_PREC );
don't omit the
const
qualifier in the declaration:template<class T_PREC> __global__ void testKernel( T_PREC const );
this is the most reasonable option, but I don't see why I have to. In normal C++ a
const
qualifier for a call-by-value shouldn't change the function signature. And even if it did, it shouldn't link and only crash on execution. So why does CUDA behave differently and why only with templates?
Additional examinations:
As the assembler code itself is too hard for me, I looked at the executable created:
nvcc sameTypeQualifier/main.cu -o same.o
[no warning output whatsoever]
nvcc diffTypeQualifier/main.cu -o diff.o
diffTypeQualifier/main.cu: In instantiation of ‘void __wrapper__device_stub_testKernel(T_PREC* const&) [with T_PREC = float]’:
diffTypeQualifier/main.cu:8:45: required from ‘void testKernel(T_PREC*) [with T_PREC = float]’
diffTypeQualifier/main.cu:15:67: required from here
diffTypeQualifier/main.cu:7:86: warning: unused parameter ‘x’ [-Wunused-parameter]
template<class T_PREC> __global__ void testKernel( T_PREC * const x )
^
diff <(nm -C same.o | sed 's/^[0-9a-f]*//') <(nm -C diff.o | sed 's/^[0-9a-f]*//')
389a390
> t void __wrapper__device_stub_testKernel<float>(float*&)
419c420
< t __sti____cudaRegisterAll_39_tmpxft_000050c8_00000000_9_main_cpp1_ii_main()
---
> t __sti____cudaRegisterAll_39_tmpxft_0000511c_00000000_9_main_cpp1_ii_main()
The hexadecimal number in cudaRegisterAll which differs even differs between two compiles of identical source codes, so it can be ignored. Meaning for some reason the non-working example has an additional wrapper-function.