1

I have the following minimal non-working example:

#include <cstdio>
#include <cuda_runtime_api.h>

/* this declaration would normally be in a header, but it doesn't matter */
template<class T_PREC> __global__ void testKernel( T_PREC );

template<class T_PREC> __global__ void testKernel( T_PREC const x )
{
    printf( "%f", x );
}

int main()
{
    printf("calling kernel...");
    testKernel<<<1,1>>>( 3.0f );
    cudaDeviceSynchronize();
    printf("OK\n");
    return 0;
}

which I compile and run with

nvcc simple.cu && ./a.out

The output is:

calling kernel...

meaning the program crashes before it can print neither the "OK" nor the floating point number. It isn't a segmentation fault, so I can't backtrace anything. I'm using CUDA 7.0.27. When running this in gdb the message is:

[Inferior 1 (process 27899) exited with code 01]

There are four things which make the above example work:

  • don't use CUDA:

    template<class T_PREC> void testKernel( T_PREC );
    
    template<class T_PREC> void testKernel( T_PREC const x )
    {
        printf( "%f", x );
    }
    
    int main()
    {
        printf("calling kernel...");
        testKernel( 3.0f );
        cudaDeviceSynchronize();
        printf("OK\n");
        return 0;
    }
    
  • don't use templates:

    __global__ void testKernel( float );
    
    __global__ void testKernel( float const x )
    {
        printf( "%f", x );
    }
    
  • omit the declaration (not an option if I need to make it available from a library)

    //template<class T_PREC> void testKernel( T_PREC );
    
  • don't omit the const qualifier in the declaration:

    template<class T_PREC> __global__ void testKernel( T_PREC const );
    

    this is the most reasonable option, but I don't see why I have to. In normal C++ a const qualifier for a call-by-value shouldn't change the function signature. And even if it did, it shouldn't link and only crash on execution. So why does CUDA behave differently and why only with templates?


Additional examinations:

As the assembler code itself is too hard for me, I looked at the executable created:

nvcc sameTypeQualifier/main.cu -o same.o
    [no warning output whatsoever]

nvcc diffTypeQualifier/main.cu -o diff.o
    diffTypeQualifier/main.cu: In instantiation of ‘void __wrapper__device_stub_testKernel(T_PREC* const&) [with T_PREC = float]’:
    diffTypeQualifier/main.cu:8:45:   required from ‘void testKernel(T_PREC*) [with T_PREC = float]’
    diffTypeQualifier/main.cu:15:67:   required from here
    diffTypeQualifier/main.cu:7:86: warning: unused parameter ‘x’ [-Wunused-parameter]
             template<class T_PREC> __global__ void testKernel( T_PREC * const x )
                                                                                          ^
 diff <(nm -C same.o | sed 's/^[0-9a-f]*//') <(nm -C diff.o | sed 's/^[0-9a-f]*//')
    389a390
    >  t void __wrapper__device_stub_testKernel<float>(float*&)
    419c420
    <  t __sti____cudaRegisterAll_39_tmpxft_000050c8_00000000_9_main_cpp1_ii_main()
    ---
    >  t __sti____cudaRegisterAll_39_tmpxft_0000511c_00000000_9_main_cpp1_ii_main()

The hexadecimal number in cudaRegisterAll which differs even differs between two compiles of identical source codes, so it can be ignored. Meaning for some reason the non-working example has an additional wrapper-function.

mxmlnkn
  • 1,887
  • 1
  • 19
  • 26
  • 1
    It seems my observation (CUDA 7.5, MSVS 2010) is slightly different from yours: As long as the template declaration and the template definition match in their use of `const` (that is, either both have it or both don't have it), the kernel is invoked and prints the passed value as expected. I am not enough of a C++ language lawyer to say what should happen here; potentially this could also be a bug in the CUDA compiler but I am not willing to say so yet since I am not sure whether undefined behavior may be invoked by the code when `const` doesn't match between declaration and definition. – njuffa Jan 31 '16 at 00:35
  • @njuffa: I think these answers are related: http://stackoverflow.com/questions/18215686/compatible-types-and-argument-type-qualifiers Btw: the same problem happens with the `__restrict__` keyword on `float *`. It only works if the declaration also has it. – mxmlnkn Jan 31 '16 at 09:31
  • Note that question refers to C, while CUDA is a variant of C++. While C and C++ are closely related, there are numerous differences between them, – njuffa Jan 31 '16 at 15:31

2 Answers2

1

IMHO you are just misleading the compiler and that leads to undefined behavior. Actually (and I hope that someone with much deeper C++ knowledge will comment) you are telling the compiler that there will be a function accepting an int with your declaration and then making the compiler generate a function that accepts a constant int. From my point of view the compiler should tell you - no no no, there is an ambiguity here, I have an unresolved symbol and fail.

Why? Well, one possible scenario would be that due to some weird optimizations, since the definition takes a constant int that will not be modified and thus needs no storage it could be reduced to a compile time constant. On the other hand the code that is generated to make the call assumes it must pass it. I am not sure if it is so (and the best way to understand would be to disassemble this example) but I think such example is enough to doubt the correctness of the example itself.

Why not just keep the declaration and definition identical as you would do in any other case?

Rudolfs Bundulis
  • 11,636
  • 6
  • 33
  • 71
  • my intent was to strip the interface of unnecessary type qualifiers in order to clean it. `const` was only an example, but the problem also appears for `__restrict__` and `volatile`. But these type qualifiers should only be relevant to the implementation, for compiler error checking and optimization. It shouldn't have any meaning for the interface i.e. the declaration, at least in C that is: http://stackoverflow.com/questions/18215686/compatible-types-and-argument-type-qualifiers – mxmlnkn Jan 31 '16 at 09:47
  • @mxmlnkn why do you perceive the `const` qualifier as a sign of dirtiness? :) Again, as I said - the answer could be found out by examining the assembly. – Rudolfs Bundulis Jan 31 '16 at 22:29
0

When comparing the intermediate files something interesting can be discovered:

nvcc --keep [...]
colordiff -r c/ nc/

[...]
diff c/main.cu.cpp.ii nc/main.cu.cpp.ii
32767c32767
< template< class T_PREC> static void __wrapper__device_stub_testKernel(const T_PREC &); template< class T_PREC> void testKernel(const T_PREC);
---
> template< class T_PREC> static void __wrapper__device_stub_testKernel(T_PREC &); template< class T_PREC> void testKernel(T_PREC);
[...]
diff c/main.cudafe1.cpp nc/main.cudafe1.cpp
70764c70764
< template< class T_PREC> static void __wrapper__device_stub_testKernel(const T_PREC &); template< class T_PREC> void testKernel(const T_PREC);
---
> template< class T_PREC> static void __wrapper__device_stub_testKernel(T_PREC &); template< class T_PREC> void testKernel(T_PREC);
[...]

I stripped lines from the diff which were only differences like (const float) instead of (float)

It seems to me like there is a bug in nvcc when creating the intermediate wrapper functions for templated declarations. Because the types are copy pasted and changed to a call-by-reference the kernels itself may be identical, but not the wrapper call, because it makes a difference if it is a const call-by-reference or a non-const call-by-reference. Furthermore in my opinion it is a bug that for a mere declaration a wrapper call is created in the first place.

Here is a C++ example demonstrating the problem happening:

#include<cstdio>

void f( float const & x ) { printf( "float const &\n", x ); }
void f( float       & x ) { printf( "float &\n", x ); }

int main( void )
{
    f( 3.0 );
    float x = 3.0;
    f( x );
}

The output of the program is:

float const &
float &

When grepping for the wrapper function we find how the overloaded functions are defined and called:

grep -C20 '__wrapper__device_stub_testKernel' nc/main.cu.cpp.ii

and the output:

# 4 "main.cu"
template< class T_PREC> static void __wrapper__device_stub_testKernel(T_PREC &);
template< class T_PREC> void testKernel(T_PREC);
# 5 "main.cu"
template< class T_PREC> static void __wrapper__device_stub_testKernel(const T_PREC &x)
{
    exit(1);
}
# 5 "main.cu"
template< class T_PREC> void testKernel(const T_PREC x)
{
    # 6 "main.cu"
    __wrapper__device_stub_testKernel<T_PREC>(x);
    # 8 "main.cu"
    return;
}
# 10 "main.cu"
int main()
# 11 "main.cu"
{
    # 12 "main.cu"
    printf("calling kernel...");
    # 13 "main.cu"
    (cudaConfigureCall(1, 1)) ? (void)0 : (testKernel)((3.0F));
    # 14 "main.cu"
    cudaDeviceSynchronize();
    # 15 "main.cu"
    printf("OK\n");
    # 16 "main.cu"
    return 0;
    # 17 "main.cu"
}
[...]
static void __device_stub__Z10testKernelIfEvT_(float __par0)
{
    if (cudaSetupArgument((void *)(char *)&__par0, sizeof(__par0), (size_t)0UL) != cudaSuccess) 
        return;
    {
        volatile static char *__f __attribute__((unused));
        __f = ((char *)( (void ( *)(float))testKernel<float> ) );
        (void)cudaLaunch( ((char *)((void ( *)(float))testKernel<float> )) );
    };
}
[...]
template<> void __wrapper__device_stub_testKernel<float>( float &__cuda_0)
{
    __device_stub__Z10testKernelIfEvT_( __cuda_0);
}

(Note: I added some indentations and line breaks for better readability)

So while the non-const call-by-reference function calls presumable the kernel, the const call-by-reference overloaded function calls exit(1).

For some reason additional to the bug with the conversion of a const copy-by-value to a const call-by-reference, it seems nvcc mixed up the "two" kernel in the original file. The non-const declaration is translated to a wrapper function calling __device_stub__Z10testKernelIfEvT_ while the function definition with the const copy-by-value argument is translated to a wrapper calling exit(1).

Unfortunately I'm not a CUDA developer so I can't file a bug report, but maybe I convinced anyone to do it for me. Or maybe some feedback comments prove my answer to be wrong.

mxmlnkn
  • 1,887
  • 1
  • 19
  • 26