compute capability and calling a kernel from a kernel

Question

To find out my compute capability, I wrote this small program:

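#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    int count = 0;
    cudaGetDeviceCount(&count);              // number of CUDA devices present
    for (int i = 0; i < count; i++) {
        cudaGetDeviceProperties(&prop, i);   // fill prop for device i
        printf("Compute capability:  %d.%d\n", prop.major, prop.minor);
    }
    return 0;
}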

and that prints 3.5 for all my GPUs.

Now I am trying to compile the following toy program (with nvcc -c):

__global__ void add_device(float *a, float *b, float *c, int n) 
{
    int i = blockIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

__global__ void add_kernel(float *a, float *b, float *c, int n) 
{
    add_device(a, b, c, n);
}

void gpu_add(float *a, float *b, float *c, int n) 
{
     add_kernel<<<n, 1>>>( a, b, c, n );
}

but the compilation fails with the following error:

calling a __global__ function("add_device") from a __global__ function("add_kernel") is only allowed on the compute_35 architecture or above

What am I doing wrong?

The default compilation architecture for CUDA is sm_20. You need to manually specify the correct architecture to the compiler — talonmies, Oct 21 '15 at 11:05
Thanks. Now I use nvcc --gpu-architecture=compute_35 -c and I get another error: kernel launch from __device__ or __global__ functions requires separate compilation mode — Alberto Contador, Oct 21 '15 at 11:12
Please read the answers in the linked duplicate. How to compile with the correct architecture and with separate compilation mode enabled is shown there — talonmies, Oct 21 '15 at 11:18
Thanks a lot. Your help is much appreciated. I tried that thread, but got some link errors (undefined reference ..). It worked after adding -lcudadevrt at the end of the C++ compilation command, as explained here: http://stackoverflow.com/questions/22115197/dynamic-parallelism-undefined-reference-to-cudaregisterlinkedbinary-linking I am posting this here in case it is useful to other people. If you do not mind, could you replace the link you gave with the new one? Thanks a lot — Alberto Contador, Oct 21 '15 at 12:18
The CUDA toolkit install provides [sample codes](http://docs.nvidia.com/cuda/cuda-samples/index.html#abstract). Those sample codes include makefiles or visual studio solution files. You can also take a look at any sample code that uses dynamic parallelism, and look at the makefile or visual studio solution to find the correct setup for compiling a dynamic parallelism code. — Robert Crovella, Oct 21 '15 at 13:30
@AlbertoContador: The top voted answer in the duplicate includes linking of the device runtime library and provides examples both for single and multiple file compilation using the separate compilation and linkage model. I don't think anything is missing from that answer. — talonmies, Oct 21 '15 at 13:51
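
Pulling the comments above together, here is a minimal sketch of how the toy program might look with a genuine device-side launch. It is not taken from the linked answers. The file name add.cu is hypothetical, launching the child grid from a single parent thread is just one possible choice, and the build line in the comment assumes a toolkit that still supports sm_35 (otherwise substitute the architecture matching your GPU).

```cuda
// Possible single-file build, using the flags discussed in the comments
// (add.cu is a hypothetical file name):
//   nvcc -arch=sm_35 -rdc=true add.cu -o add -lcudadevrt
#include <cstdio>
#include <cuda_runtime.h>

// Child kernel: one block per element, as in the question.
__global__ void add_device(float *a, float *b, float *c, int n)
{
    int i = blockIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

// Parent kernel: a kernel called from a kernel still needs a launch
// configuration. Only one thread launches the child grid here, so the
// addition is performed once.
__global__ void add_kernel(float *a, float *b, float *c, int n)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        add_device<<<n, 1>>>(a, b, c, n);
    }
}

// Host wrapper: a, b and c are assumed to be device pointers.
void gpu_add(float *a, float *b, float *c, int n)
{
    add_kernel<<<1, 1>>>(a, b, c, n);
    cudaDeviceSynchronize();  // wait for the parent grid and its child grid
}
```

If the goal is only to reuse the addition code inside another kernel, declaring add_device as __device__ instead of __global__ avoids dynamic parallelism entirely, and then neither -rdc=true nor -lcudadevrt is needed.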
