function as templated parameter in cuda

Question

I'm attempting to write a reduction function in CUDA (this is an exercise; I know I'm redoing things that other people have done better) which takes a binary associative operator and an array, and reduces the array using the operator.

I'm having difficulty with how to pass the function. I've written hostOp() as a host based example which works fine.

deviceOp() works for the first statement with an explicit call to fminf(), but when I call the function parameter, there is an illegal memory access error.

#include <iostream>
#include <cstdio>
#include <cmath>
using namespace std; //for brevity

__device__  float g_d_a = 9, g_d_b = 5;
float g_h_a = 9, g_h_b = 5;

template<typename argT, typename funcT>
__global__
void deviceOp(funcT op){    
    argT result = fminf(g_d_a, g_d_b);                  //works fine
    printf("static function result: %f\n", result);
    result = op(g_d_a,g_d_b);                           //illegal memory access
    printf("template function result: %f\n", result);
}

template<typename argT, typename funcT>                 
void hostOp(funcT op){
    argT result = op(g_h_a, g_h_b);
    printf("template function result: %f\n", result);
}

int main(int argc, char* argv[]){
    hostOp<float>(min<float>);                          //works fine
    deviceOp<float><<<1,1>>>(fminf);

    cudaDeviceSynchronize(); 
    cout<<cudaGetErrorString(cudaGetLastError())<<endl;
}

OUTPUT:

template function result: 5.000000
static function result: 5.000000
an illegal memory access was encountered

Assuming I'm not doing something horribly stupid, how should I be passing fminf to deviceOp so that there isn't an illegal memory access?

If I am doing something horribly stupid, what is a better way?

score 1 · Accepted Answer · edited May 23 '17 at 10:28

A function to be called on the device must be decorated with __device__ (or __global__, if you wish it to be a kernel). The nvcc compiler driver will then separate host and device code, and will use the device-compiled version of the function when it is called from (i.e. compiled in) device code, and the host version otherwise.
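As a minimal sketch of that separation (the names `smaller` and `showDevice` are hypothetical, not from the question), a function marked `__host__ __device__` is compiled twice, and each call site uses the version that matches the code it sits in:

```cuda
#include <cstdio>

// Hypothetical helper: compiled once for the host and once for the device.
__host__ __device__ float smaller(float a, float b) {
    return a < b ? a : b;
}

// Device code: this call resolves to the device-compiled version of smaller().
__global__ void showDevice() {
    printf("device result: %f\n", smaller(9.0f, 5.0f));
}

int main() {
    // Host code: this call resolves to the host-compiled version of smaller().
    printf("host result: %f\n", smaller(9.0f, 5.0f));

    showDevice<<<1, 1>>>();
    cudaError_t err = cudaDeviceSynchronize();
    printf("%s\n", cudaGetErrorString(err));
    return 0;
}
```

Calling the function by name is safe in both places. The trouble starts only when host code takes the function's address and hands that address to the device.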

This construct is problematic:

deviceOp<float><<<1,1>>>(fminf);

While it may not be obvious, this is essentially all host code. Yes, it is launching a kernel (via an underlying sequence of library calls from host code), but it is technically host code. Therefore the fminf function address "captured" here will be the host version of the fminf function, even though a device version is available (via CUDA math.h, which you are not actually including).

A typical albeit clumsy approach to work around this is to "capture" the device address in device code, then pass it as a parameter to your kernel.

You can also short-circuit this process (somewhat) if you are passing function addresses that can be deduced at compile time, with a slightly different templating technique. These concepts are covered in this answer.
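A rough sketch of the compile-time variant might look like the following. It assumes the operator is known at compile time and wraps `fminf` in a user-defined `__device__` function; the names `min_op` and `deviceOpStatic` are hypothetical. The function is passed as a non-type template argument rather than as a runtime kernel argument, so no host-side function address ever reaches the device:

```cuda
#include <cstdio>

// Hypothetical wrapper so that the operator is an ordinary user-defined
// __device__ function that can be named as a template argument.
__device__ float min_op(float a, float b) {
    return fminf(a, b);
}

// The operator is a template parameter, resolved when the kernel is
// instantiated, instead of a function pointer passed at launch time.
template <typename T, T (*op)(T, T)>
__global__ void deviceOpStatic(T a, T b) {
    T result = op(a, b);
    printf("compile-time function result: %f\n", result);
}

int main() {
    deviceOpStatic<float, min_op><<<1, 1>>>(9.0f, 5.0f);
    cudaError_t err = cudaDeviceSynchronize();
    printf("%s\n", cudaGetErrorString(err));
    return 0;
}
```

The trade-off is that each operator produces a separate kernel instantiation, and the operator cannot be chosen at run time without some form of dispatch on the host.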

Here is a fully worked example of your code modified using the "capture function address in device code" method:

$ cat t1176.cu
#include <iostream>
#include <cstdio>
#include <cmath>
using namespace std; //for brevity

__device__  float g_d_a = 9, g_d_b = 5;
float g_h_a = 9, g_h_b = 5;

template<typename argT, typename funcT>
__global__
void deviceOp(funcT op){
    argT result = fminf(g_d_a, g_d_b);                  //works fine
    printf("static function result: %f\n", result);
    result = op(g_d_a,g_d_b);                           //now works: op holds a device function address
    printf("template function result: %f\n", result);
}

__device__ float (*my_fminf)(float, float) = fminf;  // "capture" device function address

template<typename argT, typename funcT>
void hostOp(funcT op){
    argT result = op(g_h_a, g_h_b);
    printf("template function result: %f\n", result);
}

int main(int argc, char* argv[]){
    hostOp<float>(min<float>);                          //works fine
    float (*h_fminf)(float, float);
    cudaMemcpyFromSymbol(&h_fminf, my_fminf, sizeof(void *));
    deviceOp<float><<<1,1>>>(h_fminf);

    cudaDeviceSynchronize();
    cout<<cudaGetErrorString(cudaGetLastError())<<endl;
}
$ nvcc -o t1176 t1176.cu
$ cuda-memcheck ./t1176
========= CUDA-MEMCHECK
template function result: 5.000000
static function result: 5.000000
template function result: 5.000000
no error
========= ERROR SUMMARY: 0 errors
$

Hm. So when I call deviceOp from the host, it is called as host code, but if I were to call it from the device, it would be device code (and would know where fminf is, as it apparently does)? - Daniel B., Jun 30 '16 at 21:32

`deviceOp` is a `__global__` function. It is not actually called from the host (in the way that an ordinary host function would be called from host code); it is a kernel launch, which is accomplished via a sequence of library calls that launch the function on the device. If you did launch `deviceOp` from device code (i.e. dynamic parallelism), then that particular kernel launch would be able to pick up the device `fminf` function address directly, and in fact the kernel launch itself would be compiled as device code. - Robert Crovella, Jun 30 '16 at 21:38

The direct `fminf` call that is explicit in `deviceOp` is in fact device code. Anything within the scope of a `__global__` or `__device__` qualified function is device code, and is compiled by the device code compiler. But the **launch** of `deviceOp` from your `main` routine is actually all host code. - Robert Crovella, Jun 30 '16 at 21:40
