
I have a virtual function which returns a different lambda depending on the derived class:

#include <functional>

class Base
{
public:
    virtual std::function<float()> foo(void) = 0;
};

class Derived : public Base
{
public:
    std::function<float()> foo(void) {
        return [] __device__ (void) {
            return 1.0f;
        };
    }
};

Then I want to pass this lambda to a CUDA kernel and call it from the device. In other words, I want to do this:

template<typename Func>
__global__ void kernel(Func f) {
    f();
}

int main(int argc, char** argv)
{
    Base* obj = new Derived;
    kernel<<<1, 1>>>(obj->foo());
    cudaDeviceSynchronize();
    return 0;
}

The above gives an error like this: calling a __host__ function("std::function<float ()> ::operator ()") from a __global__ function("kernel< ::std::function<float ()> > ") is not allowed

As you can see, I declare my lambda as __device__, but the foo() method stores it in a std::function in order to return it. As a result, what is passed to the kernel() is a host address and of course it does not work. I guess that is my problem, right? So my questions are:

  • Is it somehow possible to create a __device__ std::function and return that from the foo() method?

  • If this is not possible, is there any other way to dynamically select a lambda and pass it to the CUDA kernel? Hard-coding multiple calls to kernel() with all the possible lambdas is not an option.

So far, from the quick research I did, it seems that CUDA does not support the syntax needed to make a function return a device lambda. I just hope I am wrong. :) Any ideas?

Thanks in advance

AstrOne
  • I don't think this is a question of syntax. `std::function` isn't supported on the device, which is the source of the compile error, from what I can see – talonmies May 28 '17 at 19:28
  • You may want to read [this](https://stackoverflow.com/questions/41381254/cuda-c11-array-of-lambdas-function-by-index-not-working/41399827#41399827). I think it is certainly possible to use a `__device__` lambda as a kernel parameter/argument, but you are probably using `std::function` because you want to "genericize" it - you don't like the fact that [every lambda has a unique type](https://stackoverflow.com/questions/7477310/why-cant-i-create-a-vector-of-lambda-in-c11). You won't be able to use `std::function` to work around that, I don't think. It might be easier to use functors for this. – Robert Crovella May 28 '17 at 20:46
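For reference, a minimal sketch of the direct approach described in the comment above - passing the capture-less __device__ lambda straight to the templated kernel instead of wrapping it in std::function - might look like this (assuming nvcc is invoked with --expt-extended-lambda; the names are illustrative only):

#include <cstdio>

template<typename Func>
__global__ void kernel(Func f) {
    printf("%f\n", f());
}

int main() {
    // Capture-less extended __device__ lambda; its (unique) type is
    // deduced by the kernel template, so no type erasure is needed.
    auto f = [] __device__ (void) { return 1.0f; };
    kernel<<<1, 1>>>(f);
    cudaDeviceSynchronize();
    return 0;
}

Each distinct lambda still has its own type, so each one instantiates its own copy of kernel; the sketch only avoids the std::function type erasure, not the per-lambda instantiation.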

2 Answers


Before actually answering, I have to wonder whether your question isn't an XY problem. That is, I am by default skeptical that people have a good excuse for executing code through lambdas/function pointers on the device.

But I won't evade your question like that...

Is it somehow possible to create a __device__ std::function and return that from the foo() method?

Short answer: No, try something else.

Longer answer: If you want to implement a large chunk of the standard library on the device side, then maybe you could have a device-side std::function-like class. But I'm not sure that's even possible (quite possibly not), and anyway - it's beyond the capabilities of everyone except very seasoned library developers. So, do something else.

If this is not possible, is there any other way to dynamically select a lambda and pass it to the CUDA kernel? Hard-coding multiple calls to kernel() with all the possible lambdas is not an option.

First, remember that lambdas are essentially anonymous classes - and thus, if they don't capture anything, they're reducible to function pointers since the anonymous classes have no data, just an operator().

So if the lambdas have the same signature and no capture, you can convert them to a (non-member) function pointer and pass that to the kernel; this definitely works - see this simple example on nVIDIA's forums.
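A rough sketch of that function-pointer route, under the assumption of plain __device__ functions with a common signature (this is not the linked forum example itself; the names one, two, d_one_ptr and d_two_ptr are made up for illustration):

#include <cstdio>

typedef float (*op_t)(void);

__device__ float one(void) { return 1.0f; }
__device__ float two(void) { return 2.0f; }

// Host code cannot take the address of a __device__ function directly,
// so store the pointers in __device__ variables and copy them back.
__device__ op_t d_one_ptr = one;
__device__ op_t d_two_ptr = two;

__global__ void kernel(op_t f) {
    printf("%f\n", f());
}

int main() {
    op_t h_op;
    // Select the operation at run time (here: "one").
    cudaMemcpyFromSymbol(&h_op, d_one_ptr, sizeof(op_t));
    kernel<<<1, 1>>>(h_op);
    cudaDeviceSynchronize();
    return 0;
}

The key detail is the cudaMemcpyFromSymbol step: the selection happens on the host at run time, but the pointer that reaches the kernel is a device function pointer.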

Another possibility is using a run-time mapping from type ids or other such keys to instances of these types - or rather, to their constructors - that is, a factory. But I don't want to get into the details of that, so as not to make this answer longer than it already is; and it's probably not a good idea anyway.

einpoklum

While I don't think you can achieve what you want using virtual functions that return device lambdas, you can achieve something similar by passing a class with a static __device__ member function as the template parameter to your kernel. An example is provided below. Note that the classes in this example could also be structs if you prefer.

#include <iostream>
#include <cstdlib>   // rand, RAND_MAX

// Operation: Element-wise logarithm
class OpLog {
    public:
    __device__ static void foo(int tid, float * x) {
        x[tid] = logf(x[tid]);
    }
};

// Operation: Element-wise exponential
class OpExp {
    public:
    __device__ static void foo(int tid, float * x) {
        x[tid] = expf(x[tid]);
    }
};

// Generic kernel
template < class Op >
__global__ void my_kernel(float * x) {
    int tid = threadIdx.x;
    Op::foo(tid,x);
}

// Driver
int main() {

    using namespace std;

    // length of vector
    int len = 10;

    // generate data
    float * h_x = new float[len];
    for(int i = 0; i < len; i++) {
        h_x[i] = rand()/float(RAND_MAX);
    }

    // inspect data
    cout << "h_x = [";
    for(int j = 0; j < len; j++) {
        cout << h_x[j] << " ";
    }
    cout << "]" << endl;

    // copy onto GPU
    float * d_x;
    cudaMalloc(&d_x, len*sizeof(float));
    cudaMemcpy(d_x, h_x, len*sizeof(float), cudaMemcpyHostToDevice);

    // Take the element-wise logarithm
    my_kernel<OpLog><<<1,len>>>(d_x);

    // get result
    cudaMemcpy(h_x, d_x, len*sizeof(float), cudaMemcpyDeviceToHost);
    cout << "h_x = [";
    for(int j = 0; j < len; j++) {
        cout << h_x[j] << " ";
    }
    cout << "]" << endl;

    // Take the element-wise exponential
    my_kernel<OpExp><<<1,len>>>(d_x);

    // get result
    cudaMemcpy(h_x, d_x, len*sizeof(float), cudaMemcpyDeviceToHost);
    cout << "h_x = [";
    for(int j = 0; j < len; j++) {
        cout << h_x[j] << " ";
    }
    cout << "]" << endl;


    // clean up host and device memory
    delete [] h_x;
    cudaFree(d_x);

    return 0;
}
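One caveat worth noting: the template parameter fixes the operation at compile time, so selecting it at run time still means mapping a run-time key onto the pre-instantiated kernels somewhere, for example with a small (hypothetical) dispatch helper like this sketch:

// Hypothetical run-time dispatch over the compile-time instantiations above.
void run_op(int which, float * d_x, int len) {
    switch (which) {
        case 0: my_kernel<OpLog><<<1,len>>>(d_x); break;
        case 1: my_kernel<OpExp><<<1,len>>>(d_x); break;
    }
}

The switch still enumerates the possible operations, but it lives in exactly one place instead of being spread over every call site.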
tdoublep