
I am having trouble making a CUDA program manage an array of lambdas by their index. Here is example code that reproduces the problem:

 #include <cuda.h>
 #include <vector>
 #include <stdio.h>
 #include <stdlib.h>
 #include <time.h>
 #include <sys/time.h>
 #include <cassert>

 #define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
 inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true){
     if (code != cudaSuccess) {
         fprintf(stderr,"GPUassert: %s %s %d\n",
         cudaGetErrorString(code), file, line);
         if (abort) exit(code);
     }   
 }

 template<typename Lambda>
 __global__ void kernel(Lambda f){ 
     int t = blockIdx.x * blockDim.x + threadIdx.x;
     printf("device: thread %i: ", t); 
     printf("f() = %i\n", f() );
 }

 int main(int argc, char **argv){
     // arguments
     if(argc != 2){ 
         fprintf(stderr, "run as ./prog i\nwhere 'i' is function index");
         exit(EXIT_FAILURE);
     }   
     int i = atoi(argv[1]);


     // lambdas
     auto lam0 = [] __host__ __device__ (){ return 333; };
     auto lam1 = [] __host__ __device__ (){ return 777; };


     // make vector of functions
     std::vector<int(*)()> v;
     v.push_back(lam0);
     v.push_back(lam1);


     // host: calling a function by index
     printf("host: f() = %i\n", (*v[i])() );


     // device: calling a function by index
     kernel<<< 1, 1 >>>( v[i] ); // does not work
     //kernel<<< 1, 1 >>>( lam0 ); // does work
     gpuErrchk( cudaPeekAtLastError() );
     gpuErrchk( cudaDeviceSynchronize() );
     return EXIT_SUCCESS;
 }

Compiling with

nvcc -arch sm_60 -std=c++11 --expt-extended-lambda main.cu -o prog

The error I get when running is

➜  cuda-lambda ./prog 0
host: f() = 333
device: GPUassert: invalid program counter main.cu 53

It seems that CUDA cannot handle the int(*)() function pointer form (while host C++ works properly). On the other hand, each lambda is treated as a distinct data type, even when they are identical in code and have the same signature. How, then, can we achieve function-by-index in CUDA?

  • Your code, if working, would cause alternative paths (which couldn't be inlined), which isn't desired behaviour when dealing with GPUs. Maybe instead you could create an array of kernels/kernel calls with the lambda values set at compile time? – W.F. Dec 29 '16 at 14:20
  • Will take a look at that, assuming it is possible to create an array of __global__ lambdas. – Cristobal Navarro Dec 29 '16 at 14:31
  • I'm pretty sure you are relying on some static compiler analysis magic within the CUDA parser which breaks when the lambda is put in a container. – talonmies Dec 29 '16 at 14:41
  • M. Harris has written some answers on the Nvidia blog indicating that CUDA does not have all the capabilities a host C++ program would have regarding lambdas. In that case, this issue could be one of the designs not suited to GPU computing, for now. – Cristobal Navarro Dec 29 '16 at 17:11
  • You can use function pointers in CUDA kernels, but not in the way you're trying to do here. The crux is that you can't get a pointer to a device function directly from host code, which is necessary in order to make your scheme work. You need to jump through some hoops to make it work, which are [detailed in the answer to this question](https://stackoverflow.com/questions/15644261/cuda-function-pointers). Note that the example there doesn't use lambdas, but the same principles apply. – Jason R Dec 30 '16 at 04:29
  • @JasonR: You are correct regarding the function pointer observation, but I don't think the standard solution you link to is applicable in this case. What is happening is that the lambda expression, which is notionally a *type*, is being degraded to a function pointer, which is legal in C++11. But I don't believe the CUDA toolkit generates a traditional API accessible symbol for a lambda defined like this, so there is no way to get the device pointer which would make this work. – talonmies Dec 30 '16 at 09:28
  • I took a stab at a CW answer. Feel free to edit. – Robert Crovella Dec 30 '16 at 17:00

1 Answer


There are a few considerations here.

Although you suggest wanting to "manage an array of lambdas", you are actually relying on the graceful conversion of a lambda to a function pointer (possible when the lambda does not capture).
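
A quick host-only sketch of that conversion (my illustration, not part of the original code):

 // a capture-less lambda converts implicitly to a plain function pointer;
 // a capturing lambda does not
 int main(){
     int (*fp)() = [](){ return 333; }; // OK: no captures
     return (fp() == 333) ? 0 : 1;
     // int x = 5;
     // int (*bad)() = [x](){ return x; }; // would not compile: captures x
 }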

When you mark something as __host__ __device__, you are declaring to the compiler that two copies of said item need to be compiled (with two obviously different entry points): one for the CPU, and one for the GPU.

When we take a __host__ __device__ lambda and ask it to degrade to a function pointer, we are left with the question "which function pointer (entry point) to choose?" The compiler no longer has the option of carrying the lambda object around, and so it must choose one or the other (host or device, CPU or GPU) for your vector. Whichever one it chooses, the vector could (will) break if used in the wrong environment.

One takeaway from this is that your two test cases are not the same. In one case (broken) you are passing a function pointer to the kernel (so the kernel is templated to accept a function pointer argument) and in the other case (working) you are passing a lambda to the kernel (so the kernel is templated to accept a lambda argument).
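
To make the type difference concrete, here is a small host-only sketch (mine, not from the original post):

 #include <type_traits>

 int main(){
     auto lam0 = [](){ return 333; };
     int (*fp)() = lam0; // the element type stored in the failing vector
     // the closure type is distinct from the function pointer type, so the
     // two launches instantiate kernel<Lambda> with different Lambda types
     static_assert(!std::is_same<decltype(lam0), decltype(fp)>::value,
                   "closure type != function pointer type");
     return 0;
 }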

The problem here, in my view, is not simply arising out of the use of a container, but arising out of the type of container you are using. I can demonstrate this in a simple way (see below) by converting your vector to a vector of the actual lambda type. In that case, we can make the code "work" (sort of), but since every lambda has a unique type, this is an uninteresting demonstration. We can create a multi-element vector, but the only element we can store in it is one of your two lambdas (not both at the same time).

If we use a container that can handle dissimilar types (e.g. std::tuple), perhaps we can make some progress here, but I know of no direct method to index through the elements of such a container at run time. Even if we could, the template kernel accepting a lambda as argument/template type would have to be instantiated for each lambda.
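
For what it's worth, here is a sketch of the std::tuple idea; I am assuming (untested) that the extended lambdas survive the round-trip through the tuple. Note that std::get needs a compile-time index, so the runtime selection has to be spelled out case by case:

 #include <tuple>
 #include <cstdio>
 #include <cstdlib>

 template <typename Lambda>
 __global__ void kernel(Lambda f){ printf("f() = %i\n", f()); }

 int main(int argc, char **argv){
     int i = atoi(argv[1]);
     auto lam0 = [] __host__ __device__ (){ return 333; };
     auto lam1 = [] __host__ __device__ (){ return 777; };
     auto t = std::make_tuple(lam0, lam1); // holds both distinct closure types
     // each case instantiates the kernel template for one closure type
     switch(i){
         case 0: kernel<<< 1, 1 >>>(std::get<0>(t)); break;
         case 1: kernel<<< 1, 1 >>>(std::get<1>(t)); break;
     }
     cudaDeviceSynchronize();
     return 0;
 }

The obvious drawback is that the dispatch code (and the number of kernel instantiations) grows with the number of lambdas.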

In my view, function pointers avoid this particular type "messiness".

Therefore, as an answer to this question:

Then, how can we achieve function by index in CUDA?

I would suggest, for the time being, that function-by-index in host code be kept separate (e.g. two separate containers) from function-by-index in device code, and that for function-by-index in device code you use any of the techniques (which don't use or depend on lambdas) covered in other questions, such as this one.
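
For completeness, a minimal sketch (mine, not from the linked question, but using the same principle) of the device-function-pointer technique: the addresses of __device__ functions are taken in device code, so the table entries are valid device addresses and can be indexed at run time:

 #include <cstdio>
 #include <cstdlib>

 typedef int (*func_t)();

 __device__ int f0(){ return 333; }
 __device__ int f1(){ return 777; }

 // device-side table of device function pointers
 __device__ func_t d_table[2] = { f0, f1 };

 __global__ void kernel2(int i){
     printf("device: f() = %i\n", d_table[i]());
 }

 int main(int argc, char **argv){
     int i = atoi(argv[1]);
     kernel2<<< 1, 1 >>>(i); // the index is resolved on the device
     cudaDeviceSynchronize();
     return 0;
 }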

Here is a worked example (I think) demonstrating the note above, that we can create a vector of lambda "type", and use the resultant element(s) from that vector as lambdas in both host and device code:

$ cat t64.cu
 #include <cuda.h>
 #include <vector>
 #include <stdio.h>
 #include <stdlib.h>
 #include <time.h>
 #include <sys/time.h>
 #include <cassert>

 #define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
 inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true){
     if (code != cudaSuccess) {
         fprintf(stderr,"GPUassert: %s %s %d\n",
         cudaGetErrorString(code), file, line);
         if (abort) exit(code);
     }
 }


 template<typename Lambda>
 __global__ void kernel(Lambda f){
     int t = blockIdx.x * blockDim.x + threadIdx.x;
     printf("device: thread %i: ", t);
     printf("f() = %i\n", f() );
 }

 template <typename T>
 std::vector<T> fill(T L0, T L1){
   std::vector<T> v;
   v.push_back(L0);
   v.push_back(L1);
   return v;
}

 int main(int argc, char **argv){
     // arguments
     if(argc != 2){
         fprintf(stderr, "run as ./prog i\nwhere 'i' is function index");
         exit(EXIT_FAILURE);
     }
     int i = atoi(argv[1]);


     // lambdas
     auto lam0 = [] __host__ __device__ (){ return 333; };
     auto lam1 = [] __host__ __device__ (){ return 777; };

     auto v = fill(lam0, lam0); // deliberately lam0 twice: the vector can hold only one closure type

     // make vector of functions
 //    std::vector< int(*)()> v;
 //    v.push_back(lam0);
 //    v.push_back(lam1);


     // host: calling a function by index
     printf("host: f() = %i\n", (*v[i])() );


     // device: calling a function by index
     kernel<<< 1, 1 >>>( v[i] ); // now works: v[i] is a lambda, not a function pointer
     //kernel<<< 1, 1 >>>( lam0 ); // does work
     gpuErrchk( cudaPeekAtLastError() );
     gpuErrchk( cudaDeviceSynchronize() );
     return EXIT_SUCCESS;
 }

$ nvcc -arch sm_61 -std=c++11 --expt-extended-lambda t64.cu -o t64
$ cuda-memcheck ./t64 0
========= CUDA-MEMCHECK
host: f() = 333
device: thread 0: f() = 333
========= ERROR SUMMARY: 0 errors
$ cuda-memcheck ./t64 1
========= CUDA-MEMCHECK
host: f() = 333
device: thread 0: f() = 333
========= ERROR SUMMARY: 0 errors
$

As mentioned above, this is not sensible code. It is presented to prove a particular point.

Robert Crovella
  • Many thanks. The other option I was trying was to use only __device__-defined lambdas, but the compiler fails to convert the lambda to the int(*)() element type of the vector. I will take your advice then, as it can still satisfy the design I was planning. – Cristobal Navarro Dec 30 '16 at 18:41
  • This version works because the lambdas are not degraded to function pointers within the vector. Nicely done. – talonmies Dec 31 '16 at 10:30
  • Indeed, but the approach can only handle copies of the same lambda, to maintain the unique type. – Cristobal Navarro Jan 03 '17 at 18:45