
I've reduced my project down to just the relevant code. The part of this that is really bothering me is that it does not produce any errors. Anyway, I have a struct GpuData:

struct GpuData { float x, y, z; };

My goal is to launch a kernel against this struct that takes a function and applies it to the struct. So let's look at an example kernel:

__global__ void StructFunctor(GpuData* in_dat, nvstd::function<float(void)> func) {
    in_dat->x = func();
    in_dat->y += float{1};
};

In this case, the kernel is reduced to something very simple. It will set the x value to the result of the function. It will then add 1 to the y value.

So let's try it. A complete source file (cuda_demo.cu):

#include <iostream>
#include <nvfunctional>

struct GpuData { float x, y, z; };

__global__ void StructFunctor(GpuData* in_dat, nvstd::function<float(void)> func) {
    in_dat->x = func();
    in_dat->y += float{1};
};

int main(int argc, char** argv) {
    GpuData c_dat {2, 3, 5};
    std::cout << "Input x: " << c_dat.x << " y: " << c_dat.y << " z: " << c_dat.z << std::endl;

    GpuData* g_dat;
    cudaMalloc(&g_dat, sizeof(GpuData));
    cudaMemcpy(g_dat, &c_dat, sizeof(GpuData), cudaMemcpyHostToDevice);

    StructFunctor<<<1, 1>>>(g_dat, []()->float{return 1.0f;});

    cudaMemcpy(&c_dat, g_dat, sizeof(GpuData), cudaMemcpyDeviceToHost);
    std::cout << "Output x: " << c_dat.x << " y: " << c_dat.y << " z: " << c_dat.z << std::endl;
    return 0;
}

Well, if we are actually going to try it, we will need the CMake files. I've tacked those on at the end.

On my machine it compiles and runs without errors. Here is my output:

./CudaDemo
Input x: 2 y: 3 z: 5
Output x: 2 y: 3 z: 5

The y variable has not been modified at all! But if I go back and comment out `in_dat->x = func();`, then I get this output:

./CudaDemo
Input x: 2 y: 3 z: 5
Output x: 2 y: 4 z: 5

Now the y value has been modified! That's a good start, but why does the GPU memory become immutable when I try to use the function? I presume this is some sort of error, but it compiles and runs without warnings or errors.

Now, as promised, the CMake files to run this:

cmake_minimum_required(VERSION 3.8)
project(Temp LANGUAGES CXX CUDA)
set(CMAKE_CUDA_STANDARD 14)
add_executable(CudaDemo cuda_demo.cu)
set_property(TARGET CudaDemo PROPERTY CUDA_SEPARABLE_COMPILATION ON)
  • "it compiles and runs without warnings or errors." Since you're not doing any proper CUDA error checking, you wouldn't know about any CUDA errors. If you run your code with `cuda-memcheck` you will see that it is throwing errors. If you turn on the memory checker functionality in VS and run it there, you may witness the same. – Robert Crovella Aug 21 '18 at 17:06
  • Your lambda definition is in host code. Therefore it is a host-code lambda, and it is not callable on the device. You'll need to learn how to use a device code lambda. – Robert Crovella Aug 21 '18 at 17:10
  • @RobertCrovella How would I create a device code lambda and pass it through the `__global__` method? Also, I'm running on the Linux command line, so I don't have access to the Visual Studio debugging help. – esdanol Aug 21 '18 at 17:11
  • There are probably many ways. However when I google "CUDA device lambda", the first hit I get is [this](https://devblogs.nvidia.com/new-compiler-features-cuda-8/). It may be worth a read. I think you'll find questions about device lambda usage here on the `cuda` tag as well, with a bit of searching. – Robert Crovella Aug 21 '18 at 17:13
  • all of my comments except the VS memory checker one apply equally to linux. Run your code with `cuda-memcheck` on linux. If you're not sure how to use `cuda-memcheck`, try google. – Robert Crovella Aug 21 '18 at 17:15

2 Answers


The problem is that your code creates a lambda in host code (so it is compiled for whatever host processor you specify) and then attempts to use that compiled lambda in device code. This won't work. If you run your code with cuda-memcheck it indicates an error, which may take one of several forms; I see an "Invalid PC" message, which means that your program attempted to execute an instruction from an invalid location:

$ cuda-memcheck ./t277
========= CUDA-MEMCHECK
Input x: 2 y: 3 z: 5
========= Invalid PC
=========     at 0x00000048 in void StructFunctor<float>(GpuData<float>*, nvstd::function<float () (void)>)
=========     by thread (0,0,0) in block (0,0,0)
=========     Device Frame:void StructFunctor<float>(GpuData<float>*, nvstd::function<float () (void)>) (void StructFunctor<float>(GpuData<float>*, nvstd::function<float () (void)>) : 0x40)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/lib64/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x2486ed]
=========     Host Frame:./t277 [0x190b2]
=========     Host Frame:./t277 [0x192a7]

In CUDA, if you want to use a lambda in device code, you must decorate it properly, just like any other code you intend to execute on the device. An initial introduction of this concept was made here, although you can find many other examples.

There are probably many ways to fix the code, depending on your final intent, but an approach that hews closely to the aforementioned introduction/link might look like this:

$ cat t277.cu
#include <iostream>
template <typename T>
struct GpuData {
    T x;
    T y;
    T z;
};

template <typename T, typename F>
__global__ void StructFunctor(GpuData<T>* in_dat, F f) {
    in_dat->x = f();
    in_dat->y += T{1};
};

int main(int argc, char** argv) {
    GpuData<float> c_dat {2, 3, 5};
    std::cout << "Input x: " << c_dat.x << " y: " << c_dat.y << " z: " << c_dat.z << std::endl;

    GpuData<float>* g_dat;
    cudaMalloc(&g_dat, sizeof(GpuData<float>));
    cudaMemcpy(g_dat, &c_dat, sizeof(GpuData<float>), cudaMemcpyHostToDevice);
    StructFunctor<float><<<1, 1>>>(g_dat, [] __host__ __device__ ()->float{return 1.0f;});

    cudaMemcpy(&c_dat, g_dat, sizeof(GpuData<float>), cudaMemcpyDeviceToHost);
    std::cout << "Output x: " << c_dat.x << " y: " << c_dat.y << " z: " << c_dat.z << std::endl;
    return 0;
}
$ nvcc -std=c++11 t277.cu -o t277 --expt-extended-lambda
$ cuda-memcheck ./t277
========= CUDA-MEMCHECK
Input x: 2 y: 3 z: 5
Output x: 1 y: 4 z: 5
========= ERROR SUMMARY: 0 errors
$

(the __host__ decorator that I added to the lambda is not necessary in this particular case, but the __device__ decorator is.)

Note that I'm working off the original code you posted, not the modified version edited into your question by @einpoklum

Before asking others for help, if you are having trouble with CUDA code, I usually recommend that you be sure to do proper CUDA error checking and run your code with cuda-memcheck. Even if you don't understand the output, it will be useful for those trying to help you.
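As a concrete sketch of what "proper CUDA error checking" can look like (the macro name here is mine; any equivalent wrapper works), something like this would have surfaced the failed launch in the original code:

```cuda
#include <cstdio>
#include <cstdlib>

// Minimal error-checking macro: wraps a CUDA runtime call and aborts
// with the error string and source location on failure.
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err__ = (call);                                        \
        if (err__ != cudaSuccess) {                                        \
            std::fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                         cudaGetErrorString(err__), __FILE__, __LINE__);   \
            std::exit(EXIT_FAILURE);                                       \
        }                                                                  \
    } while (0)

// Usage: wrap every runtime call, and check kernel launches explicitly,
// since a launch statement returns no status by itself:
//   CUDA_CHECK(cudaMalloc(&g_dat, sizeof(GpuData)));
//   StructFunctor<<<1, 1>>>(g_dat, f);
//   CUDA_CHECK(cudaGetLastError());        // catches launch errors
//   CUDA_CHECK(cudaDeviceSynchronize());   // catches errors during execution
```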

Robert Crovella
  • Thanks for pointing out cuda-memcheck. I'm finally able to see the error. I've made the modifications you recommended and it runs. This leaves me with the question of why that worked. I understand why the host/device decoration is needed, but if you keep the `nvstd::function` as the type of the function passed, there is no compilation error, but an illegal memory access is encountered. According to the NVIDIA documentation you posted in a comment earlier, the polymorphic function wrapper should work for extended lambdas. – esdanol Aug 21 '18 at 18:46
  • The `nvstd::function` support can indeed be used, but it is not a replacement for proper decoration. Stated another way, the usage of `nvstd::function` does not automatically create host and device versions. Stated another way, you did not define/create an extended lambda definition (you created an ordinary lambda), and the `nvstd::function` usage will not modify your definition for you. – Robert Crovella Aug 21 '18 at 18:49
  • `nvstd::function` provides a container for callable entities. But an object of that type will still only be usable **either** in host code or device code **but not both**. Note the disclaimer [here](https://devblogs.nvidia.com/new-compiler-features-cuda-8/) "One caveat: you still cannot pass nvstd::function objects initialized in host code to device code (and vice versa)." If you create an object of type `nvstd::function` in host code, using ordinary methods, it will capture a host-callable entity (not device-). This is a fairly involved topic related to CUDA device function pointer usage. – Robert Crovella Aug 21 '18 at 18:56
  • I think you've missed my question there. I am keeping the entire file the same except for the declaration of StructFunctor which I want to constrain the type of F to be a `function`. When I use that constraint, is the device decoration from the initialization of the lambda in the main function lost? Unless this is what you are referring to in your second comment. If the lambda is decorated device, is it still considered as being created in host code? – esdanol Aug 21 '18 at 19:40
  • And a further question: How would the kernel be called if I have a different host device function that is not a lambda? – esdanol Aug 21 '18 at 21:33
  • [This answer](https://stackoverflow.com/questions/31057870/passing-host-function-as-a-function-pointer-in-global-or-device-function/31058123#31058123) links to a variety of resources on device function pointer usage. – Robert Crovella Aug 21 '18 at 22:00
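For the last question in the comments above, the linked resources describe the device-function-pointer pattern; a rough sketch (all names here are mine, and this assumes the templated kernel from the answer above) might look like:

```cuda
// A free __device__ function instead of a lambda.
__device__ float return_two() { return 2.0f; }

// A function-pointer type, and a __device__ variable holding the address.
// The address of a __device__ function is only meaningful in device code,
// so it is captured in this device-side variable...
typedef float (*func_t)();
__device__ func_t d_func = return_two;

// ...and then copied back to the host, which can pass it to the kernel:
//   func_t h_func;
//   cudaMemcpyFromSymbol(&h_func, d_func, sizeof(func_t));
//   StructFunctor<float><<<1, 1>>>(g_dat, h_func);
```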

Indeed, as @RobertCrovella notes, the address of a (host-only) lambda is not a valid device-side address, and so the constructed `nvstd::function` is not invocable on the device. When you try to invoke it in the kernel, you get an error. Here is your code (well, my edit of your code), converted to use proper error checking:

#include <iostream>
#include <nvfunctional>
#include <cuda/api_wrappers.h>

struct GpuData { float x, y, z; };

__global__ void StructFunctor(GpuData* in_dat, nvstd::function<float(void)> func) {
    in_dat->x = func();
    in_dat->y += float{1};
};

int main(int argc, char** argv) {
    using std::cout; using std::endl;
    GpuData c_dat {2, 3, 5};
    cout << "Input x: " << c_dat.x << " y: " << c_dat.y << " z: " << c_dat.z << endl;

    auto device = cuda::device::current::get();
    auto g_dat = cuda::memory::device::make_unique<GpuData>(device);
    cuda::memory::copy(g_dat.get(), &c_dat, sizeof(GpuData));
    device.launch(StructFunctor, cuda::make_launch_config(1, 1), 
        g_dat.get(), []()->float { return 1.0f; });
    cuda::outstanding_error::ensure_none(); // This is where we'll notice the error
    cuda::memory::copy(&c_dat, g_dat.get(), sizeof(GpuData));
    cout << "Output x: " << c_dat.x << " y: " << c_dat.y << " z: " << c_dat.z << std::endl;
}

When you run this, you get:

Input x: 2 y: 3 z: 5
terminate called after throwing an instance of 'cuda::runtime_error'
  what():  Synchronously copying data: an illegal memory access was encountered
Aborted

And the fix would be:

#include <iostream>
#include <cuda/api_wrappers.h>

struct GpuData { float x, y, z; };

template <typename F>
__global__ void StructFunctor(GpuData* in_dat, F func) {
    in_dat->x = func();
    in_dat->y += float{1};
};

int main(int argc, char** argv) {
    using std::cout; using std::endl;
    GpuData c_dat {2, 3, 5};
    cout << "Input x: " << c_dat.x << " y: " << c_dat.y << " z: " << c_dat.z << endl;

    auto device = cuda::device::current::get();
    auto g_dat = cuda::memory::device::make_unique<GpuData>(device);
    cuda::memory::copy(g_dat.get(), &c_dat, sizeof(GpuData));
    auto return_one = [] __device__ ()->float { return 1.0f; };
    device.launch(StructFunctor<decltype(return_one)>, cuda::make_launch_config(1, 1), g_dat.get(), return_one);
    cuda::outstanding_error::ensure_none();
    cuda::memory::copy(&c_dat, g_dat.get(), sizeof(GpuData));
    cout << "Output x: " << c_dat.x << " y: " << c_dat.y << " z: " << c_dat.z << endl;
}

To use the CUDA API Wrappers, add this to your CMakeLists.txt (note that ExternalProject_Add requires the ExternalProject module to be included first):

include(ExternalProject)
ExternalProject_Add(cuda-api-wrappers_project
    PREFIX CMakeFiles/cuda-api-wrappers_project
    TMP_DIR CMakeFiles/cuda-api-wrappers_project/tmp
    STAMP_DIR CMakeFiles/cuda-api-wrappers_project/stamp
    GIT_REPOSITORY git@github.com:eyalroz/cuda-api-wrappers.git
    GIT_TAG 7e48712af95939361bf04e4f4718688795a319f9
    UPDATE_COMMAND ""
    SOURCE_DIR "${CMAKE_SOURCE_DIR}/cuda-api-wrappers"
    BUILD_IN_SOURCE 1
    INSTALL_COMMAND ""
  )
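One more build note: the `__device__` lambda in the fixed code relies on nvcc's extended-lambda support, which the question's CMake setup does not enable (the first answer passes `--expt-extended-lambda` on the command line). A sketch of one way to pass the flag through CMake, assuming the CMake 3.8+ CUDA language support already in use:

```cmake
# Pass --expt-extended-lambda only when compiling CUDA sources of the target.
target_compile_options(CudaDemo PRIVATE
    $<$<COMPILE_LANGUAGE:CUDA>:--expt-extended-lambda>)
```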
einpoklum
  • According to my compiler, `cuda/api_wrappers.h` does not exist. Is this a standard CUDA library? – esdanol Aug 21 '18 at 18:49
  • No, though it probably should be... You can get it from [here](https://github.com/eyalroz/cuda-api-wrappers/). Fair disclosure: I'm the author :-) – einpoklum Aug 21 '18 at 18:52