
I'm rather new to CUDA/Thrust and have a problem with a code snippet. To make it easier I have trimmed it down to the bare minimum. The code is the following:

#include <thrust/device_vector.h>
#include <thrust/transform.h>

struct functor{
    functor(float (*g)(const float&)) : _g{g} {}

    __host__ __device__ float operator()(const float& x) const {
        return _g(x);
    }
private:
    float (*_g)(const float&);
};

__host__ __device__ float g(const float& x){return 3*x;}

int main(void){
    thrust::device_vector<float> X(4,1);
    thrust::transform(X.begin(), X.end(), X.begin(), functor(&g));
}

The idea is that I can pass any function to the functor, so I can apply that function to every element in a vector. Unfortunately, I'm uncertain why I get the described error. I compile with `-w -O3 -shared -arch=sm_20 -std=c++11 -DTHRUST_DEBUG`.

I'm thankful for any help you all can give me :)

hassec
  • The short reason is that `&g` isn't a pointer to a device function, and it isn't valid to pass it to a closure which will run on the GPU – talonmies Jan 19 '16 at 15:15

2 Answers


The address of a `__device__` function (or `__host__ __device__` function) cannot be taken in host code for use on the device:

thrust::transform(X.begin(), X.end(), X.begin(), functor(&g));
                                                         ^
                                                     You will not get the 
                                                     __device__ function
                                                     address here

There are many questions on Stack Overflow which discuss the usage of CUDA device function addresses passed via kernel calls. This answer links to several which may be of interest.

One possible approach to fix this would be to acquire the device function address in device code, and pass it to the host, for usage like you are describing:

$ cat t1057.cu
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/copy.h>
#include <iostream>
struct functor{
    functor(float (*g)(const float&)) : _g{g} {}

    __host__ __device__ float operator()(const float& x) const {
        return _g(x);
    }
private:
    float (*_g)(const float&);
};

__host__ __device__ float g(const float& x){return 3*x;}

// Take the address of g in device code, storing it in a __device__ variable
__device__ float (*d_g)(const float&) = g;

int main(void){
    float (*h_g)(const float&) = NULL;
    // Copy the device function pointer value to a host variable
    cudaMemcpyFromSymbol(&h_g, d_g, sizeof(void *));
    thrust::device_vector<float> X(4,1);
    thrust::transform(X.begin(), X.end(), X.begin(), functor(h_g));
    thrust::copy_n(X.begin(), X.size(), std::ostream_iterator<float>(std::cout, ","));
    std::cout << std::endl;
}
$ nvcc -o t1057 t1057.cu -std=c++11
$ ./t1057
3,3,3,3,
$

Another possible approach, leveraging the always-clever work by @m.s. here, uses templating:

$ cat t1057.cu
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/copy.h>
#include <iostream>

typedef float(*fptr_t)(const float&);

template <fptr_t F>
struct functor{
    __host__ __device__ float operator()(const float& x) const {
        return F(x);
    }
};

__host__ __device__ float g(const float& x){return 3*x;}


int main(void){
    thrust::device_vector<float> X(4,1);
    // The function address is bound at compile time as a template argument,
    // so no pointer value ever has to cross the host/device boundary
    thrust::transform(X.begin(), X.end(), X.begin(), functor<g>());
    thrust::copy_n(X.begin(), X.size(), std::ostream_iterator<float>(std::cout, ","));
    std::cout << std::endl;
}
$ nvcc -o t1057 t1057.cu -std=c++11
$ ./t1057
3,3,3,3,
$
Robert Crovella
  • I think that is the only way to do this (although I'm a bit surprised that static initialisation works like that). Once upon a time it would have been necessary to run a small "setter" kernel to get the address of the device function and set it to the global memory symbol. – talonmies Jan 19 '16 at 15:29
  • @m.s. gave a clever approach using templating and c++11 [here](http://stackoverflow.com/questions/31694730/cuda-copy-dynamically-created-array-of-function-pointers-on-the-cpu-to-gpu-memo/31699687#31699687) In effect, the function addresses are still being "captured" in device code. – Robert Crovella Jan 19 '16 at 15:38

It's also helpful to check "How does CUDA's cudaMemcpyFromSymbol work?".

cudafe (the CUDA compiler frontend) creates a normal global variable, as in C, and also a CUDA-specific PTX variable. The global C variable is used so that the host program can refer to the variable by its address, and the PTX variable is used for the actual storage of the variable.

The presence of the host variable also allows the host compiler to successfully parse the program. When the device program executes, it operates on the PTX variable when it manipulates the variable by name.

Basically, the host and the device have different address spaces, and you can't mix the two. That is to say, a function pointer taken in device address space can only be used on the device, not from the host.

Izana