
I have several device functions defined before main() and I would like to fill an array in global memory with pointers to these functions so that any kernel can access them. I know I could directly declare a static device array[] and fill it with the various functions, but that is not what I want, because it would mean hardcoding them; I want to make it as simple as possible for the user. So I prefer to avoid hardcoding and instead leave the user the possibility to add the functions he/she needs from the host. To that end I have been playing with function pointers on the device and the host, and this is where I got stuck. Here is an example of what I would like to do, but that does not work:

typedef void (*ptrfunction)(int &, int &);

// Example of possible device functions:
__device__ void add(int &type1, int &type2){
    // Do something
}

__device__ void multiply(int &type1, int &type2){
    // Do something
}

__device__ void divide(int &type1, int &type2){
    // Do something
}

// Array to be filled with these functions
__device__ ptrfunction listfunctions[10];

// Function pointers on the device

// IMPORTANT: The following function pointer is not pointing to anything at this moment
__device__ ptrfunction devfunc;

// Function pointer that does point to a function:
__device__ ptrfunction devptr = multiply;


int main(){

    ptrfunction hostptr;

    // What I would like: a function that lets the user register the
    // device functions he/she needs, something like:
    AddFunction(add);
    AddFunction(multiply);
    ...
}

AddFunction should somehow fill the array listfunctions declared on the device with add, multiply, etc. The question here is how to access these device functions from the host and pass them to the device. The first thing I tried was to copy the device function add or multiply directly into the function pointer devfunc using cudaMemcpyToSymbol, thinking that I could then use devfunc from any kernel:

cudaMemcpyToSymbol(devfunc, multiply, sizeof(ptrfunction), 0, cudaMemcpyHostToDevice);

It does not compile: error: no instance of overloaded function "cudaMemcpyToSymbol" matches the argument list

I tried cudaMemcpyToSymbol with an int declared on the device, and it works fine. But when it comes to function pointers, something does not work.
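For comparison, here is a minimal sketch of the int case that does work; devValue and printValue are illustrative names, not from the code above:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// An ordinary __device__ variable: cudaMemcpyToSymbol handles this fine.
__device__ int devValue;

__global__ void printValue(){
    printf("devValue = %d\n", devValue);
}

int main(){
    int hostValue = 42;
    // Copy an int from the host into the __device__ symbol.
    cudaMemcpyToSymbol(devValue, &hostValue, sizeof(int));
    printValue<<<1,1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

The difference is that devValue is a real object with storage that the runtime can locate by symbol, whereas the name of a __device__ function is not a symbol the host-side runtime will accept in the same way.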

An example that works but that I want to avoid is:

cudaMemcpyFromSymbol(&hostptr, devptr, sizeof(ptrfunction));
kernel<<<...,...>>>(hostptr);
cudaDeviceSynchronize();

This example works well. The function pointer devptr, declared on the device and pointing to the function multiply, is copied to a function pointer on the host, which can then be passed as an argument to a kernel. The drawback, as I explained before, is that I must hardcode before main() which function the pointer points to, and that is what I want to avoid. Building on this working example, I then tried to copy a device function directly to the host function pointer, instead of copying the device function pointer. Here I try to copy the device function multiply to the host function pointer:

cudaMemcpyFromSymbol(&hostptr, multiply, sizeof(ptrfunction));
kernel<<<...,...>>>(hostptr);
cudaDeviceSynchronize();

It compiles without complaining. But when I run it under cuda-memcheck, this is what I get:

Program hit cudaErrorInvalidSymbol (error 13) due to "invalid device symbol" on CUDA API call to cudaMemcpyFromSymbol. ... Hardware Stack Overflow ========= at 0x00022ca8 Device Frame:kernel......

Copying a device function pointer to a host function pointer works, but when I try to copy a function directly, it does not. Pity.

The question thus is: how can I access, from the host, the address of a device function (add, multiply, ...) in order to pass it either as a kernel argument or via an API call that copies it to a device function pointer?
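As the comments below point out, one known workaround is a small setup kernel: device function addresses are visible in device code, so the kernel can store them into the __device__ array. A minimal sketch, where setupFunctions and apply are names I made up and the function bodies are illustrative:

```cuda
#include <cstdio>

typedef void (*ptrfunction)(int &, int &);

__device__ void add(int &a, int &b){ a = a + b; }
__device__ void multiply(int &a, int &b){ a = a * b; }

__device__ ptrfunction listfunctions[10];

// Device function addresses can be taken in device code,
// so a one-off setup kernel can populate the array.
__global__ void setupFunctions(){
    listfunctions[0] = add;
    listfunctions[1] = multiply;
}

// Any later kernel can call through the array.
__global__ void apply(int idx, int x, int y){
    listfunctions[idx](x, y);
    printf("result = %d\n", x);
}

int main(){
    setupFunctions<<<1,1>>>();   // run once, before kernels that use the array
    apply<<<1,1>>>(1, 6, 7);     // index 1 selects multiply
    cudaDeviceSynchronize();
    return 0;
}
```

This avoids taking the function address on the host entirely; the host only decides which slot of the array a kernel should use.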

  • Curious that two very similar but unusual questions are being asked, like this one [here](http://stackoverflow.com/questions/31694730/cuda-copy-dynamically-created-array-of-function-pointers-on-the-cpu-to-gpu-memo). Is this a homework assignment for some course? Anyway, the address of a device function cannot be directly taken in host code. It will require some kind of device code activity to access the address of a device function. – Robert Crovella Jul 29 '15 at 15:03
  • @RobertCrovella Good observation! The person who posted the other question is my PhD student. Hehe. I wanted to formulate the question another way with my own findings. What I do not understand is that it is possible to copy, with cudaMemcpyFromSymbol(), a device function pointer to a host function pointer, but not the device function itself. – Christophe J. Ortiz Jul 29 '15 at 15:11
  • Correct. The "device function pointer" (i.e. `devptr`) is really no different than any other `__device__` variable, and `cudaMemcpyFromSymbol` will work just fine with that. But `multiply` is different than an ordinary `__device__` variable. It is the address of a `__device__` function, and that address cannot be *directly* accessed in host code. It requires some sort of device interaction, either via a setup kernel as @m.s. indicated in the other question I linked to, or using some method to put the device function address into a "device function pointer" variable. – Robert Crovella Jul 29 '15 at 15:16
  • This [question/answer](http://stackoverflow.com/questions/31057870/passing-host-function-as-a-function-pointer-in-global-or-device-function) may also be of interest (perhaps another one of your students). You may also want to investigate using the driver API and possibly the [nvrtc](http://docs.nvidia.com/cuda/nvrtc/index.html#axzz3hEPmGF2a) CUDA runtime compilation mechanism. There is also this [ugly thing](http://stackoverflow.com/questions/22824897/online-compilation-of-single-cuda-function/24498416#24498416) I did in the past. Some of these links may give you ideas. – Robert Crovella Jul 29 '15 at 15:20
  • Function objects (functors) (for example as used by [thrust](http://stackoverflow.com/questions/31684346/thrust-not-calling-device-function)) and [CUDA 7.5 experimental device lambdas](http://devblogs.nvidia.com/parallelforall/new-features-cuda-7-5/) are other ideas that may be interesting in this case. – Robert Crovella Jul 29 '15 at 19:13
  • @RobertCrovella Dear Robert, thank you for your replies. After a few days working on this problem, my PhD student finally found a nice and short workaround and was able to pass a device function to the GPU from the host. He is writing the answer in the other post. – Christophe J. Ortiz Jul 30 '15 at 13:52
  • Great. The method described there is just using a setup kernel, which method is included in one of the links I sent you. That is to say, the need for device code interaction cannot be avoided. And in fact, that setup kernel is not actually passing a device pointer from the host to the device. Anyway, can we mark this question as a duplicate of that one? – Robert Crovella Jul 30 '15 at 15:35
  • Yes, I think the question can be marked as a duplicate of the other one. As you can see in the other post, the solution my student found is relatively simple. It uses a template, which passes the device function directly as a parameter to a templated kernel. I must confess it surprises me that it works, and without any CUDA API call. – Christophe J. Ortiz Jul 30 '15 at 16:12
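The templated approach mentioned in the last comment can be sketched roughly as follows (names and bodies are illustrative; the full answer is in the linked question). The device function is supplied as a non-type template argument, so its address is resolved during device-code compilation rather than taken on the host:

```cuda
#include <cstdio>

typedef void (*ptrfunction)(int &, int &);

__device__ void multiply(int &a, int &b){ a = a * b; }

// The device function arrives as a compile-time template parameter,
// so no device function address ever crosses the host/device boundary.
template <ptrfunction F>
__global__ void apply(int x, int y){
    F(x, y);
    printf("result = %d\n", x);
}

int main(){
    apply<multiply><<<1,1>>>(6, 7);  // instantiates the kernel with multiply
    cudaDeviceSynchronize();
    return 0;
}
```

This matches the observation in the comments: no CUDA API call is needed because nothing is copied at runtime; the choice of function is baked in at kernel instantiation.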

0 Answers