Function call from cuda kernel not working correctly

Question

I have the following cuda Kernel :

template <class T,typename Func>
__global__
void for_each_kernel (T* d_v,int N,Func f)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    int num_threads = gridDim.x * blockDim.x;

    __shared__ T s_x[1024];

    for(int i = idx; i < N; i += num_threads)
    {

        s_x[threadIdx.x] = d_v[i];
        f(&s_x[threadIdx.x]); // This Does Not Work Correctly
        //s_x[threadIdx.x] *=  10; // This Works Correctly
        d_v[i] = s_x[threadIdx.x];
    }
}

When I call a function f it print the wrong value. But when I do the multiplication directly, it works.

Both portions are highlighted in the comments.

Here is the function which is passed :

__device__
void func(int *x) 
{
    *x = (*x) * 10;

}

Here is how the kernel call is happening :

template <typename Func>
void for_each (int start,int end,Func f)
{   
    for_each_kernel<<<26,1024>>> (d_v,end-start+1,f);   
}

Each of a[i] is initialized to i.

The value that I am printing is a[10], which should print 100. But it is printing 32767 which is 2^15 - 1.

The complete example code which exhibits the problem is as follows:

#include <cstdio>
using namespace std;


__device__
void func(int *x)
{
    *x = (*x) * 10;

}

template <class T,typename Func>
__global__
void for_each_kernel (T* d_v,int N,Func f)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    int num_threads = gridDim.x * blockDim.x;

    __shared__ T s_x[1024];

    for(int i = idx; i < N; i += num_threads)
    {

        s_x[threadIdx.x] = d_v[i];
        f(&s_x[threadIdx.x]);
        //s_x[threadIdx.x] *=  10;
        d_v[i] = s_x[threadIdx.x];
    }
}

template <class T>
class device_vector
{
    T *d_v;
    int numEle;

    public :

    device_vector (T *h_v,int N)
    {
        cudaMalloc  ((T**)&d_v,N*sizeof(T));
        cudaMemcpy(d_v, h_v, N * sizeof(T), cudaMemcpyHostToDevice);
        numEle = N;


    }

    void set (T data,int index)
    {
        cudaMemcpy (&d_v[index],&data,sizeof(T),cudaMemcpyHostToDevice);
    }


    T get (int index)
    {
        T temp;
        cudaMemcpy (&temp,&d_v[index],sizeof(T),cudaMemcpyDeviceToHost);
        return temp;
    }

    void getRange (T *dest,T *start,int N)
    {
        cudaMemcpy (dest,start,sizeof(T)*N,cudaMemcpyDeviceToHost);
    }


    // Only Provide Start And End Vertices Fot Which you Want To Do Some Operation
    template <typename Func>
        void for_each (int start,int end,Func f)
        {  
            for_each_kernel<<<26,1024>>> (d_v,end-start+1,f);  
        }  

};


int a[1<<28];
int main ()
{
    int N = 1<<28;

    for (int i=0;i<N;i++)
        a[i] = i;

    device_vector<int> d (a,N);

    d.for_each (0,N-1,func);

    printf ("Getting Element At Index %d : %d \n",100,d.get(10));

    return 0;
}

You would need to supply a complete repro case before someone could tell you what was wrong. The likelihood is that you are not passing a valid device function to the kernel, but nowhere to you show that, so it is impossible to say for sure. — talonmies, Feb 18 '16 at 19:17
@talonmies i have given my __device__ function in my question. I have also given you the line where kernel is launched. What else should i give...? — Rohith R, Feb 18 '16 at 19:20
Nowhere do you show where the function value is obtained which you pass into `for_each`. I'm 99.9% sure that is where the error will be, but I can say that is the problem without an [MCVE](http://stackoverflow.com/help/mcve) — talonmies, Feb 18 '16 at 19:23
@talonmies http://pastebin.com/sbwfmMKY this might help. This is the complete code. — Rohith R, Feb 18 '16 at 19:29
Yes. The problem lies on code you have chosen not to include in your question. — talonmies, Feb 18 '16 at 19:43
@talonmies Pls tell me where the problem is. That would be helpful. — Rohith R, Feb 18 '16 at 19:46
@PRP - that was a gentle hint that all relevant code must be in this question. The value of the question is for others to know what the problem was, not just for you to have your problem solved and move on. — void_ptr, Feb 18 '16 at 19:52
@PRP: Code at an external link isn't very useful on [SO]. Please edit your question to contain the *minimum, complete* code which someone else could compile and run which reproduces the problem. — talonmies, Feb 19 '16 at 05:33
Related [question 1](http://stackoverflow.com/questions/15644261/cuda-function-pointers) , [question 2](http://stackoverflow.com/questions/9000388/device-function-pointers). — sgarizvi, Feb 19 '16 at 10:23

sgarizvi · Accepted Answer · 2016-02-19T10:32:58.673

The problem is that you are passing a __device__ function pointer from host to the kernel. Taking the address of a device symbol directly from the host is illegal. You have to use cudaMemcpyFromSymbol to get the address of a device symbol on host side. In the current code, you have to create a host side function pointer to the __device__ function and pass it to the kernel. It can be done as follows:

//Declare function pointer
typedef void(*pFunc)(int*);

//Function pointer on device
__device__ pFunc f_device = func;

int main()
{
    int N = 1 << 28;

    for (int i = 0; i<N; i++)
        a[i] = i;

    device_vector<int> d(a, N);

    //Function pointer on host
    pFunc f_host;

    //Copy address of device function to host
    cudaMemcpyFromSymbol(&f_host, f_device, sizeof(pFunc));

    d.for_each(0, N - 1, f_host);

    printf("Getting Element At Index %d : %d \n", 100, d.get(100));

    return 0;
}

Function call from cuda kernel not working correctly

1 Answers1