
I am trying to pass an object to a kernel. This object basically has two member variables: one acts as the input and the other as the output of the kernel. But when I launch the kernel, the output variable does not change. However, when I add another variable to the kernel and assign the output value to this variable as well, it suddenly works for both of them.

I've read in another thread (While loop fails in CUDA kernel) that the compiler can evaluate a kernel as empty for optimization purposes if it doesn't produce any output.

So is it possible that this input/output object that I'm passing as the only kernel argument somehow isn't recognized by the compiler as an output? And if that's true, is there an elegant way to prevent this, such as a compile option? (I would like to avoid adding another kernel argument.)

This is the class for this object.

class Replica
{
    public:
        signed char gA[1024]; // input: per-thread values to be summed
        int MA;               // output: per-block sum of gA
        __device__ __host__ Replica(){
        }
};

And this is the kernel, which is basically a sum reduction.

__global__ void sumKerA(Replica* Rd)
{
    int t = threadIdx.x;
    int b = blockIdx.x;

    // stage this block's gA values in shared memory
    __shared__ signed short gAs[1024];
    gAs[t] = Rd[b].gA[t];

    // tree reduction: halve the active stride each iteration
    for (unsigned int stride = 1024 >> 1; stride > 0; stride >>= 1){
        __syncthreads();
        if (t < stride){
            gAs[t] += gAs[t + stride];
        }
    }
    __syncthreads();

    // thread 0 writes the block's sum back to global memory
    if (t == 0){
        Rd[b].MA = gAs[0];
    }
}

And finally, my host code.

#include <iostream>
#include <cstdlib>

using namespace std;

int main ()
{
    // replicas - array of objects
    Replica R[128];
    for (int i = 0; i < 128; ++i){
        for (int j = 0; j < 1024; ++j){
            R[i].gA[j] = 2*(rand() % 2) - 1;
        }
        R[i].MA = 0;
    }

    Replica* Rd;

    cudaSetDevice(0);

    cudaMalloc((void **)&Rd,128*sizeof(Replica));
    cudaMemcpy(Rd,R,128*sizeof(Replica),cudaMemcpyHostToDevice);

    dim3 DimBlock(1024,1,1);
    dim3 DimGridA(128,1,1);

    sumKerA <<< DimBlock, DimGridA >>> (Rd);
    cudaThreadSynchronize();

    cudaMemcpy(&R,Rd,128*sizeof(Replica),cudaMemcpyDeviceToHost);
    // cudaMemcpy(&M,Md,128*sizeof(int),cudaMemcpyDeviceToHost);
    for (int i = 0; i < 128; ++i){
        cout << R[i].MA << " ";
    }

    cudaFree(Rd);

    return 0;
}

1 Answer


Based on your reduction code, it appears that you intend to launch 1024 threads per block.

In that case, this is incorrect:

dim3 DimBlock(1024,1,1);
dim3 DimGridA(128,1,1);

sumKerA <<< DimBlock, DimGridA >>> (Rd);

The first kernel configuration parameter is the dimensions of the grid. The second parameter is the dimension of the threadblock. If you want 1024 threads per block, while launching 128 blocks, your kernel launch should look like this:

sumKerA <<< DimGridA, DimBlock >>> (Rd);

If you add proper CUDA error checking to your code, I expect you would see a kernel launch failure, because in your original case using the block variable (blockIdx.x) to index into the 128-element Rd array would index beyond the end of the array.
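For reference, here is a minimal sketch of one common error-checking pattern (the cudaCheck macro name is just illustrative, not part of the CUDA API):

#include <cstdio>
#include <cstdlib>

// Illustrative helper: abort with a readable message if a CUDA
// runtime call returns anything other than cudaSuccess.
#define cudaCheck(call)                                            \
    do {                                                           \
        cudaError_t err = (call);                                  \
        if (err != cudaSuccess) {                                  \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",           \
                    __FILE__, __LINE__, cudaGetErrorString(err));  \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

Wrapping every runtime call this way, and checking cudaGetLastError() right after each kernel launch, makes configuration mistakes like this one fail loudly instead of silently leaving the output unchanged.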

If you modify the Replica objects pointed to by Rd in your kernel, that is externally visible state, so any code that modifies those objects cannot be "optimized away" by the compiler.
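To make the distinction concrete, here is a hypothetical sketch (deadKer is not from your code): a kernel whose result never reaches global memory has no externally visible effect and may be compiled down to nothing, whereas your write to Rd[b].MA must be preserved.

__global__ void deadKer(const Replica* Rd)
{
    // The sum lives only in a register and is never stored to
    // global memory, so the compiler is free to eliminate the
    // loads and the loop entirely.
    int sum = 0;
    for (int j = 0; j < 1024; ++j)
        sum += Rd[blockIdx.x].gA[j];
}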

Also note that cudaThreadSynchronize() is deprecated in favor of cudaDeviceSynchronize() (they have the same behavior).
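Putting the pieces together, the corrected portion of your host code would look something like this (reusing the illustrative cudaCheck macro from above):

dim3 DimBlock(1024,1,1);   // threads per block
dim3 DimGridA(128,1,1);    // blocks in the grid

// grid dimensions first, block dimensions second
sumKerA <<< DimGridA, DimBlock >>> (Rd);
cudaCheck(cudaGetLastError());       // catches launch-time errors
cudaCheck(cudaDeviceSynchronize());  // catches execution-time errors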

– Robert Crovella
  • Yes, you are right, Robert. I'm writing code for a complex simulation, and I made this example to try a new programming approach that I could use in the simulation. Unfortunately, I made such a trivial mistake while my mind was focused elsewhere. I feel ashamed because it hasn't happened to me before; maybe that's my punishment for being such a lazy programmer. I should really start using error checking. Thanks a lot for all your help. – L'ahim Nov 11 '14 at 14:11