I am trying to pass object to kernel. This object has basically two variables, one acts as the input and the other as the output of the kernel. But when I launch kernel the output variable does not change. But when I add another variable to kernel and assign the output value to this variable as well, it suddenly works for both of them.
I've read in another thread (While loop fails in CUDA kernel) that the compiler can evaluate kernel as empty for optimizing purposes if it doesn't produce any output.
So it is possible that this input/output object that I'm passing as the only kernel argument isn't somehow recognized by the compiler as an output? And if that's true. Is there an elegant way (I would like to avoid adding another kernel argument) such as compiling option that can prevent this?
This is the class for this object.
class Replica
{
public :
signed char gA[1024];
int MA;
__device__ __host__ Replica(){
}
};
And this is the kernel that is basically a sum reduction.
__global__ void sumKerA(Replica* Rd)
{
int t = threadIdx.x;
int b = blockIdx.x;
__shared__ signed short gAs[1024];
gAs[t] = Rd[b].gA[t];
for (unsigned int stride = 1024 >> 1; stride > 0; stride >>= 1){
__syncthreads();
if (t < stride){
gAs[t] += gAs[t + stride];
}
}
__syncthreads();
if (t == 0){
Rd[b].MA = gAs[0];
}
}
And finally my host code.
int main ()
{
// replicas - array of objects
Replica R[128];
for (int i = 0; i < 128; ++i){
for (int j = 0; j < 1024; ++j){
R[i].gA[j] = 2*(rand() % 2) - 1;
}
R[i].MA = 0;
}
Replica* Rd;
cudaSetDevice(0);
cudaMalloc((void **)&Rd,128*sizeof(Replica));
cudaMemcpy(Rd,R,128*sizeof(Replica),cudaMemcpyHostToDevice);
dim3 DimBlock(1024,1,1);
dim3 DimGridA(128,1,1);
sumKerA <<< DimBlock, DimGridA >>> (Rd);
cudaThreadSynchronize();
cudaMemcpy(&R,Rd,128*sizeof(Replica),cudaMemcpyDeviceToHost);
// cudaMemcpy(&M,Md,128*sizeof(int),cudaMemcpyDeviceToHost);
for (int i = 0; i < 128; ++i){
cout << R[i].MA << " ";
}
cudaFree(Rd);
return 0;
}