I'm new to CUDA and am currently implementing a parallel reduction. From my research, I understand that the threads within a warp should stay synchronized without calling "__syncthreads()". However, when I test the following kernel (launched with only 1 block of 32 threads):
    __global__ void TestKernel()
    {
        int tid = threadIdx.x;
        __shared__ float temp[32];
        temp[threadIdx.x] = 1;
        printf(" temp[%d] = %f\n", threadIdx.x, temp[threadIdx.x]);
        int thread = tid % 32;
        if (thread < 16) {
            temp[thread] += temp[thread + 16];
            //__syncthreads();
            temp[thread] += temp[thread + 8];
            //__syncthreads();
            temp[thread] += temp[thread + 4];
            //__syncthreads();
            temp[thread] += temp[thread + 2];
            //__syncthreads();
            temp[thread] += temp[thread + 1];
        }
        printf(" temp[%d] = %f\n", 0, temp[0]);
    }
I launch the kernel with:

    dim3 Blockdim(32);
    TestKernel<<<1, Blockdim>>>();
What I'm doing is assigning the value 1 to every element of a 32-element array, adding them all together with a parallel reduction, and storing the final result in the first element of the array. This does not give me the right output: it prints temp[0] = 6 instead of 32. However, if I uncomment the "__syncthreads()" call after each step, it produces the right answer of 32.
So this seems to show that the threads in a warp are not in sync, contrary to what I have read. Can anybody explain what's going on here? There are a few things I'm not sure about in this program:
1. Am I using only one warp, and all 32 threads of that warp, in this kernel call?
2. If I am using only one warp and all threads are within it, why do they appear to be out of sync, and in sync only when I call "__syncthreads()"?
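For reference, here is a minimal, self-contained sketch of the synchronized variant described above. It is not the exact kernel from my test: the names ReduceWithBarriers, out, d_out and h_out are made up for illustration, the result is copied back to the host instead of being printed from the device, and the five unrolled steps are written as a loop in which only the threads with tid < stride do the addition. The barrier sits outside the conditional because "__syncthreads()" has to be reached by every thread of the block.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel name; sums 32 ones held in shared memory.
__global__ void ReduceWithBarriers(float *out)
{
    __shared__ float temp[32];
    int tid = threadIdx.x;

    temp[tid] = 1.0f;
    __syncthreads();  // make every initial value visible to the whole block

    // Halve the number of active threads at each step: 16, 8, 4, 2, 1.
    for (int stride = 16; stride > 0; stride >>= 1) {
        if (tid < stride) {
            // Reads come from [stride, 2*stride), writes go to [0, stride),
            // so no thread reads an element another thread is writing.
            temp[tid] += temp[tid + stride];
        }
        // Outside the branch, so all 32 threads reach the barrier.
        __syncthreads();
    }

    if (tid == 0) {
        *out = temp[0];  // hand the final sum back to the host
    }
}

int main()
{
    float *d_out = nullptr;  // device-side result (hypothetical name)
    float h_out = 0.0f;      // host-side copy of the result

    cudaMalloc(&d_out, sizeof(float));
    ReduceWithBarriers<<<1, 32>>>(d_out);  // 1 block of 32 threads, as in my test
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    printf("temp[0] = %f\n", h_out);
    return 0;
}
```

With 32 elements that are all 1, the value I expect this to print for temp[0] is 32.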
Thank you in advance for your help!