
I'm new to CUDA and am currently implementing a parallel reduction. I've done quite a bit of research and know that the threads in a warp are supposed to execute in lockstep without calling "__syncthreads()". However, when I test it with the following kernel (launched with only 1 block of 32 threads):

__global__ void TestKernel()
{
    int tid = threadIdx.x;
    __shared__ float temp[32];
    temp[threadIdx.x] = 1;

    printf(" temp[%d] = %f\n", threadIdx.x, temp[threadIdx.x]);

    int thread = tid % 32;

    if (thread < 16){
        temp[thread] += temp[thread + 16];
        //__syncthreads();
        temp[thread] += temp[thread + 8];
        //__syncthreads();
        temp[thread] += temp[thread + 4];
        //__syncthreads();
        temp[thread] += temp[thread + 2];
        //__syncthreads();
        temp[thread] += temp[thread + 1];
    }
    printf("  temp[%d] = %f\n", 0 , temp[0]);
}

And I launch the kernel with:

dim3 Blockdim(32);
TestKernel<<<1, Blockdim>>>();

What I'm doing is assigning the value 1 to each element of an array of size 32, adding them all together with a parallel reduction, and storing the final result in the first element of the array. This does not give me the right output: it prints temp[0] = 6 instead of 32. However, if I uncomment the "__syncthreads()" after each step, it produces the right answer of 32.

So this seems to show that the threads in a warp are not in sync as claimed. Can anybody explain what's going on here? There are a few things I'm not sure about in this program: 1. Am I using only one warp, with all 32 threads belonging to it, in this kernel call? 2. If I am using only one warp and all threads are within it, why do they appear out of sync, and only become synchronized when I call "__syncthreads()"?

Thank you in advance for your help!

  • no use of `volatile`? That's a programming error. Study the [CUDA reduction tutorial](https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf). – Robert Crovella Jul 21 '17 at 02:34
  • Oh THANKS a lot! That's the keyword I had been missing. I actually followed the exact same tutorial in your link, but I missed adding `volatile` when declaring the variables, where the tutorial actually highlights that adding `volatile` is a must if we want to skip calling `__syncthreads()`. Once I add `volatile` in front, the code works perfectly (a corrected sketch follows these comments). – Leonard Hung Jul 21 '17 at 02:56
  • Better Q&A would probably be: https://stackoverflow.com/questions/19439552/when-to-use-volatile-with-register-local-variables and https://stackoverflow.com/questions/15331009/when-to-use-volatile-with-shared-cuda-memory. Now, note that in modern C++, you don't use `volatile`, but rather more expressive types that are designed to provide exactly the semantics that you need. I'm familiar with Cuda programming, so I'm not sure if they still use volatile, or you'd be better off following modern C++ practices with something like `std::atomic` or `std::memory_order`. – Cody Gray - on strike Jul 21 '17 at 05:39
  • @CodyGray `volatile` is used in CUDA to enforce a specific compiler behaviour, and there is no other way to do it. Also, I chose the duplicate I did specifically because its code is the exact same reduction example as this question. – talonmies Jul 21 '17 at 09:11
  • While CUDA [claims compliance](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#c-cplusplus-language-support) to a particular C++ standard (with restrictions), it also [specifically restricts use of the standard library to host code](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#standard-library). There are a few exceptions to this, and some innocuous usages of `std::` functionality happen to work in device code, but in general, suggestions to use `std::` functionality in device code should not be made unless carefully vetted/tested. – Robert Crovella Jul 21 '17 at 19:02
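
For reference, here is a minimal sketch of the `volatile` fix the comments converge on, assuming the same single-block, 32-thread launch as in the question:

#include <cstdio>

__global__ void TestKernel()
{
    int tid = threadIdx.x;
    // volatile keeps the compiler from caching these shared-memory values
    // in registers, so each += re-reads what a neighbouring thread in the
    // warp just wrote.
    volatile __shared__ float temp[32];
    temp[tid] = 1.0f;

    if (tid < 16) {
        temp[tid] += temp[tid + 16];
        temp[tid] += temp[tid + 8];
        temp[tid] += temp[tid + 4];
        temp[tid] += temp[tid + 2];
        temp[tid] += temp[tid + 1];
    }
    if (tid == 0)
        printf("temp[0] = %f\n", temp[0]); // expected: 32.000000
}

int main()
{
    TestKernel<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}

One caveat: on Volta and later GPUs, independent thread scheduling means the threads of a warp are no longer guaranteed to execute in lockstep, so even the `volatile` version should place `__syncwarp()` between the reduction steps. On CUDA 9 and newer, a warp-shuffle reduction avoids shared memory and `volatile` entirely; a sketch under the same launch configuration (the kernel name here is just illustrative):

__global__ void ShuffleReduceKernel()
{
    float val = 1.0f; // each of the 32 lanes contributes 1
    // Tree reduction in registers: each step adds the value held by the
    // lane `offset` positions higher in the warp.
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    if (threadIdx.x == 0)
        printf("sum = %f\n", val); // expected: 32.000000
}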

0 Answers