
After reading the question and its answer from the following
LINK

I still have a question remaining in my mind. From my background in C/C++, I understand that using volatile has its demerits. It is also pointed out in the answers that, in the case of CUDA, the optimizer can keep data in registers instead of the shared array if the volatile keyword is not used.

I want to know what performance issues can be encountered when computing a (sum) reduction, e.g.:

__device__ void sum(volatile int *s_data, int tid)
{
    if (tid < 16)
    {
        // Each step folds the upper half of the active range onto the
        // lower half; volatile forces every += through shared memory.
        s_data[tid] += s_data[tid + 16];
        s_data[tid] += s_data[tid +  8];
        s_data[tid] += s_data[tid +  4];
        s_data[tid] += s_data[tid +  2];
        s_data[tid] += s_data[tid +  1];
        // s_data[0] now holds the sum of s_data[0..31].
    }
}

I am using this in a warp reduction. Since all the threads within a warp execute in lockstep, I believe there is no need for the __syncthreads() construct.
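
For context, here is a minimal sketch of the kind of block-level kernel I mean (the kernel name, the block size of 256, and the memory layout are illustrative, not my actual code):

__global__ void reduce(const int *g_in, int *g_out)
{
    __shared__ int s_data[256];
    int tid = threadIdx.x;

    // Each thread loads one element into shared memory.
    s_data[tid] = g_in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    // Tree reduction while more than one warp is still active;
    // __syncthreads() is required at each of these steps.
    for (int s = blockDim.x / 2; s > 16; s >>= 1)
    {
        if (tid < s)
            s_data[tid] += s_data[tid + s];
        __syncthreads();
    }

    // The last 32 partial sums are reduced by a single warp with no
    // __syncthreads(); the int* converts implicitly to volatile int*.
    sum(s_data, tid);

    if (tid == 0)
        g_out[blockIdx.x] = s_data[0];
}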

I want to know: will removing the volatile keyword mess up my sum (due to CUDA optimizations)? Can I use a reduction such as this without the volatile keyword?

Since I use this reduction function multiple times, will the volatile keyword cause any performance degradation?

Psypher

1 Answer


Removing the volatile keyword from that code could break it on Fermi and Kepler GPUs. Those GPUs lack instructions that operate directly on shared memory, so the compiler must emit a load/store pair to and from a register for every shared-memory operation.

What the volatile keyword does in this context is make the compiler honour that load-operate-store cycle and forbid an optimisation that would keep the value of s_data[tid] in a register. Keeping the running sum in a register would break the implicit memory synchronisation required for that warp-level shared-memory summation to work correctly.
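
To make that concrete, here is a sketch of one transformation the optimiser is permitted to perform on a non-volatile version of the function (the name is hypothetical and this is illustrative, not actual compiler output):

__device__ void sum_no_volatile(int *s_data, int tid)
{
    if (tid < 16)
    {
        // The compiler may cache the running sum in a register and
        // elide the intermediate stores back to shared memory:
        int r = s_data[tid];
        r += s_data[tid + 16];
        r += s_data[tid +  8];  // stale: thread tid+8 kept its partial
                                // sum in a register and never stored it
        r += s_data[tid +  4];
        r += s_data[tid +  2];
        r += s_data[tid +  1];
        s_data[tid] = r;        // single store at the end
    }
}

Thread tid then consumes values its neighbours never wrote back, and the final sum is wrong.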

talonmies
  • Can you provide some code where I can see this breakage? I am using a Fermi-architecture GPU. I executed this reduction without `volatile`, as someone suggested, and the resulting sum was still correct. With code I could see how the load and store would be honored, as you mentioned. – Psypher Jan 18 '14 at 14:47
  • There is a complete discussion of this in the "Fermi compatibility guide for CUDA applications" pdf. Note that I said *could* break. Whether it will depends on compiler, optimisation settings and register pressure in the kernel. Just because your code works with the compiler you are using doesn't mean it will work with every version of the CUDA toolkit in the same way, or that it will be safe with every kernel you might write. – talonmies Jan 18 '14 at 15:06
  • So it is better practice to use `volatile` to make sure the summation always works as expected. Thanks, that's exactly the answer I was looking for. – Psypher Jan 18 '14 at 15:08