
Reduction in CUDA has utterly baffled me! First off, both this tutorial by Mark Harris and this one by Mike Giles use the declaration `extern __shared__ temp[]`. In C, the keyword `extern` is used when a declaration is made here but the allocation takes place elsewhere (e.g. in another C file). What is the relevance of `extern` here? Why don't we use:

__shared__ float temp[N/2];

for instance? Or why don't we declare temp to be a global variable, e.g.

#define N 1024
__shared__ float temp[N/2];

__global__ void sum(float *sum,  float *data){ ... }

int main(){
 ...
 sum<<<M,L>>>(sum, data);
}

I have yet another question: how many blocks and threads per block should one use to invoke the summation kernel? I tried this example (based on this).

Note: You can find information about my devices here.

Pantelis Sopasakis
    There are two different methods of allocating shared memory, one with statically allocated size, and one with dynamically allocated size. Read [here](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared). If you say `__shared__ float temp[256];` you are doing a static allocation. If you say `extern __shared__ float temp[];` you are doing a dynamic (at run-time) allocation. – Robert Crovella Apr 07 '13 at 15:00
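A minimal sketch contrasting the two declaration styles the comment describes (the kernel names and the 256-float size are illustrative, not taken from the question):

__global__ void static_smem_kernel(float *out)
{
    __shared__ float temp[256];       // static: size fixed at compile time
    temp[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    if (threadIdx.x == 0) out[blockIdx.x] = temp[0];
}

__global__ void dynamic_smem_kernel(float *out)
{
    extern __shared__ float temp[];   // dynamic: size set by the third <<< >>> launch argument
    temp[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    if (threadIdx.x == 0) out[blockIdx.x] = temp[0];
}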

1 Answer


The answer to the first question is that CUDA supports dynamic shared memory allocation at runtime (see this SO question and the documentation for more details). Declaring shared memory with `extern` tells the compiler that the shared memory size will be determined at kernel launch, passed in bytes as an argument to the `<<< >>>` syntax (or equivalently via an API function), something like:

sum<<< gridsize, blocksize, sharedmem_size >>>(....);
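If the kernel stores one partial value per thread in shared memory, the third launch argument is usually just the block size times the element size; for example (the variable names here are placeholders, not the question's code):

size_t sharedmem_size = blocksize * sizeof(float);   // one float of shared memory per thread
sum<<< gridsize, blocksize, sharedmem_size >>>(d_sum, d_data);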

The answer to the second question is that you should normally launch enough blocks to completely fill all the streaming multiprocessors on your GPU. Most sensibly written reduction kernels accumulate many values per thread and then perform a shared memory reduction. The reduction requires the number of threads per block to be a power of two, which usually leaves 32, 64, 128, 256, or 512 (or 1024 if you have a Fermi or Kepler GPU). That is a very finite search space; just benchmark to see what works best on your hardware. You can find a more general discussion about block and grid sizing here and here.
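A minimal sketch of such a kernel, assuming dynamically allocated shared memory of one float per thread and a power-of-two block size; the kernel and variable names are illustrative, not the code from the linked example:

// Each thread strides over the input and accumulates a private sum,
// then the block reduces its partial sums in shared memory.
__global__ void blockSum(const float *data, float *partial, int n)
{
    extern __shared__ float temp[];              // blockDim.x floats, sized at launch

    float acc = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        acc += data[i];                          // grid-stride loop: many values per thread

    temp[threadIdx.x] = acc;
    __syncthreads();

    // Tree reduction in shared memory; requires blockDim.x to be a power of two.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            temp[threadIdx.x] += temp[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        partial[blockIdx.x] = temp[0];           // one partial sum per block
}

// Example launch: enough blocks to fill the device, 256 threads per block.
// blockSum<<<gridsize, 256, 256 * sizeof(float)>>>(d_data, d_partial, n);

The per-block results in partial still need a final, much smaller reduction, either on the host or with a second, single-block launch of the same kernel.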

talonmies
  • It would also be remiss of me not to point out that everything in the answer I posted is either in the documentation or could be found by perusing the SO CUDA frequently asked questions, or with the search engine of your choice.... – talonmies Apr 07 '13 at 15:34
  • Thanks, the main problem was indeed that I had to specify the size of the shared memory to be allocated. Second, this function, as it is written, returns an array with the per-block partial sums, which should be summed up at the end. I will have to modify the code a bit eventually... – Pantelis Sopasakis Apr 08 '13 at 12:57