Reduction in CUDA has utterly baffled me! First off, both this tutorial by Mark Harris and this one by Mike Giles make use of the declaration `extern __shared__ float temp[];`. In C, the keyword `extern` marks a declaration whose storage is allocated elsewhere (e.g., in another C file or translation unit). What is the relevance of `extern` here? Why don't we use:
```c++
__shared__ float temp[N/2];
```
for instance? Or why don't we declare `temp` as a global (file-scope) variable, e.g.:
```c++
#define N 1024
__shared__ float temp[N/2];

__global__ void sum(float *sum, float *data){ ... }

int main(){
    ...
    // d_sum, data: device pointers set up in the elided code; the host copy
    // can't also be named sum, since sum names the kernel here
    sum<<<M, L>>>(d_sum, data);
}
```
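As far as I can tell, the tutorials pair `extern __shared__` with the third argument of the launch configuration, which sets the number of bytes of shared memory per block at launch time. Here is my minimal reconstruction of that pattern (all identifiers and sizes are my own, not taken from either tutorial):

```c++
#include <cstdio>
#include <cuda_runtime.h>

__global__ void blockSum(float *out, const float *data)
{
    extern __shared__ float temp[];   // no size here; supplied at launch
    unsigned tid = threadIdx.x;

    temp[tid] = data[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            temp[tid] += temp[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = temp[0];    // one partial sum per block
}

int main()
{
    const int N = 1024, threads = 256, blocks = N / threads;
    float h_data[N], h_out[blocks];
    for (int i = 0; i < N; ++i) h_data[i] = 1.0f;

    float *d_data, *d_out;
    cudaMalloc(&d_data, N * sizeof(float));
    cudaMalloc(&d_out, blocks * sizeof(float));
    cudaMemcpy(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice);

    // Third launch argument = bytes of dynamic shared memory per block,
    // which is what backs the extern __shared__ declaration above.
    blockSum<<<blocks, threads, threads * sizeof(float)>>>(d_out, d_data);

    cudaMemcpy(h_out, d_out, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    float total = 0.0f;
    for (int b = 0; b < blocks; ++b) total += h_out[b];
    printf("sum = %f (expected %d)\n", total, N);

    cudaFree(d_data);
    cudaFree(d_out);
    return 0;
}
```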
I have yet another question: how many blocks and threads per block should one use to invoke the summation kernel? I tried this example (based on this).
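For reference, the configuration I tried follows the usual ceiling-division pattern (the block size and the helper name are my own choices, not something from the tutorials):

```c++
#include <cuda_runtime.h>

// Hypothetical declaration standing in for the summation kernel.
__global__ void sum(float *out, const float *data);

// A fixed, power-of-two block size (a multiple of the warp size, 32) and a
// grid sized by ceiling division so that all n inputs are covered.
void launchSum(float *d_out, const float *d_data, int n)
{
    const int threadsPerBlock = 256;
    const int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    // Third argument: bytes of dynamic shared memory for extern __shared__ temp[].
    sum<<<blocksPerGrid, threadsPerBlock, threadsPerBlock * sizeof(float)>>>(d_out, d_data);
}
```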
Note: You can find information about my devices here.
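In case it helps, these are the standard `cudaGetDeviceProperties` fields I looked at when gathering that device information (the limits most relevant to the block/thread question above):

```c++
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0
    printf("device: %s\n", prop.name);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("warp size: %d\n", prop.warpSize);
    printf("multiprocessors: %d\n", prop.multiProcessorCount);
    return 0;
}
```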