I have the following "Frankenstein" sum reduction code, taken partly from the common CUDA reduction slides and partly from the CUDA samples.
// blockSize is a compile-time template parameter, as in the CUDA samples
template <unsigned int blockSize>
__global__ void reduce6(float *g_idata, float *g_odata, unsigned int n)
{
    extern __shared__ float sdata[];
    // perform first level of reduction,
    // reading from global memory, writing to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockSize*2 + threadIdx.x;
    unsigned int gridSize = blockSize*2*gridDim.x;
    float mySum = 0;
    // grid-stride loop: each thread sums two elements per iteration
    while (i < n) {
        mySum += g_idata[i];
        if (i + blockSize < n)          // guard the second read against out-of-bounds
            mySum += g_idata[i + blockSize];
        i += gridSize;
    }
    sdata[tid] = mySum;
    __syncthreads();
    // do reduction in shared mem; each step only applies for large enough blocks
    if (blockSize >= 512) { if (tid < 256) sdata[tid] += sdata[tid + 256]; __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) sdata[tid] += sdata[tid + 128]; __syncthreads(); }
    if (blockSize >= 128) { if (tid <  64) sdata[tid] += sdata[tid +  64]; __syncthreads(); }
#if (__CUDA_ARCH__ >= 300)
    if (tid < 32)
    {
        // fetch final intermediate sum from 2nd warp
        mySum = sdata[tid];
        if (blockSize >= 64) mySum += sdata[tid + 32];
        // reduce final warp using shuffle
        for (int offset = warpSize/2; offset > 0; offset /= 2)
            mySum += __shfl_down(mySum, offset);   // __shfl_down_sync() on CUDA 9+
        if (tid == 0) sdata[0] = mySum;            // only lane 0 holds the full warp sum
    }
#else
    // fully unroll reduction within a single warp;
    // volatile stops the compiler from caching sdata values in registers
    if (tid < 32) {
        volatile float *smem = sdata;
        smem[tid] += smem[tid + 32];
        smem[tid] += smem[tid + 16];
        smem[tid] += smem[tid +  8];
        smem[tid] += smem[tid +  4];
        smem[tid] += smem[tid +  2];
        smem[tid] += smem[tid +  1];
    }
#endif
    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
I will be using this to reduce a large flattened array (e.g. n = 512^3 = 134217728) on a Tesla K40 GPU. I have some questions regarding the blockSize variable and its value.
From here on, I will try to explain my understanding (either right or wrong) of how it works:
The bigger I choose blockSize, the faster this code will execute, as it will spend less time in the whole loop; but it will not finish reducing the whole array, instead returning a smaller array of size dimBlock.x, right? If I used blockSize=1, this code would return the reduction value in a single call, but it would be really slow because it would barely exploit the parallelism of CUDA. Therefore I need to call the reduction kernel several times, each time with a smaller blockSize, reducing the result of the previous call, until I get down to a single value.
Something like (pseudocode):
blocks = number;  // where do we start? why?
while (not the minimum) {
    dim3 dimBlock(blocks);
    dim3 dimGrid(n / dimBlock.x);
    int smemSize = dimBlock.x * sizeof(float);
    reduce6<<<dimGrid, dimBlock, smemSize>>>(in, out, n);
    in = out;
    n = dimGrid.x;
    dimGrid.x = n / dimBlock.x;  // is this right? Should I also change dimBlock?
}
At which value should I start? I guess this is GPU-dependent. What should the values be for a Tesla K40 (just so I understand how these values are chosen)?
Is my logic somehow flawed? How?