17

I am trying to do reduction in CUDA and I am really a newbie. I am currently studying a sample code from NVIDIA.

I guess I am really not sure how to set up the block size and grid size, especially when my input array is larger (512 X 512) than a single block size.

Here is the code.

template <unsigned int blockSize>
__global__ void reduce6(int *g_idata, int *g_odata, unsigned int n)
{
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockSize*2) + tid;
    unsigned int gridSize = blockSize*2*gridDim.x;
    sdata[tid] = 0;

    while (i < n) 
    { 
        sdata[tid] += g_idata[i] + g_idata[i+blockSize]; 
        i += gridSize; 
    }

    __syncthreads();

    if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
    if (blockSize >= 128) { if (tid < 64) { sdata[tid] += sdata[tid + 64]; } __syncthreads(); }

    if (tid < 32) 
    {
        if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
        if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
        if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
        if (blockSize >= 8) sdata[tid] += sdata[tid + 4];
        if (blockSize >= 4) sdata[tid] += sdata[tid + 2];
        if (blockSize >= 2) sdata[tid] += sdata[tid + 1];
    }

    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

However, it seems to me the g_odata[blockIdx.x] saves the partial sums from all blocks, and, if I want to get the final result, I need to sum all the terms within the g_odata[blockIdx.x] array.

I am wondering: is there a kernel to do the whole summation? or am I misunderstanding things here? I would really appreciate if anyone can educate me with this. Thanks very much.

Vitality
  • 20,705
  • 4
  • 108
  • 146
Ono
  • 1,357
  • 3
  • 16
  • 38
  • 2
    Also please note `__shared__` data should be `volatile` in above code otherwise correct final result cannot be guaranteed. It can be seen in the link @Robert provided. – Farzad Apr 08 '14 at 14:58
  • I found this to be helpful: https://www.youtube.com/playlist?list=PLxNPSjHT5qvtYRVdNN1yDcdSl39uHV_sU – rocksNwaves Jun 21 '23 at 17:57

3 Answers3

17

Your understanding is correct. The reductions demonstrated here end up with a sequence of block-sums deposited in global memory.

To sum all of these block sums together, requires some form of global synchronization. You must wait until all the blocks are complete before adding their sums together. You have a number of options at this point, some of which are:

  1. launch a new kernel after the main kernel to sum the block-sums together
  2. add the block sums on the host
  3. use atomics to add the block sums together, at the end of the main kernel
  4. use a method like threadfence reduction to add the block sums together in the main kernel.
  5. Use CUDA cooperative groups to place a grid-wide sync in the kernel code. Sum the block sums after the grid-wide sync (perhaps in one block).

If you search around the CUDA tag you can find examples of all these, and discussions of their pros and cons. To see how the main kernel you posted is used for a complete reduction, look at the parallel reduction sample code.

Robert Crovella
  • 143,785
  • 11
  • 213
  • 257
10

Robert Crovella has already answered this question, which is mainly about understanding rather than performance.

However, for all those bumping into this question, I just want to highlight that CUB makes block reduction features available. Below, I'm providing a simple worked example on how using CUB's BlockReduce.

#include <cub/cub.cuh>
#include <cuda.h>

#include "Utilities.cuh"

#include <iostream>

#define BLOCKSIZE   32

const int N = 1024;

/**************************/
/* BLOCK REDUCTION KERNEL */
/**************************/
__global__ void sum(const float * __restrict__ indata, float * __restrict__ outdata) {

    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // --- Specialize BlockReduce for type float. 
    typedef cub::BlockReduce<float, BLOCKSIZE> BlockReduceT; 

    // --- Allocate temporary storage in shared memory 
    __shared__ typename BlockReduceT::TempStorage temp_storage; 

    float result;
    if(tid < N) result = BlockReduceT(temp_storage).Sum(indata[tid]);

    // --- Update block reduction value
    if(threadIdx.x == 0) outdata[blockIdx.x] = result;

    return;  
}

/********/
/* MAIN */
/********/
int main() {

    // --- Allocate host side space for 
    float *h_data       = (float *)malloc(N * sizeof(float));
    float *h_result     = (float *)malloc((N / BLOCKSIZE) * sizeof(float));

    float *d_data;      gpuErrchk(cudaMalloc(&d_data, N * sizeof(float)));
    float *d_result;    gpuErrchk(cudaMalloc(&d_result, (N / BLOCKSIZE) * sizeof(float)));

    for (int i = 0; i < N; i++) h_data[i] = (float)i;

    gpuErrchk(cudaMemcpy(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice));

    sum<<<iDivUp(N, BLOCKSIZE), BLOCKSIZE>>>(d_data, d_result);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());

    gpuErrchk(cudaMemcpy(h_result, d_result, (N / BLOCKSIZE) * sizeof(float), cudaMemcpyDeviceToHost));

    std::cout << "output: ";
    for(int i = 0; i < (N / BLOCKSIZE); i++) std::cout << h_result[i] << " ";
    std::cout << std::endl;

    gpuErrchk(cudaFree(d_data));
    gpuErrchk(cudaFree(d_result));

    return 0;
}

In this example, an array of length N is created and the result is the sum of 32 consecutive elements. So

result[0] = data[0] + ... + data[31];
result[1] = data[32] + ... + data[63];
....
Vitality
  • 20,705
  • 4
  • 108
  • 146
  • I think iDivUp is an utility in Utilities.cuh that you do not provide in your answer. – Dimitri Lesnoff Jul 21 '22 at 14:20
  • @DimitriLesnoff Yes. Please, refer to https://github.com/vitalitylearning2021/CUDA-Utilities . – Vitality Aug 18 '22 at 06:46
  • @Vitality Thanks for your answer! quick question: what if N is not a power of 2? and, it does not fit in a block? for instance, N=1025 or even N=5000. What do you suggest? – mgNobody Sep 03 '22 at 00:54
  • @mgNobody The `if (tid < N)` takes care of it by exiting the additional threads early. – Sebastian Sep 03 '22 at 06:01
7

In order to have a better idea of this topic, you can have a look on this pdf of NVIDIA that explains, graphically, all the strategies that you have used in your code.

Leos313
  • 5,152
  • 6
  • 40
  • 69