
First of all, let me state that I am fully aware that my question has already been asked: Block reduction in CUDA. However, as I hope to make clear, my question is a follow-up to that one, and I have particular needs that make the solution found by that OP unsuitable for me.

So, let me explain. In my current code, I run a CUDA kernel at every iteration of a while-loop to do some computations over the values of an array. As an example, think of it like the following:

__global__ void calcKernel(int* idata, int* odata, int n)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)
    {
        odata[i] = (idata[i] + 2) * 5;
    }
}

int max_iterations = 1000;
int iteration = 0;
while(iteration < max_iterations)
{
    // launch configuration (num_blocks, threads_per_block) chosen elsewhere
    calcKernel<<<num_blocks, threads_per_block>>>(idata, odata, n);
    iteration++;
}

However, next I have to accomplish a seemingly hard task for the GPU. At each iteration of the while-loop that calls the kernel, I have to sum all the values generated within odata and save the result in an int array called result, at the position in that array corresponding to the current iteration. It has to be accomplished inside the kernel, or at least still on the GPU, because due to performance constraints I can only retrieve the result array at the very end, after all iterations are completed.

A wrong naïve attempt would look something like the following:

__global__ void calcKernel(int* idata, int* odata, int* result, int iteration, int n)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)
    {
        odata[i] = (idata[i] + 2) * 5;
    }

    // naive attempt: sum all of odata into result[iteration]
    result[iteration] = 0;
    for(int j = 0; j < n; j++)
    {
        result[iteration] += odata[j];
    }
}

int max_iterations = 1000;
int iteration = 0;
while(iteration < max_iterations)
{
    calcKernel<<<num_blocks, threads_per_block>>>(idata, odata, result, iteration, n);
    iteration++;
}

Of course, the code above does not work, because the GPU distributes the code across threads: every thread would execute that summation loop at once, racing on result[iteration]. In order to learn how to do this properly, I have been reading other questions here on the site about array reduction using CUDA. In particular, I found a mention of a very good NVIDIA PDF on the subject, which is also discussed in the former SO question I mentioned at the beginning: http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf

However, while I fully understand the steps of the code described in those slides, as well as the general optimizations, I don't get how that approach can sum-reduce an array to one number when the code actually outputs a full array (and one of unclear dimensions). Could someone please shed some light on it and show me an example of how that would work (i.e. how to get the one number out of the output array)?

Now, going back to the question I mentioned at the beginning (Block reduction in CUDA). Note that its accepted answer merely suggests reading the PDF I linked above, which does not explain what to do with the output array generated by the code. In the comments, the OP there mentions that he/she was able to finish the job by summing the output array on the CPU, which is something I cannot do, since that would mean downloading the output array at every iteration of my while-loop. Lastly, the third answer in that link suggests using a library to accomplish this, but I am interested in learning the native way of doing it.

Alternatively, I would also be very interested in any other suggestions on how to implement what I described above.

  • Would you consider using several kernels, instead of a single one? – Ricardo Ortega Magaña Mar 01 '17 at 18:51
  • @RicardoOrtegaMagaña Sure, provided that I don't transfer from/to GPU memory, calling more than one kernel won't be a problem per se. – user123443563 Mar 01 '17 at 19:59
  • I would recommend then using 2 kernels: one for the first calculations, and another kernel just for the adding, with the loop in the host program. I need more information about how your program works, but with the code you showed us, that might be an easy way to solve your problem. – Ricardo Ortega Magaña Mar 01 '17 at 20:58
  • @RicardoOrtegaMagaña I'm now trying just that, but I'm struggling with the last step: exactly the same thing that was originally asked in the question I linked to (but which was solved there on the CPU side). That is, I'm now using a kernel to do the reduction, but I end up with an array, not a single number, and I want to further reduce this final array to one single number within the GPU. – user123443563 Mar 01 '17 at 21:07
  • In the example I gave you, the result is in the first element of the array. If you want it in a single variable (memory location), pass a pointer to the kernel and assign the value to the location you want; that way, when you want to read it, from the host or the device, you can just use that pointer. – Ricardo Ortega Magaña Mar 01 '17 at 21:10
  • Would you consider using the Thrust API to do the sum for you? It decouples the functionality and is easy to use. An API call on the GPU wouldn't be that time-consuming, from my own experience; tens of microseconds should do the job. – WDC Apr 25 '18 at 00:45

2 Answers


You have already found the canonical information regarding block parallel reductions, so I will not repeat that. If you don't want to write a lot of new code yourself to do this, I would suggest looking at the CUB library's BlockReduce implementation, which provides an optimal block-wise reduction operation with the addition of about 4 lines of code to your existing kernel.
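
For illustration, here is a minimal sketch of what that might look like, assuming CUB is available; the kernel name, the template block size and the block_sums output array are illustrative assumptions rather than part of this answer:

#include <cub/cub.cuh>

template <int BLOCK_THREADS>
__global__ void calcAndReduce(const int* idata, int* block_sums, int n)
{
    // each thread computes its own value first
    int i = blockIdx.x * BLOCK_THREADS + threadIdx.x;
    int my_val = (i < n) ? (idata[i] + 2) * 5 : 0;

    // block-wide sum via CUB; this is roughly the "4 lines" mentioned above
    typedef cub::BlockReduce<int, BLOCK_THREADS> BlockReduce;
    __shared__ typename BlockReduce::TempStorage temp_storage;
    int block_sum = BlockReduce(temp_storage).Sum(my_val);

    // only thread 0 of each block holds the valid aggregate
    if (threadIdx.x == 0)
        block_sums[blockIdx.x] = block_sum;
}

From there the per-block sums still have to be combined; the kernel pattern below does that directly with an atomic instead of writing per-block sums out.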

On the real question here, you can do what you want if you do something like this:

__global__ void kernel(....., int* iter_result, int iter_num) {

    // Your calculations first so that each thread holds its result

    // Block-wise reduction so that one thread in each block holds the sum of the thread results

    // The one thread holding the block result adds it to the global result for this iteration
    if (threadIdx.x == 0)
        atomicAdd(iter_result + iter_num, block_result);
}

The key here is that an atomic function is used to safely update the per-iteration result with the result from a given block, without a memory race. You absolutely must initialise iter_result (to zero) before running the kernel, otherwise the code won't work, but that is the basic kernel design pattern.
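
A minimal sketch of how that skeleton might be filled in, using a simple shared-memory tree reduction instead of CUB; BLOCK_SIZE (assumed to be a power of two), the d_ pointer names and the launch configuration are illustrative assumptions:

#define BLOCK_SIZE 256

__global__ void calcKernel(const int* idata, int* odata, int* iter_result, int iter_num, int n)
{
    __shared__ int sdata[BLOCK_SIZE];

    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // per-thread calculation (out-of-range threads contribute 0 to the sum)
    int val = (i < n) ? (idata[i] + 2) * 5 : 0;
    if (i < n)
        odata[i] = val;          // keep the per-element result as before
    sdata[threadIdx.x] = val;
    __syncthreads();

    // block-wise tree reduction in shared memory
    for (int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }

    // one thread per block adds the block sum to this iteration's slot
    if (threadIdx.x == 0)
        atomicAdd(iter_result + iter_num, sdata[0]);
}

// host side: zero the result array once, then launch once per iteration
int num_blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
cudaMemset(d_result, 0, max_iterations * sizeof(int));
for (int iteration = 0; iteration < max_iterations; ++iteration)
    calcKernel<<<num_blocks, BLOCK_SIZE>>>(d_idata, d_odata, d_result, iteration, n);

Only d_result needs to be copied back to the host after the loop, which satisfies the constraint of transferring data just once at the end.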

talonmies
  • Very insightful. I was just able to properly implement that and it works. The only downside in my case is that I'm working with chars and atomicAdd does not work with chars, so I had to convert the relevant parts, wasting a bit of memory and a bit of performance. Still, in a pure comparison "atomicAdd" versus "just saving block results in the output array", I'm surprised by how little performance was lost by using atomicAdd. – user123443563 Mar 01 '17 at 23:54
  • @talonmies For very large arrays, won't recursive or iterative use of the reduction kernel be faster than the atomic add after just 1 pass/iteration of the kernel? Suppose I have an array of 262144 elements, my threads per block are 256, and my blocks per grid are 1024. So we will first have to collect 1024 values at index 0 from each block's output and then do something with these 1024 elements. If an atomic operation is used, won't it always be slower than reducing these 1024 elements again? Perhaps user123443563 can guide as well. (I am new to CUDA; apologies if this is a basic question.) – Mashhood Ahmad Apr 01 '22 at 21:00

If you add 2 contiguous numbers and save the result in one of the slots where those numbers were, you only have to run the same kernel multiple times to keep reducing the array's partial sums by powers of 2, as in this example:

Array of values to sum:

[·1,·2,·3,·4,·5,·6,·7,·8,·9,·10]

First run n/2 threads to sum contiguous pairs of array elements, storing each sum in the "left" element of its pair; the array will now look like:

[·3,2,·7,4,·11,6,·15,8,·19,10]

Run the same kernel again with n/4 threads, now adding the partial sums that are 2 positions apart and storing each result in the leftmost element of its group; the array will now look like:

[·10,2,7,4,·26,6,15,8,·19,10]

Run the same kernel with n/8 threads, now adding the partial sums that are 4 positions apart and storing the result in the leftmost element, to obtain:

[·36,2,7,4,26,6,15,8,·19,10]

Run it one last time with a single thread to add the partial sums that are 8 positions apart and store the result in the leftmost element, to obtain:

[55,2,7,4,26,6,15,8,19,10]

This way, you only have to run your kernel repeatedly, passing the number of threads as a parameter, to obtain the final reduction in the first element (55). Look at the dots (·) to see which elements of the array are "active" (i.e. get summed) on each run.
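
A minimal sketch of how this could be written, where reduceStep, d_data and threads_per_block are assumed names for illustration and the data lives in a device array of n elements:

__global__ void reduceStep(int* data, int n, int stride)
{
    // each active slot sits stride*2 apart and absorbs the slot
    // stride positions to its right, when that slot exists
    int idx = (blockIdx.x * blockDim.x + threadIdx.x) * stride * 2;
    if (idx + stride < n)
        data[idx] += data[idx + stride];
}

// host side: launch the same kernel repeatedly, halving the number of
// active elements on each pass
for (int stride = 1; stride < n; stride *= 2)
{
    int active = (n + 2 * stride - 1) / (2 * stride);            // threads needed this pass
    int blocks = (active + threads_per_block - 1) / threads_per_block;
    reduceStep<<<blocks, threads_per_block>>>(d_data, n, stride);
}
// after the loop, d_data[0] holds the total sum (55 in the example above)

Because each pass is a separate kernel launch on the same stream, the passes are serialized automatically, so no extra synchronization is needed between them.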

  • Thanks for your answer. Although I understand the logic, I don't see how to implement it. Could you give me a small example? It looks like you are talking about exactly the same thing that was proposed in the `pdf` I've linked in the question. – user123443563 Mar 01 '17 at 07:59