11

I've been learning CUDA and I am still getting to grips with parallelism. The problem I am having at the moment is implementing a max reduce on an array of values. This is my kernel:

__global__ void max_reduce(const float* const d_array,
                     float* d_max,
                     const size_t elements)
{
    extern __shared__ float shared[];

    int tid = threadIdx.x;
    int gid = (blockDim.x * blockIdx.x) + tid;

    if (gid < elements)
        shared[tid] = d_array[gid];
    __syncthreads();

    for (unsigned int s=blockDim.x/2; s>0; s>>=1) 
    {
        if (tid < s && gid < elements)
            shared[tid] = max(shared[tid], shared[tid + s]);
        __syncthreads();
    }

    if (gid == 0)
        *d_max = shared[tid];
}

I have implemented a min reduce using the same method (replacing the max function with the min) which works fine.

To test the kernel, I found the min and max values using a serial for loop. The min and max values always come out the same in the kernel but only the min reduce matches up.
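The serial check is just a straightforward loop over a host copy of the array, along these lines (a sketch; h_array and elements are illustrative names for the host data and its length):

// Serial reference min/max on the host.
float serial_min = h_array[0];
float serial_max = h_array[0];
for (size_t i = 1; i < elements; ++i)
{
    if (h_array[i] < serial_min) serial_min = h_array[i];
    if (h_array[i] > serial_max) serial_max = h_array[i];
}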

Is there something obvious I'm missing/doing wrong?

CurtisJC
  • You should perhaps initialize your shared memory to -FLT_MAX for max and FLT_MAX for min. – Pavan Yalamanchili Jun 28 '13 at 18:43
  • @PavanYalamanchili The shared memory is being filled with the global array, so is there any need for putting -FLT_MAX in there? Additionally, the max value I'm getting back from the parallel function is less than the serial max for some reason. – CurtisJC Jun 28 '13 at 20:16
  • In the last block there will be a few elements of shared memory that are not set (when gid >= elements). This will cause problems. – Pavan Yalamanchili Jun 28 '13 at 23:23
  • @PavanYalamanchili (in reference to my deleted answer) - So I need to synchronise the device after the kernel, and then launch another kernel to combine the results? – CurtisJC Jun 29 '13 at 09:29
  • You don't have to synchronize the device; the kernels are queued up in order. – Pavan Yalamanchili Jun 29 '13 at 09:42

2 Answers

21

The main conclusion in your deleted answer was correct: the kernel you posted doesn't account for the fact that at the end of its execution you have done a good deal of the overall reduction, but the results are not quite complete; the result from each block must still be combined (somehow). As pointed out in the comments, there are a few other issues with your code as well. Let's take a look at a modified version of it:

__device__ float atomicMaxf(float* address, float val)
{
    // Build an atomic max for float out of atomicCAS (which operates on 32-bit
    // integers): reinterpret the float bits as int, and keep attempting the swap
    // while val is still greater than the currently stored value.
    int *address_as_int = (int*)address;
    int old = *address_as_int, assumed;
    while (val > __int_as_float(old)) {
        assumed = old;
        old = atomicCAS(address_as_int, assumed,
                        __float_as_int(val));
    }
    return __int_as_float(old);
}


__global__ void max_reduce(const float* const d_array, float* d_max, 
                                              const size_t elements)
{
    extern __shared__ float shared[];

    int tid = threadIdx.x;
    int gid = (blockDim.x * blockIdx.x) + tid;
    shared[tid] = -FLT_MAX;  // 1  (-FLT_MAX comes from <float.h>)

    if (gid < elements)
        shared[tid] = d_array[gid];
    __syncthreads();

    for (unsigned int s=blockDim.x/2; s>0; s>>=1) 
    {
        if (tid < s && gid < elements)
            shared[tid] = max(shared[tid], shared[tid + s]);  // 2
        __syncthreads();
    }
    // what to do now?
    // option 1: save block result and launch another kernel
    if (tid == 0)        
        d_max[blockIdx.x] = shared[tid]; // 3
    // option 2: use atomics
    if (tid == 0)
      atomicMaxf(d_max, shared[0]);
}
  1. As Pavan indicated, you need to initialize your shared memory array. The last block launched may not be a "full" block, if gridDim.x*blockDim.x is greater than elements.
  2. Note that in this line, even though we are checking that the thread operating (gid) is less than elements, when we add s to tid to index into shared memory we can still index outside of the legitimate values copied into shared memory, in the last block. Therefore we need the shared memory initialization indicated in note 1.
  3. As you already discovered, your last line was not correct. Each block produces its own result, and we must combine them somehow. One method you might consider, if the number of blocks launched is small (more on this later), is to use atomics. Normally we steer people away from atomics since they are "costly" in terms of execution time. However, the other option is to save each block's result in global memory, finish the kernel, and then launch another kernel to combine the individual block results (a host-side sketch of this two-pass approach follows this list). If I have launched a large number of blocks initially (say, more than 1024), this methodology might require two additional kernel launches. Hence the consideration of atomics. There is no native atomicMax function for floats, but as indicated in the documentation, you can use atomicCAS to construct any arbitrary atomic function; atomicMaxf above provides an atomic max for float.
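To make option 1 concrete, a host-side sketch of the two-pass approach might look like this (assuming the kernel keeps only the "option 1" line; threads, blocks, d_partial, and d_result are illustrative names, not part of the original code):

// Hypothetical host-side sketch of option 1: two launches, no atomics.
int threads = 256;
int blocks  = (int)((elements + threads - 1) / threads);

float *d_partial, *d_result;
cudaMalloc(&d_partial, blocks * sizeof(float));
cudaMalloc(&d_result, sizeof(float));

// Pass 1: each block writes its partial max into d_partial[blockIdx.x].
max_reduce<<<blocks, threads, threads * sizeof(float)>>>(d_array, d_partial, elements);

// Pass 2: a single block reduces the per-block results (assumes blocks <= threads;
// otherwise repeat the process). No explicit synchronization is needed between
// the two launches -- kernels issued to the same stream execute in order.
max_reduce<<<1, threads, threads * sizeof(float)>>>(d_partial, d_result, blocks);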

But is running 1024 or more atomic functions (one per block) the best way? Probably not.

When launching a grid of threadblocks, we really only need to launch enough of them to keep the machine busy. As a rule of thumb we want at least 4-8 warps operating per SM, and somewhat more is probably a good idea. But there's no particular benefit, from a machine-utilization standpoint, to launching thousands of threadblocks initially. If we pick a number like 8 threadblocks per SM, and we have at most, say, 14-16 SMs in our GPU, this gives us a relatively small number of 8*14 = 112 threadblocks. Let's choose 128 (8*16) for a nice round number. There's nothing magical about this; it's just enough to keep the GPU busy. If we make each of these 128 threadblocks do additional work to solve the whole problem, we can then leverage our use of atomics without (perhaps) paying too much of a penalty for doing so, and avoid multiple kernel launches. So how would this look?:

__device__ float atomicMaxf(float* address, float val)
{
    // same atomicCAS-based atomic max for float as shown above
    int *address_as_int = (int*)address;
    int old = *address_as_int, assumed;
    while (val > __int_as_float(old)) {
        assumed = old;
        old = atomicCAS(address_as_int, assumed,
                        __float_as_int(val));
    }
    return __int_as_float(old);
}


__global__ void max_reduce(const float* const d_array, float* d_max, 
                                              const size_t elements)
{
    extern __shared__ float shared[];

    int tid = threadIdx.x;
    int gid = (blockDim.x * blockIdx.x) + tid;
    shared[tid] = -FLT_MAX;   // -FLT_MAX comes from <float.h>

    // grid-stride loop: each thread folds multiple input elements into its shared slot
    while (gid < elements) {
        shared[tid] = max(shared[tid], d_array[gid]);
        gid += gridDim.x*blockDim.x;
    }
    __syncthreads();
    gid = (blockDim.x * blockIdx.x) + tid;  // 1
    for (unsigned int s=blockDim.x/2; s>0; s>>=1) 
    {
        if (tid < s && gid < elements)
            shared[tid] = max(shared[tid], shared[tid + s]);
        __syncthreads();
    }

    if (tid == 0)
      atomicMaxf(d_max, shared[0]);
}

With this modified kernel, when creating the kernel launch, we are not deciding how many threadblocks to launch based on the overall data size (elements). Instead we are launching a fixed number of blocks (say, 128; you can modify this number to find out what runs fastest), and letting each threadblock (and thus the entire grid) loop through memory, computing partial max operations on each element in shared memory. Then, in the line marked with comment 1, we must re-set the gid variable to its initial value. This is actually unnecessary, and the block reduction loop code can be further simplified, if we guarantee that the size of the grid (gridDim.x*blockDim.x) is less than elements, which is not difficult to do at kernel launch.

Note that when using this atomic method, it's necessary to initialize the result (*d_max in this case) to an appropriate value, like -FLT_MAX.
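A host-side launch for this version might look roughly like the following sketch; the 256-thread/128-block figures, the init variable, and the optional SM-count query are illustrative choices, not part of the original code:

// Hypothetical launch sketch for the grid-stride + atomic version.
int threads = 256;
int blocks  = 128;   // fixed block count, independent of elements; tune as desired

// Optionally derive the block count from the SM count (e.g. 8 blocks per SM):
// int numSMs;
// cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, 0);
// blocks = 8 * numSMs;

// Initialize the result to -FLT_MAX before launching the kernel.
float init = -FLT_MAX;
cudaMemcpy(d_max, &init, sizeof(float), cudaMemcpyHostToDevice);

max_reduce<<<blocks, threads, threads * sizeof(float)>>>(d_array, d_max, elements);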

Again, we normally steer people away from atomic usage, but in this case it's worth considering if we carefully manage it, and it allows us to save the overhead of an additional kernel launch.

For a ninja-level analysis of how to do fast parallel reductions, take a look at Mark Harris' excellent whitepaper which is available with the relevant CUDA sample.

Robert Crovella
  • Very comprehensive answer, I can see you put a lot of effort into re-writing my code! I'm starting to understand parallel processing a little better now. – CurtisJC Jun 29 '13 at 22:58
  • Actually I thought you had almost everything right. Most of the code in my answer is yours. – Robert Crovella Jun 29 '13 at 23:13
  • In the first snippet of code, wouldn't it be more useful to have the check for (gid < elements) outside the for loop? I just want to be sure I get it correctly. – Roxana Istrate Feb 09 '16 at 12:30
  • Perhaps you should ask a new question. With that change (and only that change) you run into the possibility of illegal use of `__syncthreads()`. – Robert Crovella Feb 09 '16 at 15:21
-1

Here's one that appears naive but isn't. This won't generalize to other functions like sum(), but it works great for min() and max().

__device__ const float float_min = -3.402e+38;

__global__ void maxKernel(float* d_data)
{ 
    // compute max over all threads, store max in d_data[0]
    int i = threadIdx.x;
    __shared__ float max_value;

    if (i == 0) max_value = float_min;
    float v = d_data[i];
    __syncthreads();

    while (max_value < v) max_value = v;

    __syncthreads();
    if (i == 0) d_data[0] = max_value;
}

Yup, that's right, only syncing once after initialization and once before writing the result. Damn the race conditions! Full speed ahead!

Before you tell me it won't work, please give it a try first. I have tested thoroughly and it works every time on a variety of arbitrary kernel sizes. It turns out that the race condition doesn't matter in this case because the while loop resolves it.

It works significantly faster than a conventional reduction. Another surprise is that the average number of passes for a kernel size of 32 is 4. Yup, that's (log(n)-1), which seems counterintuitive. It's because the race condition gives an opportunity for good luck. This bonus comes in addition to removing the overhead of the conventional reduction.

With larger n, there is no way to avoid at least one iteration per warp, but that iteration only involves one compare operation, which is usually immediately false across the warp when max_value is at the high end of the distribution. You could modify it to use multiple SMs, but that would greatly increase the total workload and add a communication cost, so it's not likely to help.

For terseness I've omitted the size and output arguments. Size is simply the number of threads (which could be 137 or whatever you like). Output is returned in d_data[0].
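A single-block launch of this kernel might look roughly like the following sketch (n and result are illustrative names; as written, the kernel assumes one block with one thread per element):

// Hypothetical launch sketch: one block, one thread per element.
int n = 137;                        // number of elements (and threads)
float* d_data;
cudaMalloc(&d_data, n * sizeof(float));
// ... fill d_data with n values ...

maxKernel<<<1, n>>>(d_data);        // max is left in d_data[0]

float result;
cudaMemcpy(&result, d_data, sizeof(float), cudaMemcpyDeviceToHost);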

I've uploaded the working file here: https://github.com/kenseehart/YAMR

Ken Seehart
  • what happens if max(d_data) < 0? Good point, wasn't in my use case, so overlooked. I've updated YAMR (see link) to use `const float float_min = -3.402e+38;` instead of 0.0f. – Ken Seehart Mar 01 '21 at 04:17
  • "what happens if you have more than 1024 input values to reduce? " I'd iterate. With the usual reduce pattern this would be suboptimal, but because max favors serialization, I think iteration is good. The idea is that if you are starting with the max of 1024 values, your second iteration would get one hit and that decreases quickly. – Ken Seehart Mar 01 '21 at 04:23
  • ... but my use case is not IO bound since the data for which I am computing the max is computed earlier in the same kernel. If your case is IO bound and you don't have other work for other SMs, you might do some kind of grid compromise. Just keep in mind that max favors some serialization, so you want to have the compare return false as much as possible. That's why the traditional grid reduction is probably suboptimal. – Ken Seehart Mar 01 '21 at 04:35
  • I should mention that the race condition consideration is a bit more nuanced than I have indicated. For details, see the github link in my answer. – Ken Seehart Mar 01 '21 at 05:24
  • Not sure how you are getting a 404. Works for me: https://github.com/kenseehart/YAMR – Ken Seehart Mar 01 '21 at 07:39
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/229334/discussion-between-ken-seehart-and-talonmies). – Ken Seehart Mar 01 '21 at 07:42
  • There must have been some strangeness in Github's CDN for my part of the world or something. https://pastebin.com/zY6HykJW is what a run on Google Colab looks like. I can't help but notice it prints out FAIL – talonmies Mar 01 '21 at 08:37
  • I think I was a bit hasty posting that code for public consumption. It works perfectly for my use case (96 values originating on the SM in three warps) and is well tested for that. I get consistent results up to 24 warps. But when I exceed 24 warps, I get a problematic race condition, so I think my reasoning doesn't hold in the general case. I've updated the code to be safer. So you are welcome to try it. Now it iterates with only one warp. In that form, it is really only beneficial in use cases where the data being maxed is produced by the kernel and is already in shared memory. – Ken Seehart Mar 02 '21 at 02:35