Finding max value in CUDA

Question

I am trying to write a code in CUDA for finding the max value for the given set of numbers.

Assume you have 20 numbers, and the kernel is running on 2 blocks of 5 threads. Now assume the 10 threads compare the first 10 values at the same time, and thread 2 finds a max value, so thread 2 is updating the max value variable in global memory. While thread 2 is updating, what will happen to the remaining threads (1,3-10) that will be comparing using the old value?

If I lock the global variable using atomicCAS(), will the threads (1,3-10) compare using the old max value? How can I overcome this problem?

jwdmsd · Answer 1 · 2014-11-13T08:08:13.550

11

This is a purely a reduction problem. Here's a good presentation by NVIDIA for optimizing reduction on GPUs. You can use the same technique to either find the minimum, maximum or sum of all elements.

edited Nov 13 '14 at 08:08

answered Mar 10 '11 at 06:12

jwdmsd

2,107
2
16
30

1

Related question to the presentation: [Sum reduction with CUDA: What is N?](https://stackoverflow.com/q/8176488/673852) The accepted answer there suggests there's a typo in the final kernel code. – Ruslan Jun 06 '19 at 13:56

score 1 · Answer 2 · answered Aug 11 '17 at 08:30

1

The link for Thrust library is broken.
If anyone finds it useful to use it in this case, you can find the documentation here:
Thrust, extrema reductions

answered Aug 11 '17 at 08:30

Panos Kalatzantonakis

12,525
8
64
85

score -1 · Answer 3 · answered Mar 10 '11 at 07:42

-1

Unless you're trying to write a reduction kernel, the simplest way is simply to use the CUBLAS.

answered Mar 10 '11 at 07:42

Edric

23,676
2
38
40

score -1 · Answer 4 · answered Oct 17 '19 at 15:18

-1

I looked for the same answer but found most are too formidable to a newbie like me. Here is my example code to find the max. Please let me know if this is used properly.

__global__
void find_max(int max_x, int max_y, float *tot, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    int j = blockIdx.y*blockDim.y + threadIdx.y;
    if(i < max_x && j<max_y) {
        if(*tot < x[i])
            atomicExch(tot, x[i]);
    }
}

answered Oct 17 '19 at 15:18

William

75
7

1

There's probably a race condition there: compare succeeds, another thread stores a new max, then this thread overwrites that with its max (which might be less than the one another thread stored). This is why the question mentions CAS, not Exch. (Also, you modify `x[i]` for no reason, and your function doesn't even use the `y[]` arg.) It also looks horribly inefficient to have everything atomically exchange into one shared counter instead of having a local max over many points and then combine results from multiple threads at the end. (I don't know CUDA so IDK how to do that.) – Peter Cordes Dec 03 '19 at 10:36

Finding max value in CUDA

4 Answers4

Linked