computing the index of the max of a computed function of array elements: avoiding global memory writes

Question

Say we have n threads writing to two variables, max and index.

We read from W and p, where:

W is an array with n entries

p is a value that will be replaced by max later

Each thread i computes some function f(W[i], p).

I would like max to store the maximum of all f(W[i], p), and index to hold the thread id i that attains it, i.e. argmax_i {f(W[i], p)}.

Clearly I need the atomicMax and the atomicExch to happen together as one atomic operation.

The following pseudocode won't work, because the atomicExch on index may store the wrong value: we can't control the order in which threads reach it, so a thread that saw an older, smaller max may write index last.

for each thread id i
    oldvalue = atomicMax(max, f(W[i], p))
    if oldvalue < f(W[i], p) then
        atomicExch(index, i)    // race: can run after a thread with a larger value has already set index

Is there a workaround that minimizes global memory writes? I could pair the atomicMax and atomicExch in one critical section, but locks are questionable in CUDA. Alternatively, I could write the (index, f(W[i], p)) pairs to a queue and find the max later, but that involves many global memory writes.
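For illustration, here is a minimal sketch of a shared-memory parallel reduction, which is one way to keep global writes down to a single (value, index) pair per block. Everything named here is an assumption made for the example: the body of `f` is a placeholder, `blockArgmax`, `argmaxReduce`, `BLOCK` and the cap of 64 blocks are hypothetical, and values are assumed to be non-NaN floats greater than `-FLT_MAX`. Error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <cfloat>
#include <vector>

// Placeholder for the question's f(W[i], p); the real function is not given.
__device__ float f(float w, float p) { return w * p; }

constexpr int BLOCK = 256;  // threads per block; must be a power of two

// Each block reduces its candidates in shared memory and writes ONE pair.
__global__ void blockArgmax(const float* W, int n, float p,
                            float* blockVal, int* blockIdxOut) {
    __shared__ float sVal[BLOCK];
    __shared__ int   sIdx[BLOCK];
    int t = threadIdx.x;

    // Grid-stride loop: each thread keeps its own running best in registers.
    float best = -FLT_MAX;
    int bestI = -1;
    for (int i = blockIdx.x * blockDim.x + t; i < n; i += gridDim.x * blockDim.x) {
        float v = f(W[i], p);
        if (v > best) { best = v; bestI = i; }
    }
    sVal[t] = best;
    sIdx[t] = bestI;
    __syncthreads();

    // Tree reduction over (value, index) pairs in shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (t < s && sVal[t + s] > sVal[t]) {
            sVal[t] = sVal[t + s];
            sIdx[t] = sIdx[t + s];
        }
        __syncthreads();
    }
    if (t == 0) {  // the only global writes made by this block
        blockVal[blockIdx.x] = sVal[0];
        blockIdxOut[blockIdx.x] = sIdx[0];
    }
}

// Host helper: launches the kernel, then finishes the reduction on the CPU.
void argmaxReduce(const float* hW, int n, float p, float* maxOut, int* indexOut) {
    assert(n > 0);
    int blocks = (n + BLOCK - 1) / BLOCK;
    if (blocks > 64) blocks = 64;  // arbitrary cap; the grid-stride loop covers the rest

    float *dW, *dVal;
    int* dIdx;
    cudaMalloc(&dW, n * sizeof(float));
    cudaMalloc(&dVal, blocks * sizeof(float));
    cudaMalloc(&dIdx, blocks * sizeof(int));
    cudaMemcpy(dW, hW, n * sizeof(float), cudaMemcpyHostToDevice);

    blockArgmax<<<blocks, BLOCK>>>(dW, n, p, dVal, dIdx);

    std::vector<float> hVal(blocks);
    std::vector<int> hIdx(blocks);
    cudaMemcpy(hVal.data(), dVal, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(hIdx.data(), dIdx, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dW);
    cudaFree(dVal);
    cudaFree(dIdx);

    // Final pass over the per-block winners.
    *maxOut = hVal[0];
    *indexOut = hIdx[0];
    for (int b = 1; b < blocks; ++b) {
        if (hVal[b] > *maxOut) { *maxOut = hVal[b]; *indexOut = hIdx[b]; }
    }
}
```

A call such as `argmaxReduce(W, n, p, &max, &index)` would then fill in both results. The final pass could equally be a second kernel launch if the data should stay on the device. When several elements share the maximum, this sketch does not guarantee that the lowest index is the one reported.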

If both the max and index can be represented as 32-bit quantities, then you could use a [custom 64-bit atomic](https://stackoverflow.com/questions/17411493/how-can-i-implement-a-custom-atomic-function-involving-several-variables/17414007#17414007). Alternatively, max + index is something that can be accomplished with a classical parallel reduction. The answer in the linked duplicate question discusses both. — Robert Crovella, Oct 16 '18 at 02:59
@RobertCrovella In your custom 64-bit atomic, can you explain why you initialize to floats[2] and ints[2]. Why not just floats[1] and ints[1]? — user352102, Oct 16 '18 at 17:42
btw, which way is faster: the 64-bit atomic or parallel reduction? — user352102, Oct 16 '18 at 17:57
`floats[1]` would only provide one 32-bit quantity. I don't understand the question. We need a union that occupies 64-bits. I'm not sure what is faster, but if every thread were doing this, I would expect the parallel reduction method to be faster. — Robert Crovella, Oct 16 '18 at 19:47
@RobertCrovella why then access ints[1] as lowest index why not ints[0]? — user352102, Oct 17 '18 at 14:52
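As a rough sketch of the 64-bit idea discussed in these comments (not the code from the linked answer), the value and the index can be packed into one `unsigned long long` and updated with a single `atomicCAS` loop. The value goes in the high 32 bits so that an integer comparison orders by value first, and the index sits in the other half so the two never overlap. The names `orderedKey`, `keyToFloat`, `argmaxKernel` and `argmaxPacked`, and the body of `f`, are assumptions made for this example; NaN values are not handled and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <cstring>

// Placeholder for the question's f(W[i], p); the real function is not given.
__device__ float f(float w, float p) { return w * p; }

// Map a float to an unsigned key whose integer order matches the float order.
__device__ unsigned int orderedKey(float x) {
    unsigned int b = __float_as_uint(x);
    return (b & 0x80000000u) ? ~b : (b | 0x80000000u);
}

// Host-side inverse of orderedKey.
static float keyToFloat(unsigned int k) {
    unsigned int b = (k & 0x80000000u) ? (k & 0x7FFFFFFFu) : ~k;
    float x;
    std::memcpy(&x, &b, sizeof x);
    return x;
}

__global__ void argmaxKernel(const float* W, int n, float p,
                             unsigned long long* best) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // High 32 bits: ordered value key. Low 32 bits: thread index.
    unsigned long long mine =
        ((unsigned long long)orderedKey(f(W[i], p)) << 32) | (unsigned int)i;

    // CAS loop: value and index are replaced together or not at all.
    unsigned long long seen = *best;
    while (mine > seen) {
        unsigned long long prev = atomicCAS(best, seen, mine);
        if (prev == seen) break;  // our pair was stored
        seen = prev;              // another thread got in first; re-check
    }
}

// Host helper: runs the kernel and unpacks the winning pair.
void argmaxPacked(const float* hW, int n, float p, float* maxOut, int* indexOut) {
    assert(n > 0);
    float* dW;
    unsigned long long* dBest;
    cudaMalloc(&dW, n * sizeof(float));
    cudaMalloc(&dBest, sizeof(unsigned long long));
    cudaMemcpy(dW, hW, n * sizeof(float), cudaMemcpyHostToDevice);
    // A packed value of 0 sorts below the key of every non-NaN float.
    cudaMemset(dBest, 0, sizeof(unsigned long long));

    argmaxKernel<<<(n + 255) / 256, 256>>>(dW, n, p, dBest);

    unsigned long long h = 0;
    cudaMemcpy(&h, dBest, sizeof h, cudaMemcpyDeviceToHost);
    cudaFree(dW);
    cudaFree(dBest);

    *maxOut = keyToFloat((unsigned int)(h >> 32));
    *indexOut = (int)(h & 0xFFFFFFFFu);
}
```

With this layout, ties on the value are broken in favour of the larger index, because the index occupies the low bits of the compared word. Only threads whose pair beats the current best attempt a write, but all of them contend on one address, which is consistent with the expectation above that a reduction may be faster when every thread participates.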

