Say we have n threads writing to two variables, max and index.
We read from W and p, where:
W is an array with n entries
p is a value that will later be replaced by max
Each thread i computes some function f(W[i], p).
I would like max to store the maximum of all f(W[i], p), and index to hold the thread id i that attains it, i.e. argmax_i f(W[i], p).
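Clearly the atomicMax and the atomicExch need to happen together as one atomic operation.
The following pseudocode won't work, because we can't control when each thread's atomicExch on index actually lands: another thread with a larger value can update max and index between this thread's atomicMax and its atomicExch, so index may end up naming a thread that does not hold the maximum.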
for each thread id i
    oldvalue = atomicMax(max, f(W[i], p))
    if oldvalue < f(W[i], p) then
        atomicExch(index, i)
Is there a workaround that minimizes global memory writes? I could put the atomicMax and the atomicExch together in one critical section, but locks are questionable in CUDA. Alternatively, I could write the (i, f(W[i], p)) pairs to a queue and find the max afterwards, but that involves many global memory writes.
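One direction that might avoid both the lock and the queue is to pack the value and the thread id into a single 64-bit word, so that one atomicMax updates both at once. The following is only a minimal sketch under assumptions that are not stated above: f returns a 32-bit int, the placeholder f and the helpers pack, unpack_value and unpack_index are hypothetical names, and the device supports the 64-bit atomicMax overload (on older hardware an atomicCAS loop on the same packed word would be the fallback).

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for f(W[i], p); the real function is not given.
__host__ __device__ inline int f(int w, int p) { return w - p; }

// Pack (value, index) into one 64-bit key. The value goes in the high
// 32 bits with its sign bit flipped, so that unsigned comparison of keys
// matches signed comparison of values. The index goes in the low 32 bits.
__host__ __device__ inline unsigned long long pack(int value, unsigned int index) {
    unsigned int ordered = static_cast<unsigned int>(value) ^ 0x80000000u;
    return (static_cast<unsigned long long>(ordered) << 32) | index;
}

// Recover the signed value from the high 32 bits of a key.
__host__ __device__ inline int unpack_value(unsigned long long key) {
    return static_cast<int>(static_cast<unsigned int>(key >> 32) ^ 0x80000000u);
}

// Recover the thread index from the low 32 bits of a key.
__host__ __device__ inline unsigned int unpack_index(unsigned long long key) {
    return static_cast<unsigned int>(key & 0xffffffffull);
}

// Each thread issues a single 64-bit atomicMax, so max and index can
// never be observed out of sync. *best must be zeroed before the launch;
// a key of 0 corresponds to (INT_MIN, index 0).
__global__ void argmax_kernel(const int *W, int n, int p, unsigned long long *best) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicMax(best, pack(f(W[i], p), static_cast<unsigned int>(i)));
    }
}
```

On the host, best would be cleared (for example with cudaMemset) before the launch and decoded with unpack_value and unpack_index afterwards. With this packing, ties on the value resolve to the larger thread id. This still issues one global atomic per thread; reducing within a warp or block first and issuing one atomicMax per group would presumably cut global traffic further.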