Pre 8.x equivalent of __reduce_max_sync() in CUDA

Question

cuda-memcheck has detected a race condition in the code that does the following:

condition = /*different in each thread*/;
shared int owner[nWarps];
/* ... owner[i] is initialized to blockDim.x+1 */
if(condition) {
    owner[threadIdx.x/32] = threadIdx.x;
}

So basically this code computes the owner thread for each warp based on some condition. For some warp there could be no owner, but for some the number of owners can be more than 1, and then a race condition happens because multiple threads assign a value to the same shared memory region.

After trying the docs, I think what I need can be done with:

const uint32_t mask = __ballot_sync(0xffffffff, condition);
if(mask != 0) {
    const unsigned max_owner = __reduce_max_sync(mask, threadIdx.x);
    if(threadIdx.x == max_owner) {
        // at most 1 thread assigns here per warp
        owner[threadIdx.x/32] = max_owner;
    }
}

However, my attempt has 2 issues:

I don't really need to find the max thread - it's enough to select any 1 thread for each warp if there is a thread with condition==true
It requires CUDA compute capability 8.x, while I need to support devices of 5.2 compute capability

Could you please help me solve the above issues?

Why is letting the race condition decide, not suitable? If you declare the shared memory volatile and synchronize the warp, you can read back and know, who the owner is. — Sebastian, Oct 17 '21 at 08:42
@Sebastian , `cuda-memcheck --tool racecheck` complains about such code with level `ERROR`. — Serge Rogatch, Oct 17 '21 at 18:26

score 1 · Answer 1 · answered Oct 16 '21 at 15:01

The following function seems to solve the problem:

void SetOwnerThread(int* dest, const bool condition) {
  const uint32_t mask = __ballot_sync(0xffffffff, condition);
  if(!mask) {
    return;
  }
  const uint32_t lowest_bit = mask & -mask;
  const uint32_t my_bit = (1 << (threadIdx.x & 31));
  if(lowest_bit == my_bit) {
    dest = threadIdx.x;
  }
}

Pre 8.x equivalent of __reduce_max_sync() in CUDA

1 Answers1