cuda-memcheck
has detected a race condition in the code that does the following:
condition = /*different in each thread*/;
shared int owner[nWarps];
/* ... owner[i] is initialized to blockDim.x+1 */
if(condition) {
owner[threadIdx.x/32] = threadIdx.x;
}
So basically this code computes the owner thread for each warp based on some condition. For some warp there could be no owner, but for some the number of owners can be more than 1, and then a race condition happens because multiple threads assign a value to the same shared memory region.
After trying the docs, I think what I need can be done with:
const uint32_t mask = __ballot_sync(0xffffffff, condition);
if(mask != 0) {
const unsigned max_owner = __reduce_max_sync(mask, threadIdx.x);
if(threadIdx.x == max_owner) {
// at most 1 thread assigns here per warp
owner[threadIdx.x/32] = max_owner;
}
}
However, my attempt has 2 issues:
- I don't really need to find the max thread - it's enough to select any 1 thread for each warp if there is a thread with
condition==true
- It requires CUDA compute capability 8.x, while I need to support devices of 5.2 compute capability
Could you please help me solve the above issues?