Let's assume we have the following code:
while (condition) {
    ...
    for (uint32_t gap = x >> 1; gap > 0; gap >>= 1) {
        val += __shfl_down_sync(mask, val, gap);
    }
    if (warpLane == 0)
        atomicAdd(&global_memory[threadIdx.x], val);
    ...
}
In this scenario, suppose the threads of a warp enter the while loop in the following sequence: all 32 threads in the first iteration, all 32 threads in the second, and only 16 threads in the third. How can I obtain a thread mask that covers exactly the threads participating in the statements inside the while loop (the mask passed to __shfl_down_sync above)?
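For concreteness, a condition that would produce that 32/32/16 entry pattern might look like the sketch below (the element count of 80 and the striding are hypothetical, not taken from my real code):

uint32_t lane = threadIdx.x & 31;   // warp lane of this thread
uint32_t i = lane;
while (i < 80) {   // hypothetical condition: 80 elements, warp strides by 32
    // iteration 0: i = lane      (0..31)  -> all 32 threads enter
    // iteration 1: i = lane + 32 (32..63) -> all 32 threads enter
    // iteration 2: i = lane + 64 (64..95) -> only lanes 0..15 enter
    ...
    i += 32;
}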
The code below may cause undefined behavior according to the guide at https://devblogs.nvidia.com/using-cuda-warp-level-primitives:
while (condition) {
    uint32_t active = __activemask();
    for (uint32_t gap = x >> 1; gap > 0; gap >>= 1) {
        val += __shfl_down_sync(active, val, gap);
    }
    if (warpLane == 0)
        atomicAdd(&global_memory[threadIdx.x], val);
    ...
}
According to the guide, __activemask() might not produce the mask I expect, because it only reports the threads that happen to be converged at that instruction, and the compiler and hardware make no guarantee that these are all of the threads that logically entered the current loop iteration.
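As I understand the guide, a hypothetical illustration of that problem looks like this:

if (warpLane < 16) {
    // some divergent work
}
// The warp may or may not have reconverged at this point; the CUDA execution
// model makes no guarantee. __activemask() here could therefore report 16 or
// 32 threads, and different subsets of the warp could even execute it at
// different times.
uint32_t active = __activemask();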
The code below also causes undefined behavior according to the same guide (the only difference from the previous snippet is the trailing __syncwarp(active)):
while (condition) {
    uint32_t active = __activemask();
    for (uint32_t gap = x >> 1; gap > 0; gap >>= 1) {
        val += __shfl_down_sync(active, val, gap);
    }
    if (warpLane == 0)
        atomicAdd(&global_memory[threadIdx.x], val);
    ...
    __syncwarp(active);
}
Then, how can I get the mask correctly?
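Would something like the following be the right direction? It recomputes the membership mask with __ballot_sync() at the top of every iteration instead of calling __activemask(), assuming all 32 threads of the warp reach the loop initially (otherwise the initial mask would itself have to be obtained with a ballot before the loop):

uint32_t mask = 0xffffffffu;   // assumes all 32 threads reach the loop
while (true) {
    int keep_going = condition;   // evaluate the loop condition once per iteration
    // Every thread still named in mask executes this ballot, so the result
    // is exactly the set of threads that will run this iteration's body.
    mask = __ballot_sync(mask, keep_going);
    if (!keep_going)
        break;
    ...
    for (uint32_t gap = x >> 1; gap > 0; gap >>= 1) {
        val += __shfl_down_sync(mask, val, gap);
    }
    if (warpLane == 0)
        atomicAdd(&global_memory[threadIdx.x], val);
    ...
}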