0

Let's assume that we have following codes:

while (condition) {
  ...

  for (uint32_t gap = x >> 1; gap > 0; gap >>= 1) {
    val += __shfl_down_sync(mask, val, gap);
  }

  if (warpLane == 0)
    atomicAdd(&global_memory[threadIdx.x], val);

  ...
}

In this scenario, if threads in the warp enter the while loop as the following sequence:

all 32 threads, all 32 threads, only 16 threads.

how can I get thread mask that participates in while loop statements?

Below code may cause undefined behavior according to the guide described in https://devblogs.nvidia.com/using-cuda-warp-level-primitives:

while (condition) {
  uint32_t active = __activemask();
  for (uint32_t gap = x >> 1; gap > 0; gap >>= 1) {
    val += __shfl_down_sync(active, val, gap);
  }

  if (warpLane == 0)
    atomicAdd(&global_memory[threadIdx.x], val);

  ...
}

According to the guide, __activemask() might not generate mask as I expected.

Below also causes undefined behavior according to the above guide:

while (condition) {
  uint32_t active = __activemask();
  for (uint32_t gap = x >> 1; gap > 0; gap >>= 1) {
    val += __shfl_down_sync(active, val, gap);
  }

  if (warpLane == 0)
    atomicAdd(&global_memory[threadIdx.x], val);

  ...
  __warpsync(active);
}

Then, how I can get mask correctly?

sungjun cho
  • 809
  • 7
  • 18

1 Answers1

1

You can use cooperative groups like:

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

while (condition) { 
...
auto active = cg::coalesced_threads(); // this line can be moved out of while if the condition does not cause thread divergence

 for (uint32_t gap = x >> 1; gap > 0; gap >>= 1) { 
        //val += __shfl_down_sync(mask, val, gap);
        val += active.shfl_down(val, gap);
 }
 if (warpLane == 0)
    atomicAdd(&global_memory[threadIdx.x], val); 

... 
}

If you want to generate the mask yourself and do old fashioned you can use:

uint32_t FullMask = 0xFFFFFFFF;
uint32_t mask =  __ballot_sync(FullMask, someCondition);

However if you had further branching in your code you have to always keep track of the mask before branching and use it instead of FullMask in the ballot. So the second update before branch will be:

uint32_t newMask =  __ballot_sync(mask, someNewCondition);
Oblivion
  • 7,176
  • 2
  • 14
  • 33
  • Does coalesced_threads() in while loop guarantee synchronization in collecting active? – sungjun cho Aug 29 '19 at 12:17
  • @sungjuncho it depends on the architecture. The group (active) is synchronized in all. Threads which are not in the group are not synchronized in >Volta arch – Oblivion Aug 29 '19 at 12:19
  • @sungjuncho if you want to synchronize the entire warp you have to force everyone to enter the while loop and keep the ones that are out idle – Oblivion Aug 29 '19 at 12:22
  • do u mean that all threads calling coalesced_threads() synchronize at the moment of calling this function? (I don't care other threads that do not call this function) The reason why im asking it is that I do not want behavior like activemask() that is, active threads might cause unexpected mask since they can call activemask() at different time. – sungjun cho Aug 29 '19 at 12:29