It's easiest to explain via cub::LaneId()
or a function like the following:
inline __device__ unsigned get_lane_id() {
unsigned ret;
asm volatile("mov.u32 %0, %laneid;" : "=r"(ret));
return ret;
}
Versus computing the lane ID as threadIdx.x & 31
.
Do these 2 approaches produce the same value in a 1D grid?
__ballot_sync()
documentation speaks of lane IDs in its mask
parameter, and as I understand it returns the bits set per lane ID. So would the following asserts never fail?
int nWarps = /*...*/;
bool condition = /*...*/;
if(threadIdx.x < nWarps) {
assert(__activemask() == ((1u<<nWarps)-1));
uint32_t res = __ballot_sync(__activemask(), condition);
assert(bool(res & (1<<threadIdx.x)) == condition);
}