I'm trying to update some older CUDA code (pre CUDA 9.0), and I'm having some difficulty updating its use of warp shuffles (e.g., __shfl).
Basically the relevant part of the kernel might be something like this:
int f = d[threadIdx.x];
int warpLeader = <something in [0, 32)>;
// Point being: some threads in the warp exit the loop earlier than others
// (i < stop fails for them first).
for (int i = k; i < stop; i += skip)
{
    // Point being: of the threads still in the loop, potentially even fewer
    // take this branch and see the shuffle below.
    if (mem[threadIdx.x + i/2] == foo)
    {
        // Pre CUDA 9.0.
        f = __shfl(f, warpLeader);
    }
}
Maybe that's not the best example (the real code is too complex to post), but the two things that were easy to accomplish with the old intrinsics were:
- Shuffle/broadcast to whatever threads happen to be active at this point in the code.
- Still get to refer to the source thread by its warp-relative (lane) index.
I'm not sure how to do the above post CUDA 9.0.
This question is almost/partially answered here: How can I synchronize threads within warp in conditional while statement in CUDA?, but I think that post has a few unresolved questions.
I don't believe __shfl_sync(__activemask(), ...) will work. This was noted in the linked question and many other places online.
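For concreteness, the naive mechanical translation I'm talking about (re-using the names from the snippet above) would be:

// Naive replacement of the pre-9.0 call in the snippet above.
// __activemask() only reports which threads happen to be converged at
// this instruction; it doesn't force convergence, and it doesn't
// guarantee that warpLeader's thread is in the mask, in which case the
// shuffled value is undefined.
if (mem[threadIdx.x + i/2] == foo)
{
    f = __shfl_sync(__activemask(), f, warpLeader);
}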
The linked question says to use coalesced_group, but my understanding is that this kind of cooperative group re-ranks the threads, so if you had a particular warpLeader (on [0, 32)) in mind, as above, I'm not sure there's a way to "figure out" its new rank within the coalesced_group.
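To illustrate the re-ranking concern, here is a toy kernel (hypothetical names, just to show the mismatch): only the odd lanes take the branch, so inside the branch the coalesced ranks are 0, 1, 2, ... while the lane indices are 1, 3, 5, ...

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void coalescedRankDemo(int *out)
{
    int f = threadIdx.x;        // stand-in for the real payload
    int warpLeader = 3;         // a lane index on [0, 32)

    if (threadIdx.x % 2 == 1)   // stand-in for the data-dependent branch
    {
        cg::coalesced_group active = cg::coalesced_threads();
        // shfl() takes a rank within 'active', i.e. on [0, active.size()),
        // not a lane index. Passing warpLeader here broadcasts from the
        // 4th participating thread (lane 7), not from lane 3 (which, for a
        // data-dependent branch, might not even be in the group at all).
        f = active.shfl(f, warpLeader);
    }
    out[threadIdx.x] = f;
}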
(Also, based on the truncated comment conversation in the linked question, it seems unclear whether coalesced_group is just a nice wrapper for __activemask() or not anyway ...)
It is possible to iteratively build up a mask using __ballot_sync, as described in the linked question, but for code similar to the above that can become pretty tedious; a sketch of what I mean is below. Is this our only way forward for CUDA 9.0 and later?
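For reference, this is roughly the shape of the __ballot_sync version I have in mind (again just a sketch re-using the names from above; the restructuring is the tedious part):

// Every thread in the warp has to reach the ballot, so the loop and
// branch conditions have to be hoisted out of the divergent control
// flow and re-evaluated as predicates.
for (int i = k; /* no per-thread exit here */; i += skip)
{
    bool inLoop = (i < stop);
    // Whole warp leaves together once nobody is still in the loop.
    if (!__any_sync(0xffffffffu, inLoop))
        break;

    bool takesIf = inLoop && (mem[threadIdx.x + i/2] == foo);
    unsigned shflMask = __ballot_sync(0xffffffffu, takesIf);
    if (takesIf)
    {
        // Only well-defined if warpLeader's thread is itself in shflMask.
        f = __shfl_sync(shflMask, f, warpLeader);
    }
}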