Cuda _sync functions, how to handle unknown thread mask?

Question

This question is about adapting to the change in semantics from lock step to independent program counters. Essentially, what can I change calls like int __all(int predicate); into for Volta?

For example, int __all_sync(unsigned mask, int predicate);

with semantics:

Evaluate predicate for all non-exited threads in mask and return non-zero if and only if predicate evaluates to non-zero for all of them.

The docs assume that the caller knows which threads are active and can therefore populate mask accurately.

a mask must be passed that specifies the threads participating in the call

I don't know which threads are active. This is in a function that is inlined into various places in user code. That makes one of the following attractive:

__all_sync(UINT32_MAX, predicate);
__all_sync(__activemask(), predicate);

The first is analogous to a case declared illegal at https://forums.developer.nvidia.com/t/what-does-mask-mean-in-warp-shuffle-functions-shfl-sync/67697, quoting from there:

For example, this is illegal (will result in undefined behavior for warp 0):

if (threadIdx.x > 3) __shfl_down_sync(0xFFFFFFFF, v, offset, 8);

For the second choice, this time quoting from "__activemask() vs __ballot_sync()":

The __activemask() operation has no such reconvergence behavior. It simply reports the threads that are currently converged. If some threads are diverged, for whatever reason, they will not be reported in the return value.

The operating semantics appear to be:

There is a warp of N threads
M (M <= N) threads are enabled by compile time control flow
D (D subset of M) threads are converged, as a runtime property
__activemask returns which threads happen to be converged

That suggests synchronising threads then using activemask,

__syncwarp();
__all_sync(__activemask(), predicate);

An NVIDIA blog post says that is also undefined: https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/

Calling the new __syncwarp() primitive at line 10 before __ballot(), as illustrated in Listing 11, does not fix the problem either. This is again implicit warp-synchronous programming. It assumes that threads in the same warp that are once synchronized will stay synchronized until the next thread-divergent branch. Although it is often true, it is not guaranteed in the CUDA programming model.

That marks the end of my ideas. That same blog concludes with some guidance on choosing a value for mask:

Don’t just use FULL_MASK (i.e. 0xffffffff for 32 threads) as the mask value. If not all threads in the warp can reach the primitive according to the program logic, then using FULL_MASK may cause the program to hang.

Don’t just use __activemask() as the mask value. __activemask() tells you what threads happen to be convergent when the function is called, which can be different from what you want to be in the collective operation.

Do analyze the program logic and understand the membership requirements. Compute the mask ahead based on your program logic.
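The last piece of guidance can be illustrated with a minimal sketch. It assumes a single warp of 32 threads in which only the first n lanes take a branch; the kernel name all_positive_partial and the host helper run_all_positive are hypothetical, introduced only for illustration. The mask is computed with __ballot_sync at a point that every lane reaches, and then passed to __all_sync inside the branch.

```cuda
#include <cuda_runtime.h>

constexpr unsigned FULL_MASK = 0xffffffffu;

// Hypothetical kernel, launched with exactly one warp (32 threads).
// Only lanes with index < n take the branch. Assumes 1 <= n <= 32.
__global__ void all_positive_partial(const int *data, int n, int *result)
{
    const int lane = threadIdx.x;
    const bool participates = lane < n;

    // Every lane of the warp reaches this line, so FULL_MASK is legal here.
    // The returned mask names exactly the lanes that will enter the branch.
    const unsigned mask = __ballot_sync(FULL_MASK, participates);

    if (participates) {
        // Only the lanes named in mask execute this call, matching the mask.
        const int all_pos = __all_sync(mask, data[lane] > 0);
        if (lane == 0) {
            *result = all_pos;
        }
    }
}

// Hypothetical host helper: copies n values to the device, runs the kernel
// on one warp, and returns the vote result (non-zero if all values > 0).
int run_all_positive(const int *host_data, int n)
{
    int *d_data = nullptr;
    int *d_result = nullptr;
    int h_result = -1;

    cudaMalloc(&d_data, n * sizeof(int));
    cudaMalloc(&d_result, sizeof(int));
    cudaMemcpy(d_data, host_data, n * sizeof(int), cudaMemcpyHostToDevice);

    all_positive_partial<<<1, 32>>>(d_data, n, d_result);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_result);
    return h_result;
}
```

The point of the sketch is that the mask comes from the same condition that guards the branch, not from whatever lanes happen to be converged when the vote runs.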

However, I can't compute what the mask should be. It depends on the control flow at the call site that the code containing __all_sync was inlined into, which I don't know. I don't want to change every function to take an unsigned mask parameter.

How do I retrieve semantically correct behaviour without that global transform?

Robert Crovella · Accepted Answer · 2020-11-10T20:46:46.233

TL;DR: The correct programming approach will most likely be to do the thing you stated you don't want to do.

Longer:

The blog post you linked specifically suggests an opportunistic method for handling an unknown thread mask: precede the desired operation with __activemask() and use that for the desired operation. To wit (excerpting verbatim from the blog):

int mask = __match_any_sync(__activemask(), (unsigned long long)ptr);

That should be perfectly legal.

You might ask "what about item 2 mentioned at the end of the blog?" I think that if you read it carefully, taking into account the usage I just excerpted, it's suggesting "don't just use __activemask()" if you intend something different. That reading seems evident from the full text there. That doesn't abrogate the legality of the previous construct.

You might ask "what about incidental or enforced divergence along the way?" (i.e. during the processing of my function, which is called from elsewhere)

I think you have only 2 options:

1. Grab the value of __activemask() at entry to the function. Use it later when you call the sync operation you desire. That is your best guess as to the intent of the calling environment. CUDA doesn't guarantee that this will be correct; however, this should certainly be legal if you don't have enforced divergence at the point of your sync function call.
2. Make the intent of the calling environment clear: add a mask parameter to your function and rewrite the code everywhere (which you've stated you don't want to do).
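A minimal sketch of both options, under stated assumptions: the names all_of, all_of_guess, even_lanes_below and run_even_lanes_below are hypothetical and not part of any CUDA API, and the kernel assumes a launch with exactly one warp of 32 threads. all_of takes the mask explicitly (the second option); all_of_guess snapshots __activemask() on entry and forwards it (the first option), which is only a best guess at what the caller meant.

```cuda
#include <cuda_runtime.h>

constexpr unsigned FULL_MASK = 0xffffffffu;

// Second option: the caller states which lanes participate.
__device__ int all_of(unsigned mask, int predicate)
{
    return __all_sync(mask, predicate);
}

// First option: snapshot the converged lanes on entry and use that as the
// mask. This is a best guess at the caller's intent, not a guarantee.
__device__ int all_of_guess(int predicate)
{
    const unsigned mask = __activemask();
    return all_of(mask, predicate);
}

// Hypothetical call site: only even lanes vote, and the mask is computed
// from the same condition that guards the branch.
__global__ void even_lanes_below(int limit, int *result)
{
    const int lane = threadIdx.x;
    const bool even = (lane % 2) == 0;

    // All 32 lanes reach this line, so FULL_MASK is legal here.
    const unsigned mask = __ballot_sync(FULL_MASK, even);

    if (even) {
        const int r = all_of(mask, lane < limit);
        if (lane == 0) {
            *result = r;
        }
    }
}

// Hypothetical host helper: runs the kernel on one warp and returns the
// vote (non-zero if every even lane index is below limit).
int run_even_lanes_below(int limit)
{
    int *d_result = nullptr;
    int h_result = -1;

    cudaMalloc(&d_result, sizeof(int));
    even_lanes_below<<<1, 32>>>(limit, d_result);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_result);
    return h_result;
}
```

Keeping the guessing wrapper as a thin layer over the explicit version lets existing call sites keep compiling while new code passes a mask computed from its own control flow.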

There is no way to deduce the intent of the calling environment from within your function, if you permit the possibility of warp divergence prior to entry to your function, which obscures the calling environment intent. To be clear, CUDA with the Volta execution model permits the possibility of warp divergence at any time. Therefore, the correct approach is to rewrite the code to make the intent at the call site explicit, rather than trying to deduce it from within the called function.

Thank you. Spent a few days thinking about this. I think the explicit CFG, where the mask is computed on branches in application code, probably is required by volta. On pre-volta cards, I can set that same mask by activemask(), so the existing entry points can call into the new ones. — Jon Chesterfield, Nov 15 '20 at 11:36
