
Recently I posted this question, about a critical section. Here is a similar question. In those questions the given answer says that it is up to the compiler whether the code "works" or not, because the order of the various paths of execution is up to the compiler.

To elaborate the rest of the question I need the following excerpts from the CUDA Programming Guide:

  1. ... Individual threads composing a warp start together at the same program address, but they have their own instruction address counter and register state and are therefore free to branch and execute independently....
  2. A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path, and when all paths complete, the threads converge back to the same execution path....
  3. The execution context (program counters, registers, etc.) for each warp processed by a multiprocessor is maintained on-chip during the entire lifetime of the warp. Therefore, switching from one execution context to another has no cost, and at every instruction issue time, a warp scheduler selects a warp that has threads ready to execute its next instruction (the active threads of the warp) and issues the instruction to those threads.

What I understand from these three excerpts is that threads can diverge freely from the rest, that all the branch possibilities will be serialized if there is divergence between threads, and that once a branch is taken it executes until completion. And that is why the questions mentioned above end in deadlock: the ordering of the execution paths imposed by the compiler results in taking a branch that doesn't get the lock.

Now the question is: shouldn't the compiler always put the branches in the order written by the user? Is there a high-level way to enforce the order? I know the compiler can optimize, reorder instructions, etc., but it should not fundamentally change the logic of the code (yes, there are exceptions, like some memory accesses without the volatile keyword, but that is why the keyword exists: to give control to the user).
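
For context, the per-thread critical section that those linked questions analyze looks roughly like the sketch below (a minimal illustration; the names `lock` and `criticalSectionKernel` and the protected work are mine, not taken from the linked posts). Whether it hangs depends on which of the two divergent paths the generated SASS happens to execute first, which is exactly the ordering this question is about.

```cuda
// Hypothetical sketch of the classic per-thread critical section pattern.
__device__ int lock = 0;   // 0 = free, 1 = taken

__global__ void criticalSectionKernel(int *data)
{
    // Spin until this thread wins the lock.
    while (atomicCAS(&lock, 0, 1) != 0) { }

    // Critical section: only the thread holding the lock executes this.
    *data += 1;

    __threadfence();        // make the update visible before releasing
    atomicExch(&lock, 0);   // release the lock
}
```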


Edit

The main point of this question is not about critical sections; it is about the compiler. For example, in the first link, a compilation flag drastically changes the logic of the code: one version "works" and the other doesn't. What bothers me is that the reference only says to be careful, and nothing about undefined behaviour from the nvcc compiler.

fabian_mc
  • I really don't understand what you are trying to ask. There is no such thing as "compiler serialization"; serialization is a warp-level runtime phenomenon. The compiler just compiles code as if it were single-threaded – talonmies Jan 05 '19 at 21:16
  • The compiler knows nothing about a warp (or the existence of multiple threads, even). A given thread follows only one path through the code. What you are asking about is what is the execution order of one thread with respect to another, and 1. The compiler knows nothing about that 2. CUDA provides no guarantees of thread execution order. If you write an if-condition inside of a while statement body, the compiler will certainly arrange to have the while boolean condition tested before the if boolean condition. Beyond that, not much can be said. You're trying to build a house on sand. – Robert Crovella Jan 05 '19 at 21:17
  • You've been given the advice already to use a threadblock-level critical section negotiation, such as [here](https://stackoverflow.com/questions/18963293/cuda-atomics-change-flag/18968893#18968893). archaeasoftware, a CUDA expert, effectively is recommending [the same thing](https://books.google.com/books?id=KUxsAQAAQBAJ&pg=PA153&lpg=PA153&dq=cuda+handbook+critical+section+global+lock&source=bl&ots=Zks9-6Pwq-&sig=z4f_VQttwXEBWU1BMHm500aRY1A&hl=en&sa=X&ved=2ahUKEwi6wYL7yNffAhXPGTQIHe1TAwEQ6AEwA3oECAQQAQ#v=onepage&q=cuda%20handbook%20critical%20section%20global%20lock&f=false). – Robert Crovella Jan 05 '19 at 21:21
  • @talonmies you are right, it was a poor choice of words in that case; I was referring to the order of the execution paths, as logically written by the user. – fabian_mc Jan 05 '19 at 21:22
  • Quoting from the cuda handbook linked in my previous comment: "Expected usage is for one thread in each block to acquire the spinlock, otherwise the divergent code execution tends to deadlock" (a sketch of that block-level pattern follows these comments). A fully analyzed (down to SASS execution level) example of how deadlock may arise is already given in the [question you linked](https://stackoverflow.com/questions/31194291/cuda-mutex-why-deadlock) – Robert Crovella Jan 05 '19 at 21:25
  • @RobertCrovella this question is not particularly about that, from your previous answers I know that I should not try thread level critical sections. The question is about why the compiler changes the behaviour so drastically. – fabian_mc Jan 05 '19 at 21:26
  • @RobertCrovella but the "no guarantees", does that refer only to runtime or not? – fabian_mc Jan 05 '19 at 21:30
  • There is no such thing as compiler controlled order of execution either. That is completely determined at runtime. If you are asking about thread level branch execution order, that is controlled by the compiler, but not documented and not programmer controllable. I still have no idea what you are trying to ask here, but I suspect the answer is a simple combination of "no", "unknown", "undefined", "not possible", "doesn't happen" and "can't be done". – talonmies Jan 05 '19 at 21:35
  • Ok sorry, it may be bad English from me. What I mean by execution order is the instruction order. Look at the second link. – fabian_mc Jan 05 '19 at 21:40
  • Let's take your first question: "shouldn't the compiler always put the branches in the order written by the user?" Where is that written? If I have a compiler, and give it an if-then-else statement to compile, where is it written that the compiler must first test for true, and if true jump to the then-body, before it tests for false and jumps to the else-body? In the context of a single thread, that is all irrelevant, and I'm unaware of any language rules that enforce low-level code behavior like this on the compiler. If you know of one, please point it out. – Robert Crovella Jan 05 '19 at 21:42
  • From the SASS code provided in your answer, the compiler did put the branch instruction for the while later. I am not trying to say that it should try true first; what I am trying to say is that it should not be undefined behavior, in the sense that it is conditional on how the compiler decides to order the instructions, one version "working" and the other not. – fabian_mc Jan 05 '19 at 21:58
  • What bothers me is not that my code doesn't work, it is that the compiler produces highly different results for the same code. – fabian_mc Jan 05 '19 at 22:01
  • Yes, it's true that slight changes in low-level instruction ordering, which are essentially meaningless from the standpoint of a single thread, can have very different results in a multi-threaded environment. This is particularly true in the CUDA lock-step warp execution model. It's why there are probably dozens of questions about acquiring locks and critical sections on the CUDA tag. The CUDA programming model has recently introduced cooperative groups which are intended to be a more robust abstraction for controlling threads that are cooperating. The volta execution model also helps here – Robert Crovella Jan 05 '19 at 22:10
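
A minimal sketch of the block-level negotiation recommended in the comments above (illustrative only; identifiers are mine): one thread per block contends for the global lock while the rest of the block waits at a barrier, so only one thread per warp is spinning, which avoids the intra-warp deadlock the linked questions describe.

```cuda
// Hypothetical block-level critical section: only thread 0 of each block
// contends for the lock, the other threads wait at __syncthreads().
__device__ int blockLock = 0;

__global__ void blockCriticalSection(int *data)
{
    if (threadIdx.x == 0) {
        while (atomicCAS(&blockLock, 0, 1) != 0) { }  // one thread per block spins
    }
    __syncthreads();            // whole block waits until the lock is held

    if (threadIdx.x == 0) {
        *data += 1;             // the protected work (placeholder)
    }

    __syncthreads();            // make sure the work is finished
    if (threadIdx.x == 0) {
        __threadfence();              // publish the update
        atomicExch(&blockLock, 0);    // release the lock
    }
}
```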

1 Answer


I believe the order of execution is not set, nor guaranteed, by the CUDA compiler. It's the hardware that sets it - as far as I can recall.

Thus,

shouldn't the compiler always put the branches in the order written by the user?

It doesn't control execution order anyway.

is there a high level way to enforce the order?

Just the synchronization instructions like __syncthreads().
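
For example, a trivial sketch of the ordering __syncthreads() does guarantee (within one block; this assumes a block of at most 256 threads):

```cuda
__global__ void orderedKernel(int *out)
{
    __shared__ int s[256];           // assumes blockDim.x <= 256

    s[threadIdx.x] = threadIdx.x;    // phase 1: every thread writes its own slot
    __syncthreads();                 // all phase-1 writes complete before any thread continues

    // phase 2: now it is safe to read a slot written by another thread
    out[threadIdx.x] = s[(threadIdx.x + 1) % blockDim.x];
}
```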

The compiler... should not fundamentally change the logic of the code

The semantics of CUDA code is not the same as for C++ code... sequential execution of if branches is not part of the semantics.

I realize this answer may not be satisfying to you, but that's how things stand, for better or for worse.

einpoklum
  • But aren't the branches serialized in execution (in the sense that when one branch is taken and finishes, the other branch will execute)?... What bothers me is that the SASS code changes; for example in this [question](https://stackoverflow.com/questions/54032663/nvcc-g-device-debug-flag-breaks-cuda-code-with-critical-section), a compilation flag makes the code deadlock, but without it the code "works". Shouldn't the compiler be consistent? Or at least shouldn't there be a standard that says it's undefined behavior? – fabian_mc Jan 10 '19 at 15:27
  • @fabian_mc: It's a useful metaphor to think of the branches as being serialized, but I don't recall it's guaranteed. Just think that if an SM could execute instructions of half-warps (while synching as necessary with the other half), then if only lanes in the first half warp take the "then" branch and only threads in the other half take the "else" branch, the execution could happen in parallel. – einpoklum Jan 11 '19 at 00:17
  • Also: A compiler should be consistent only in the sense that if you compile the same code with the same version of the compiler and the same settings twice you should get the same compiled code. Change any of these and a different result is not inherently inconsistent. – einpoklum Jan 11 '19 at 00:18
  • Ok, the parallel execution of branches is not possible according to the programming guide (check paragraph 3 above, it talks about scheduling a warp with active threads); also check the section on [independent thread scheduling](https://devblogs.nvidia.com/inside-volta/), where it explicitly says that divergent branches are serialized... What you are describing for the compiler, wouldn't that just mean that it is deterministic? – fabian_mc Jan 11 '19 at 12:21
  • @fabian_mc: Paragraph 3 does sound rather clear cut. However, at the link you gave to the Volta white paper, look at the [diagram](https://devblogs.nvidia.com/wp-content/uploads/2017/05/image3.png) for a single SM. Do you see how it has four "sub-SM"s? And how every one of them can schedule 16 lanes for an integer operation or 8 for a floating-point operation? – einpoklum Jan 11 '19 at 14:21
  • There are two diagrams, pre-Volta and Volta; in the pre-Volta diagram the branches are completely serialized. – fabian_mc Jan 11 '19 at 22:20