In asm volatile inline PTX instructions, why also specify "memory" side effects?

Question

Consider the following excerpt from CUDA's Inline PTX Assembly guide (v10.2):

The compiler assumes that an asm() statement has no side effects except to change the output operands. To ensure that the asm is not deleted or moved during generation of PTX, you should use the volatile keyword, e.g.:
asm volatile ("mov.u32 %0, %%clock;" : "=r"(x));
Normally any memory that is written to will be specified as an out operand, but if there is a hidden side effect on user memory (for example, indirect access of a memory location via an operand), or if you want to stop any memory optimizations around the asm() statement performed during generation of PTX, you can add a "memory" clobbers specification after a 3rd colon...

It sounds like both volatile and :: "memory" are intended to indicate side effects in memory. Now, granted, there could be non-memory side effects (like for trap;). But when I've already used volatile, isn't it useless/meaningless to also specify :: "memory"?

(Slightly related: When using inline PTX asm() instructions, what does 'volatile' do?)

I think this is a duplicate of [How can I indicate that the memory \*pointed\* to by an inline ASM argument may be used?](https://stackoverflow.com/q/56432259). They're talking about memory that *isn't* an `"=m"` output operand, merely pointed-to by an input pointer operand. — Peter Cordes, Apr 29 '20 at 10:10
Also note that a `"memory"` clobber does *not* stop an asm statement from optimizing away if none of its explicit output operands are used, so you do need `volatile`. (Or with no `"=..."` operands it's implicitly volatile in GNU C, IDK about CUDA). So a non-`volatile` asm statement with a memory clobber has to be assumed to modify any reachable memory at that point in the C abstract machine, but only *if* it's not optimized away, and that decision is only based on its explicit operands being used. non-volatile asm can also be hoisted out of loops if the inputs are the same (pure function) — Peter Cordes, Apr 29 '20 at 10:14
@PeterCordes: Not at all. This question is about inline PTX, which does not necessarily follow the same rules as inline assembly in GCC. However, it is possible that some part of the answer to the question you linked to is relevant here; as well as your comment. Still - a person with my question would absolutely not guess by the title and body of the linked-to question that they may find their answer there. — einpoklum, Apr 29 '20 at 10:18
It's pretty clear that CUDA uses identical syntax to GNU C inline asm (which is portable to all ISAs including GPUs). It's also clear to me that their description exactly matches the situation for normal C, where you need to communicate the inputs and outputs to the compiler in precise details, and a memory operand not listed as an input or output constraint requires a `"memory"` clobber or other workaround. You need the compiler to assume that memory may have been read or written, and spill/reload vars from registers around such an asm statement. — Peter Cordes, Apr 29 '20 at 10:22
The second comment from @PeterCordes is spot on. They are directives to different stages of the compilation trajectory and are orthogonal — talonmies, Apr 29 '20 at 10:23
Doesn't CUDA usually use LLVM? It would make sense that the same compiler uses the same inline asm syntax and semantics. — Peter Cordes, Apr 29 '20 at 10:23
@talonmies: note that this question is asking the reverse: it recognizes that `volatile` is necessary, but is asking why that doesn't imply a `"memory"` clobber or something. Oh, on more careful reading, I see the mixup. `volatile` does *not* indicate anything about affecting memory, just that it's not a pure function of the inputs. — Peter Cordes, Apr 29 '20 at 10:25
@talonmies: expanded my comments into an answer. I know next to nothing about CUDA, but this `asm` syntax and its semantics were designed for C in the first place so that's maybe a better way to explain it anyway. But it'd be good to have someone who knows any CUDA take a look in case any of my explanation is misleading or wrong for CUDA. edits welcome. — Peter Cordes, Apr 29 '20 at 11:06

Peter Cordes · Accepted Answer · 2020-04-30T20:10:32.417

A non-volatile inline asm statement is treated as a pure function of its inputs: it gives the same output every time it runs with the same explicit inputs.

Separately, without a "memory" clobber, an asm statement is assumed not to read or write anything that hasn't been mentioned as an input or output operand.

It sounds like both volatile and :: "memory" are intended to indicate side effects in memory.

No, volatile just means that the output operands are not a pure function of the input operands. A "memory" clobber is mostly orthogonal and is not implied by volatile.

The example you quoted appears to read a cycle counter (%%clock), which needs to re-execute every time; otherwise the compiler could apply common-subexpression elimination (CSE) and hoist it out of a loop. You wouldn't want that to force the compiler to spill/reload any global vars it had in registers. volatile doesn't imply memory side effects, so it's just the ticket for this use case.
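A minimal sketch of that use case, written in GNU C rather than CUDA on the assumption (made throughout this answer) that the semantics match. The empty template stands in for an instruction such as the clock read, and `sample_twice` and `g_total` are hypothetical names used only for illustration. `__asm__` is the GNU spelling of `asm` that also works in strict ISO modes.

```c
/* Hypothetical global, used only to show what volatile does NOT affect. */
static unsigned g_total;

/* Both asm statements are volatile, so neither is deleted nor merged with
   the other, even though they start from the same value. There is no
   "memory" clobber, so the compiler stays free to keep g_total's value in
   a register or to move the store across the asm statements. */
unsigned sample_twice(unsigned seed)
{
    unsigned a = seed, b = seed;
    g_total = seed;
    __asm__ volatile("" : "+r"(a));  /* a real template would read a counter */
    __asm__ volatile("" : "+r"(b));
    return a + b + g_total;
}
```

Because the templates are empty, the values pass through unchanged; the point is only which optimizations the compiler is still allowed to make.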

It would still be a bug for the asm template to read or write any other variables behind the compiler's back (not via explicit "m", "=m", or "+m" operands) because volatile doesn't imply a "memory" clobber.

In GNU C inline asm, even an "r"(pointer_variable) operand does not imply that the pointed-to data is read or written. For example, an assignment can be optimized away as a dead store if all you do with the variable is pass a pointer to it as an input to an asm statement without a "memory" clobber. See: How can I indicate that the memory *pointed* to by an inline ASM argument may be used? (https://stackoverflow.com/q/56432259)
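One way to sketch the fix in GNU C (assumed here to carry over to CUDA inline PTX): pass the pointer as an input and add a "memory" clobber so the compiler treats the pointed-to object as possibly read or written. The function name `publish` is hypothetical, and the empty template stands in for an instruction that actually dereferences the pointer.

```c
/* The asm statement receives only the pointer value in a register.
   Without the "memory" clobber, the compiler could assume the asm neither
   reads nor writes slot, and keep value in a register without ever
   storing it to slot's memory before the asm runs. */
int publish(int value)
{
    int slot = value;
    int *p = &slot;
    /* Empty template standing in for an instruction that dereferences p. */
    __asm__ volatile("" : : "r"(p) : "memory");
    return slot;  /* read after the asm, which may have written through p */
}
```

The linked question also covers narrower alternatives, such as describing the pointed-to object with an explicit memory operand instead of clobbering all memory.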

A "memory" clobber will get the compiler to assume that any globally-reachable memory (or reachable via pointer inputs) may have been read or written, and thus spill/reload vars from registers around such an asm statement. (Unless escape analysis can prove that nothing else could have a pointer to them, i.e. that a pointer to the var hasn't "escaped" the local scope. Just like how compilers decide they can keep a var in a register across a non-inline function call.)

So is "memory" alone safe without volatile? No.

A "memory" clobber does not stop an asm statement from being optimized away if none of its explicit output operands are used. (With no "=..." output operands, an asm statement is implicitly volatile in GNU C.)

A non-volatile asm statement with a memory clobber has to be assumed to modify any reachable memory at that point in the abstract machine if/when the asm template string executes, but the compiler is still free to make transformations that result in that not happening at all, or happening less often than the source suggests. (e.g. hoist it out of a loop if the other vars that change in the loop are all locals whose address hasn't escaped the function.)

A non-volatile asm statement is still assumed to be a pure function wrt. its explicit inputs and outputs, so asm("..." : "=r"(out) : "r"(in) : "memory"); could be hoisted out of a loop if the loop used the same "in" every iteration. (This could only happen if the loop variables were all locals which the asm statement couldn't have a pointer to (escape analysis like for a non-inline function call). Otherwise the "memory" clobber would block that reordering.)

Or it could be optimized away entirely if all uses of "out" can be optimized away, regardless of any memory accesses around the statement. The decision is based only on the explicit operands if you omit volatile.
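The contrast between the two forms can be sketched in GNU C, under the same assumption that CUDA behaves alike. The matching constraint "0" places the input in the output's register, so the empty template acts as an identity; `maybe_runs` and `always_runs` are hypothetical names.

```c
/* Not volatile: treated as a pure function of in. The compiler may merge
   calls with equal inputs, hoist the statement out of a loop, or delete it
   when out is unused, despite the "memory" clobber. */
unsigned maybe_runs(unsigned in)
{
    unsigned out;
    __asm__("" : "=r"(out) : "0"(in) : "memory");
    return out;
}

/* volatile: the statement is kept even if out is unused and is not merged
   with other instances. The "memory" clobber additionally makes the
   compiler assume reachable memory may have been read or written. */
unsigned always_runs(unsigned in)
{
    unsigned out;
    __asm__ volatile("" : "=r"(out) : "0"(in) : "memory");
    return out;
}
```

Both functions return their argument here because the template is empty. With a real instruction in the template, only `always_runs` would be guaranteed to execute it once per call.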

There aren't many use cases for a "memory" clobber without volatile; you could imagine using it to describe a function that internally uses a cache to memoize results. The compiler can run it as often or as infrequently as it wants, and we don't actually care whether the internal cache got mutated or not. It's a side effect but not a valuable side effect.

(I'm assuming that CUDA inline asm has identical semantics to GNU C inline asm as supported/implemented by Clang/LLVM and by GCC. From the quote that certainly appears to be the case. I don't really know anything about CUDA so everything I said above is based on GNU C inline asm, because CUDA asm appears to be identical. Correct me if I'm wrong, e.g. if asm statements with no output operands are not implicitly volatile or if CUDA doesn't have pointers.

Since GNU C inline asm syntax was designed for C and later repurposed for CUDA instead, it may help your understanding of the design to think in terms of C including pointers and escape analysis.)
