Summary: DoNotOptimize
is ordered wrt. time()
by the the "memory"
clobbers, as if it were another function call to an opaque function that could modify any global state.
DoNotOptimize
is ordered wrt. the computation of output from input by the data dependency of the calculation on the input, and the output on the calculation, as Chandler Carruth explained in the Q&A you linked. The "memory"
clobber is irrelevant for this part.
"memory"
clobber is like a non-inline function call
DoNotOptimize
's asm
statement contains a "memory"
clobber. As far as the optimizer is concerned, that's equivalent to an opaque function call: it has to be assumed to read and write every globally-reachable object1. (Even ones this compilation unit might not know about.)
Since time()
itself doesn't have an inline definition in any header, it can't reorder with DoNotOptimize
at compile time for the same reason that a compiler can't reorder calls to foo()
and bar()
when it can't see the definitions of those functions. Same reason compilers don't need any special logic to stop them from reordering puts("hi"); puts("mom");
.
(A hypothetical time()
that could inline and only contained an asm
statement would have to use asm volatile
to make sure repeated calls didn't just use the first one's output. asm volatile
statements can't reorder with each other or accesses to volatile
variables, so that would be ok too, for a different reason.)
Footnote 1: Globally reachable = any object that might be pointed-to by any hypothetical global variable. i.e. anything except local variables within this function, or memory freshly allocated with new
, if escape analysis can prove that nothing outside this function could have pointers to them.
How the asm
statement works
I think you're pretty seriously misunderstanding how the asm works. "+r,m"
tells the compiler to materialize the value in a register (or memory if it wants), and then use the value there at the end of the (empty) asm template as the new value of that C++ object.
So it forces the compiler to actually materialize (produce) the value somewhere, which means it has to be computed. And it means has to forget what it previously knew about the value (e.g. that it was a compile time constant 5, or non-negative, or anything) because the "+"
modifier declares a read/write operand.
The point of DoNotOptimize
on the input is to defeat constant-propagation that would let the benchmark optimize away.
And on the output to make sure a final result is actually materialized in a register (or memory) instead of optimizing away all the computation leading to an unused result. (This is where being asm volatile
is relevant; defeating constant-propagation still works with non-volatile asm
.)
So the computation you want to benchmark has to happen between the two DoNotOptimize()
statements, and separately those two statements can't reorder with time()
.
The compiler has to assume that the asm statement modifies the value like val ^= random
for all it knows, along with changing the value in memory of any/every other object except for private locals that weren't operands, so e.g. the "memory"
clobber doesn't stop the compiler from keeping a local loop counter in memory. (It doesn't special case an empty asm template string here; programs don't contain asm statements like this by accident so nobody wants them optimized away.)
Misconceptions about the reference arg and picking "m"
I only got part way into the details of your attempt to reason about the "+r,m"
operand and the reference function-arg before deciding it would probably be better to just explain from scratch. The correct reason isn't that complicated. But a couple things are worth specifically correcting:
The C++ function containing the asm
statement can inline, letting the by-reference function arg optimize away. (It's even declared inline __attribute__((always_inline))
to force inlining even with optimization disabled, although in that case the reference variable won't optimize away.)
The net result is as if the asm statement were used directly on the C++ variable passed to DoNotOptimize. e.g. DoNotOptimize(foo)
is like asm volatile("" : "+r,m"(foo) :: "memory")
The compiler can always pick register if it wants to, e.g. choosing to load a variable's value into a register before an asm
statement. (And if the C++ semantics demand updating the variable's value in memory, also emitting a store instruction after the asm statement.)
For example, we can see that GCC does choose to do that. (I guess I could have used incl %0
as the example, but I just chose nop
as a way to show what the compiler picked for the operand location as an alternative to # %0
pure comment, so the Godbolt compiler explorer wouldn't filter it out.)
void foo(int *p)
{
asm volatile("nop # operand picked %0" : "+r,m" (p[4]) );
}
# GCC 11.2 -O2
foo(int*):
movl 16(%rdi), %eax
nop # operand picked %eax
movl %eax, 16(%rdi)
ret
vs. clang choosing to leave the value in memory, so every instruction in the asm template would be accessing memory instead of a register. (If there were any instructions).
# clang 12.0.1 -O2 -fPIE
foo(int*): # @foo(int*)
nop # operand picked 16(%rdi)
retq
Fun fact: "r,m"
is an attempt to work around a clang missed-optimization bug that makes it always pick memory for "rm"
constraints, even if the value was already in a register. Spilling it first, even if it has to invent a temporary location for the value of an expression as an input.