Understanding volatile asm vs volatile variable

Question

Consider the following program, which just times a loop:

#include <cstdlib>

std::size_t count(std::size_t n)
{
#ifdef VOLATILEVAR
    volatile std::size_t i = 0;
#else
    std::size_t i = 0;
#endif
    while (i < n) {
#ifdef VOLATILEASM
        asm volatile("": : :"memory");
#endif
        ++i;
    }
    return i;
}

int main(int argc, char* argv[])
{
    return count(argc > 1 ? std::atoll(argv[1]) : 1);
}

For readability, the version with both the volatile variable and the volatile asm reads as follows:

#include <cstdlib>

std::size_t count(std::size_t n)
{
    volatile std::size_t i = 0;
    while (i < n) {
        asm volatile("": : :"memory");
        ++i;
    }
    return i;
}

int main(int argc, char* argv[])
{
    return count(argc > 1 ? std::atoll(argv[1]) : 1);
}

Compilation under g++ 8 with g++ -Wall -Wextra -g -std=c++11 -O3 loop.cpp -o loop gives roughly the following timings:

default: 0m0.001s
-DVOLATILEASM: 0m1.171s
-DVOLATILEVAR: 0m5.954s
-DVOLATILEVAR -DVOLATILEASM: 0m5.965s

My question is: why is that? The default version is expected, since the compiler optimizes the loop away. But I have a harder time understanding why -DVOLATILEVAR is so much slower than -DVOLATILEASM, since both should force the loop to run.

Compiler explorer gives the following count function for -DVOLATILEASM:

count(unsigned long):
  mov rax, rdi
  test rdi, rdi
  je .L2
  xor edx, edx
.L3:
  add rdx, 1
  cmp rax, rdx
  jne .L3
.L2:
  ret

and for -DVOLATILEVAR (and the combined -DVOLATILEASM -DVOLATILEVAR):

count(unsigned long):
  mov QWORD PTR [rsp-8], 0
  mov rax, QWORD PTR [rsp-8]
  cmp rdi, rax
  jbe .L2
.L3:
  mov rax, QWORD PTR [rsp-8]
  add rax, 1
  mov QWORD PTR [rsp-8], rax
  mov rax, QWORD PTR [rsp-8]
  cmp rax, rdi
  jb .L3
.L2:
  mov rax, QWORD PTR [rsp-8]
  ret

What is the exact reason for that? Why does the volatile qualification of the variable prevent the compiler from generating the same loop as the one with asm volatile?

`volatile` in C and C++ basically means "don't optimize away this variable and don't optimize away loads and stores to this variable", so in short "disable optimizations for this one" and not really much else. In my opinion, use of `volatile` is best avoided and in 98+% of the cases I've ever seen it used it has been used incorrectly based on some false assumption that it also means "atomic" or something else it is not.. Seeing it in code reviews is a huge red flag. — Jesper Juhl, Jun 19 '18 at 19:35
Possible duplicate of [The difference between asm, asm volatile and clobbering memory](https://stackoverflow.com/questions/14449141/the-difference-between-asm-asm-volatile-and-clobbering-memory) — Ripi2, Jun 19 '18 at 19:36
Maybe relevant: https://www.kernel.org/doc/html/v4.10/process/volatile-considered-harmful.html — Jesper Juhl, Jun 19 '18 at 20:17
@JesperJuhl I do not understand the *best avoided* phrase in your context. `volatile` is a tool, and there are cases when this tool is mandated. When the `volatile` is used when it is not mandated, it's a bug, so is when it's not used when it has to. That's not comparative (best/worst), but binary - bug or not a bug. — SergeyA, Jun 19 '18 at 20:27
@JesperJuhl If the CPU makes something atomic, then volatile may mean atomic on the right datatype. In any case, it makes the assembly code more predictable, which has a lot of practical uses. — curiousguy, Jun 24 '18 at 02:43
@SergeyA I believe "best avoided" is a way (inappropriate way IMO) of saying that a tool is soooo often misused, and soooo rarely used appropriately, if you are using it, you are "probably" misusing it (not a true mathematical probability). Indeed, many people have used volatile in multithread code to do things volatile can't do; that doesn't imply that volatile cannot be used in a MT program. — curiousguy, Jul 09 '18 at 00:49

NathanOliver · Accepted Answer · 2018-06-19T20:29:29.987

3

When you make i volatile you tell the compiler that something it doesn't know about can change its value. That means it is forced to load its value every time you use it and to store it every time you write to it. When i is not volatile the compiler can optimize that synchronization away.
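As a minimal sketch of that difference (the function names below are hypothetical and not from the question), the two counters can be put side by side. Both return the same value; what differs is only where the compiler is allowed to keep `i` while the loop runs:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names, chosen for illustration only.

// Non-volatile counter: the compiler is free to keep `i` in a register.
// The empty asm statement (a GNU C extension, gcc/clang) only stops the
// loop from being optimized away.
std::size_t count_in_register(std::size_t n)
{
    std::size_t i = 0;
    while (i < n) {
        asm volatile("" ::: "memory");
        i = i + 1;
    }
    return i;
}

// Volatile counter: every read and every write of `i` in the source has to
// show up as a real load or store in the generated code.
std::size_t count_in_memory(std::size_t n)
{
    volatile std::size_t i = 0;
    while (i < n) {   // load
        i = i + 1;    // load, add, store
    }
    return i;         // load
}

// Sanity check: the observable result is identical, only the cost differs.
void check_counters_agree(std::size_t n)
{
    assert(count_in_register(n) == count_in_memory(n));
}
```

Comparing the `-O3` output of the two functions on a compiler explorer should show the extra loads and stores in the second one, along the lines of the listings in the question.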

edited Jun 19 '18 at 20:29

answered Jun 19 '18 at 19:37

Addition - it is also required to commit it every time you write it. – SergeyA Jun 19 '18 at 20:28
In particular, when the variable is not `volatile` it can be left in a register, so the compiler can avoid accessing the memory for each iteration. – Matteo Italia Jun 19 '18 at 21:08

Peter Cordes · Answer 2 · 2018-06-20T13:49:10.503

1

-DVOLATILEVAR forces the compiler to keep the loop counter in memory, so the loop bottlenecks on the latency of a store/reload (store forwarding, ~5 cycles) plus the 1-cycle latency of an add.

Every assignment to and read from the volatile i is considered an observable side effect of the program that the optimizer has to make happen in memory, not just in a register. This is what volatile means.

There's also a reload for the compare, but that's only a throughput issue, not a latency one. The ~6-cycle loop-carried data dependency means your CPU doesn't bottleneck on any throughput limits.
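To measure this from inside the program rather than with `time`, a rough sketch could look like the following. The helper names `count_with`, `Timing` and `time_count` are assumptions made up for this example; only the loop body mirrors the question. The counter type is a template parameter, so the same loop can be instantiated with `std::size_t` and with `volatile std::size_t`:

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical helpers for illustration; not part of the original program.

// Counter is either std::size_t (may live in a register) or
// volatile std::size_t (must be loaded and stored on every access).
template <class Counter>
std::size_t count_with(std::size_t n)
{
    Counter i = 0;
    while (i < n) {
        asm volatile("" ::: "memory");  // GNU C: keeps the loop from being deleted
        i = i + 1;
    }
    return i;
}

struct Timing {
    std::size_t result;  // final counter value, expected to equal n
    double seconds;      // wall-clock time spent in the loop
};

template <class Counter>
Timing time_count(std::size_t n)
{
    const auto start = std::chrono::steady_clock::now();
    const std::size_t result = count_with<Counter>(n);
    const auto stop = std::chrono::steady_clock::now();
    assert(result == n);  // the loop must have run exactly n times
    return Timing{result, std::chrono::duration<double>(stop - start).count()};
}
```

Calling `time_count<std::size_t>(n)` and `time_count<volatile std::size_t>(n)` with a large `n` and dividing the two `seconds` values gives a ratio to compare against the roughly 1-cycle versus ~6-cycle estimate above. The exact ratio depends on the CPU, so treat any single measurement as indicative only.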

This is similar to what you'd get from -O0 compiler output, so have a look at my answer on Adding a redundant assignment speeds up code when compiled without optimization for more about loops like that, and x86 store-forwarding.

With only VOLATILEASM, the empty asm template ("") has to run the right number of times. Being empty, it doesn't add any instructions to the loop, so you're left with a 2-uop add / cmp+jne loop that can run at 1 iteration per clock on modern x86 CPUs.

Critically, the loop counter can stay in a register, despite the compiler memory barrier. A "memory" clobber is treated like a call to a non-inline function: it might read or modify any object that it might possibly have a reference to, but that does not include local variables that have never had their address escape the function. (i.e. we never called sscanf("0", "%zu", &i) or posix_memalign(&i, 64, 1234). But if we did, then the "memory" barrier would have to spill / reload it, because an external function could have saved a pointer to the object.)

i.e. a "memory" clobber is only a full compiler barrier for objects that could possibly be visible outside the current function. This is really only an issue when messing around and looking at compiler output to see what barriers do what, because a barrier can only matter for multi-threading correctness for variables that other threads could possibly have a pointer to.
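The escape rule can be sketched with two otherwise identical loops. The global pointer `g_escaped` and both function names are hypothetical, introduced only so that the address of the local has somewhere to escape to:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical global: publishing &i here means code the compiler cannot
// see (including an asm statement with a "memory" clobber) could reach i.
std::size_t* g_escaped = nullptr;

// The address of i is never taken, so the barrier cannot touch it and the
// compiler may keep i in a register across the asm statement.
std::size_t count_private(std::size_t n)
{
    std::size_t i = 0;
    while (i < n) {
        asm volatile("" ::: "memory");
        ++i;
    }
    return i;
}

// The address of i has escaped through g_escaped, so at every barrier the
// value in memory has to match the abstract machine: store before the asm,
// reload after it.
std::size_t count_escaped(std::size_t n)
{
    std::size_t i = 0;
    g_escaped = &i;
    while (i < n) {
        asm volatile("" ::: "memory");
        ++i;
    }
    g_escaped = nullptr;  // do not leave a dangling pointer behind
    assert(i == n);
    return i;
}
```

With gcc at `-O3` one would expect `count_private` to compile to a register-only loop and `count_escaped` to store and reload the counter on every iteration, but the exact code depends on the compiler version.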

And BTW, your asm statement is already implicitly volatile because it has no output operands. (See Extended-Asm#Volatile in the gcc manual).

You can add a dummy output to make a non-volatile asm statement that the compiler can optimize away, but unfortunately gcc still keeps the empty loop after eliminating a non-volatile asm statement from it. If i's address has escaped the function, removing the asm statement entirely turns the loop into a single compare and jump over a store, right before the function returns. I think it would be legal to simply return without ever storing to that local, because there's no way a correct program can know that it managed to read i from another thread before i went out of scope.

But anyway, here's the source I used. As I said, note that there's always an asm statement here, and I'm controlling whether it's volatile or not.

#include <stdlib.h>
#include <stdio.h>

#ifndef VOLATILEVAR   // compile with -DVOLATILEVAR=volatile  to apply that
#define VOLATILEVAR
#endif

#ifndef VOLATILEASM  // Different from your def; yours drops the whole asm statement
#define VOLATILEASM
#endif

// note I ported this to also be valid C, but I didn't try -xc to compile as C.
size_t count(size_t n)
{
    int dummy;  // asm with no outputs is implicitly volatile
    VOLATILEVAR size_t i = 0;
    sscanf("0", "%zu", &i);  // %zu matches size_t; the call lets i's address escape
    while (i < n) {
        asm  VOLATILEASM ("nop # operand = %0": "=r"(dummy) : :"memory");
        ++i;
    }
    return i;
}

This compiles (with gcc4.9 and newer at -O3, with neither VOLATILE macro enabled) to this weird asm (Godbolt compiler explorer with gcc and clang):

 # gcc8.1 -O3   with sscanf(.., &i) but non-volatile asm
 # the asm nop doesn't appear anywhere, but gcc is making clunky code.
.L8:
    mov     rdx, rax  # i, <retval>
.L3:                                        # first iter entry point
    lea     rax, [rdx+1]      # <retval>,
    cmp     rax, rbx  # <retval>, n
    jb      .L8 #,

Nice job, gcc.... gcc4.8 -O3 avoids pulling an extra mov inside the loop:

 # gcc4.8 -O3   with sscanf(.., &i) but non-volatile asm
.L3:
    add     rdx, 1    # i,
    cmp     rbx, rdx  # n, i
    ja      .L3 #,

    mov     rax, rdx  # i.0, i   # outside the loop

Anyway, without the dummy output operand, or with volatile, gcc8.1 gives us:

 # gcc8.1  with sscanf(&i) and asm volatile("nop" ::: "memory")
.L3:
    nop # operand = eax     # dummy
    mov     rax, QWORD PTR [rsp+8]    # tmp96, i
    add     rax, 1    # <retval>,
    mov     QWORD PTR [rsp+8], rax    # i, <retval>
    cmp     rax, rbx  # <retval>, n
    jb      .L3 #,

So we see the same store/reload of the loop counter; the only difference from volatile i is that the cmp doesn't need to reload it.

I used nop instead of just a comment because Godbolt hides comment-only lines by default, and I wanted to see it. For gcc, it's purely a text substitution: we're looking at the compiler's asm output with operands substituted into the template before it's sent to the assembler. For clang, there might be some effect because the asm has to be valid (i.e. actually assemble correctly).

If we comment out the sscanf and remove the dummy output operand, we get a register-only loop with the nop in it. But keep the dummy output operand and the nop doesn't appear anywhere.
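The implicit-volatile rule used throughout this answer can be condensed into a small sketch. The function names are hypothetical, and the `"+r"` operand is just one possible way to give the asm an output; it is not what the source above uses:

```cpp
#include <cassert>

// Hypothetical names for illustration.

// No output operands: GNU C treats this asm as volatile even without the
// keyword, so it runs wherever the abstract machine says it runs.
inline void compiler_barrier()
{
    asm("" ::: "memory");
}

// One in/out operand and no volatile keyword: the compiler may treat the
// statement as a pure function of x. It can drop it if the result is
// unused, or merge two instances that have the same input.
inline unsigned long pass_through(unsigned long x)
{
    asm("" : "+r"(x));  // empty template, so the value is left unchanged
    return x;
}

// Both calls have the same input, so an optimizer is allowed (not
// required) to evaluate the asm statement only once.
inline unsigned long twice(unsigned long x)
{
    const unsigned long a = pass_through(x);
    const unsigned long b = pass_through(x);
    assert(a == b);
    return a + b;
}
```

Adding `volatile` to the asm in `pass_through` would force every instance to be kept, which is the knob the VOLATILEASM macro controls in the source above.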

edited Jun 20 '18 at 13:49

answered Jun 19 '18 at 21:16

"_This is what volatile means_" Who defined "volatile" has "not in a register"? – curiousguy Jun 21 '18 at 04:59
@curiousguy: ISO C++ did. ISO C++ only talks about memory, not registers, and says that `volatile` accesses have to actually access memory. *Note: volatile is a hint to the implementation to avoid aggressive optimization involving the object because the value of the object might be changed by means undetectable by an implementation* (http://eel.is/c++draft/dcl.type.cv#6), and says C++ is supposed to be like ISO C. – Peter Cordes Jun 21 '18 at 05:13
ISO C11 says: *Accesses to volatile objects are evaluated strictly according to the rules of the abstract machine.* (i.e. memory) http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf N1570 5.1.2.3 Program execution, point 6, and later points go on to give some more details and examples. `volatile` is somewhat implementation-defined (e.g. what "a volatile access" means, e.g. is `(void)foo` an access?), but in GNU C it definitely means that values must be updated in memory, optimizing as if the memory location can change asynchronously (to support the MMIO use-case.) – Peter Cordes Jun 21 '18 at 05:13
"_ISO C++ only talks about memory, not registers_" So ISO C++ doesn't exclude registers. "_actually access memory_" Memory = RAM storage? – curiousguy Jun 21 '18 at 05:28
"_A "memory" clobber is treated like a call to a non-inline function_" That makes sense, and it's probably the only approach of "C/C++"/asm interfacing that makes sense, but where is that spelled out? The documentation says nothing about the scope of "memory" and the modification permitted. – curiousguy Jun 21 '18 at 05:33
@curiousguy: IDK if it's documented exactly what a `"memory"` clobber affects, but a gcc developer explained to me once that an `asm` statement without explicit input operands is assumed not to be able to access things like local variables, exactly because it couldn't have a reference to them. (I forget if that was in a bug report or an SO comment). But anyway it should be the same logic as a non-inline function call, modulo possible diffs wrt. `static` variables. (Using a `static` variable inside an asm template (instead of as an operand) is unsafe because it can be optimized away...) – Peter Cordes Jun 21 '18 at 05:47
@curiousguy: memory as in a location with an address. All objects have addresses in the C abstract machine. I think there's some mention of objects that have never had their address *taken* (i.e. candidate for the now-obsolete `register` keyword), but barring any of that, every object has an address. Registers don't have addresses, so aren't part of memory. – Peter Cordes Jun 21 '18 at 05:50
When a thread is suspended by the kernel (which is the only time its state can reliably be observed), either by an explicit action of that thread like a syscall or by an external event like an interrupt or a timer, the state of all registers is saved in memory. – curiousguy Jul 09 '18 at 00:53
@curiousguy: How is that relevant to anything? Those kernel memory addresses aren't even part of the process's own virtual address space. Are you arguing that a `"memory"` clobber should include all the register values in case the process sleeps there when compiling for a target that does preemptive multitasking? The authors of the C++ standard allow cooperative-multitasking threading implementations; notice that infinite loops are undefined behaviour. Every thread is expected to *eventually* exit or interact with the outside world; those can be scheduling points. – Peter Cordes Jul 09 '18 at 01:12
"_Those kernel memory addresses aren't even part of the process's own virtual address space_" They are accessible with `ptrace` and co. under any reasonable OS. "_The authors of the C++ standard allow cooperative-multitasking threading implementations_" How is that relevant to anything? Those implementations would still save the exact same information somewhere, in the process memory instead of kernel memory. "_Every thread is expected to eventually exit or interact with the outside world; those can be scheduling points._" What isn't a "scheduling point"? – curiousguy Jul 09 '18 at 01:18
@curiousguy: *What isn't a "scheduling point"?* In a *cooperative* multi-threading system, like classic MacOS, only system calls are scheduling points. (That's a term I made up; there's probably an official one). My point was that compiling for a cooperative vs. preemptive multi-threading system doesn't change the meaning of `asm("":::"memory");`. If you want code-gen that assumes variable values are modified with `ptrace` between any two C statements, compile with `gcc -O0`; that's why it spills/reloads everything to memory between statements and keeps nothing in registers. – Peter Cordes Jul 09 '18 at 01:22
So, is an access to an atomic type a scheduling point? ("_That's a term I made up;_" I too make up terms all the time.) – curiousguy Jul 09 '18 at 01:24
@curiousguy: Oh, yes, [accessing an atomic also counts for avoiding UB](https://stackoverflow.com/questions/41320725/is-infinite-loop-still-undefined-behavior-in-c-if-it-calls-shared-library). An implementation could add a `yield()` system call before or after an `atomic` access if it wanted to, and that might be necessary for correctness in some cases (to avoid deadlock if one thread is spinning in a cmpxchg or other spin loop waiting for a condition in memory that another thread needs to create). A smart compiler could probably figure out when yields aren't required, or batch them up. – Peter Cordes Jul 09 '18 at 01:30
@curiousguy: that might not have been the main motivation for infloop UB, though. [N1528: Why undefined behavior for infinite loops?](http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1528.htm) doesn't mention it, only optimization of loops with no side effects when their termination can't be proven. [Are compilers allowed to eliminate infinite loops?](https://stackoverflow.com/q/2178115). C++11 with threads on a cooperative multitasking system (without assistance from the programmer to manually insert `yield()` calls at appropriate places) is *possible* but inefficient without AI compilers. – Peter Cordes Jul 09 '18 at 01:34
@curiousguy: anyway, this was interesting, but it still has no relevance to my answer, AFAICT. Compilers assume that registers aren't asynchronously modified with `ptrace`, regardless of implementation details of multitasking. A GNU C `"memory"` clobber is treated like a call to a function the compiler knows nothing about, so it has to sync actual memory with what the abstract machine would have. Every access to a `volatile` object in the source has to map to an access in the asm, and (in GNU C), even automatic-storage volatile objects always have an address. Preemption is transparent. – Peter Cordes Jul 09 '18 at 01:39
So when `asm` isn't (interpreted as) `volatile`, GCC is free to choose the assembly output of 8.1 or 4.8 (in your answer above)? – HCSF Jul 10 '18 at 17:07
@HCSF: Yes, exactly. The whole point of non-`volatile` is that it's a pure function. If it *does* need to run, then the `"memory"` clobber has to be respected, but the whole asm statement can optimize away entirely or be hoisted out of loops. – Peter Cordes Jul 10 '18 at 21:06
A "pure function" of which parameters? – curiousguy Jul 12 '18 at 22:39
@curiousguy: of the input constraints. (Or if there are no inputs, then it's assumed to produce a constant, like a pure function with no args.) – Peter Cordes Jul 12 '18 at 23:06
