1

According to cppreference, a store to one volatile-qualified variable cannot be reordered with respect to another volatile-qualified access. In other words, in the example below, when y becomes 20, it is guaranteed that x will be 10.

volatile int x, y;
...
x = 10;
y = 20;

According to Wikipedia, on an ARM processor one store can be reordered after another store. So, in the example below, the second store can be executed before the first, since the destinations are disjoint and the stores can therefore be freely reordered.

str     r1, [r3]
str     r2, [r3, #4]

With this understanding, I wrote a toy program:

volatile int x, y;

int main() {
    x = 10;
    y = 20;
}

I expected some fence instructions to be present in the generated assembly to guarantee the store order of x and y. But the generated assembly for ARM was:

main:
        movw    r3, #:lower16:.LANCHOR0
        movt    r3, #:upper16:.LANCHOR0
        movs    r1, #10
        movs    r2, #20
        movs    r0, #0
        str     r1, [r3]
        str     r2, [r3, #4]
        bx      lr
x:
y:

So, how is the store order enforced here?

Sourav Kannantha B
  • `volatile` accesses forbid *compile-time* reordering, which is normally sufficient for MMIO accesses to uncacheable memory. Not run-time. Using [`volatile` for inter-thread communication](https://stackoverflow.com/questions/4557979/when-to-use-volatile-with-multi-threading/58535118#58535118) is not recommended post C++11, but is similar to rolling your own atomic load/store with `std::memory_order_relaxed`, because there are no run-time ordering guarantees or ordering wrt. non-volatile accesses. – Peter Cordes Jul 05 '22 at 14:20
  • the store of x goes out before the store of y, if those could go out of order it would be outside the processor and instruction set. Now granted this is a C++ question specifically but certainly for C what volatile means is opinion based and as such implementation defined. clang and gcc have a different opinion of volatile for example and can generate different code. – old_timer Jul 05 '22 at 14:22
  • the code generated looks correct from the high level code (using an anchor) – old_timer Jul 05 '22 at 14:24
  • @PeterCordes How does preventing compile-time reordering help MMIO operations if run-time reordering can still happen? – Sourav Kannantha B Jul 05 '22 at 14:30
  • Uncacheable memory regions used for MMIO normally have stronger memory-ordering semantics than normal write-back-cacheable. – Peter Cordes Jul 05 '22 at 14:34
  • In practice (and in my theory) volatile only provides guarantees to programs under `ptrace` control; `ptrace` will only show and change the memory from CPU POV. The RAM POV may be entirely different; RAM may even never see short term volatile variables. – curiousguy Dec 02 '22 at 21:44

4 Answers

7

So, in the example below, the second store can be executed before the first, since the destinations are disjoint and the stores can therefore be freely reordered.

The volatile keyword limits the reordering (and elision) of instructions by the compiler, but its semantics don't say anything about visibility from other threads or processors.

When you see

        str     r1, [r3]
        str     r2, [r3, #4]

then volatile has done everything required. If the addresses of x and y are I/O mapped to a hardware device, it will have received the x store first. If an interrupt pauses operation of this thread between the two instructions, the interrupt handler will see the x store and not the y. That's all that is guaranteed.


The memory ordering model only describes the order in which effects become observable from other processors. It constrains not the sequence in which instructions are issued (which is the order they appear in the assembly code), but the order in which they are committed (i.e., when a store becomes externally visible).

It is certainly possible that a different processor could see the result of the y store before the x - but volatile is not and never has been relevant to that problem. The cross-platform solution to this is std::atomic.


There is unfortunately a load of obsolete C code available on the internet that does use volatile for synchronization - but this is always a platform-specific extension, and was never a great idea anyway. Even less fortunately the keyword was given exactly those semantics in Java (which isn't really used for writing interrupt handlers), increasing the confusion.

If you do see something using volatile like this, it's either obsolete or was incompetently translated from Java. Use std::atomic, and for anything more complex than simple atomic load/store, it's probably better (and is certainly easier) to use std::mutex.

Useless
  • If the second instruction executes before the first, then how will an interrupt handler see the `x` store and not the `y` store? You are even saying other threads may see the `y` store before the `x` store. What is the difference between another thread and an interrupt handler, since both preemptively pause the execution? – Sourav Kannantha B Jul 05 '22 at 14:27
  • The interrupt handler is running on the same core, with the same instruction pipeline and L1 cache: store reordering is defined to be transparent _within_ a hardware thread, because otherwise no single-threaded code could possibly work. – Useless Jul 05 '22 at 14:30
  • Real CPUs use a [store buffer](https://stackoverflow.com/questions/64141366/can-a-speculatively-executed-cpu-branch-contain-opcodes-that-access-ram) to decouple L1d cache misses and updates from exec and retirement. This allows speculative execution of stores, and also means stores don't have to commit to L1d cache until *after* retirement. – Peter Cordes Jul 05 '22 at 14:30
  • In computer architecture, "retired" normally means the order they leave the out-of-order back-end (or the end of an in-order pipe). That's in-order, even on an OoO exec CPU, to maintain a consistent state that we can roll back to at any point on exceptions or interrupts. But commit from the store buffer to L1d can be out-of-order if the mem model allows. – Peter Cordes Jul 05 '22 at 14:30
  • @SouravKannanthaB: A single core always preserves the illusion of executing instructions in program order, for the one thread that's running on it. That's the cardinal rule of out-of-order execution. In this case, the relevant mechanism is that loads snoop the store buffer, and do store-forwarding from any stores older than them which (partially) overlap with the load. e.g. on x86, see https://blog.stuffedcow.net/2014/01/x86-memory-disambiguation/ – Peter Cordes Jul 05 '22 at 14:33
  • Note that in the case of hardware devices, the memory in question should have been marked as [device memory](https://developer.arm.com/documentation/102376/0100/Device-memory), which disables caching as well as memory reordering for that region. So in that case, it does suffice to simply execute the instructions in the correct program order, which is exactly what `volatile` ensures. – Nate Eldredge Oct 30 '22 at 17:51
3

The key to understanding volatile is that it's not a tool to obtain defined concurrency semantics (and indeed, unsynchronised concurrent access to volatile variables constitutes undefined behaviour), but rather to perform memory accesses that could have side effects the compiler is not aware of.

This is the reason volatile was originally introduced: while Ken Thompson's original C compiler did not perform any significant transformations that would eliminate or change memory accesses, other compiler vendors developed such optimisations and found that they would break C code accessing hardware registers and the like. So volatile was introduced as a vendor extension to indicate “do not optimise accesses to this variable, I'm doing something the compiler doesn't understand.”

Such variables come in four main flavours:

  • memory accesses that have side effects or are not idempotent, e.g. to hardware registers
  • memory accesses that should not be optimised away even if the compiler sees no use for them, e.g. an accumulator used for a running sum in a benchmark, where the compiler may optimise out the entire benchmark if it finds that it can discard the accumulator
  • variables that may be concurrently modified during handling of an asynchronous signal (use volatile sig_atomic_t for these)
  • memory accesses to variables that may be modified by external means unknown to the compiler, e.g. variables that you want to change at runtime using a debugger or other tool

As the other answers have already noted, before the introduction of std::atomic and well-defined concurrency semantics in C and C++, the volatile qualifier was the best thing to use for atomic variables that could be modified by other threads concurrently. The precise semantics of volatile in this regard were never really well-defined, but telling the compiler that “I know what I'm doing” and using appropriate compiler-specific synchronised access functions and memory barriers would usually do the trick in practice.

But ever since the introduction of std::thread and std::atomic, volatile is no longer the right tool for this task. You'll see it being used in lots of legacy code though.

fuz
2

volatile accesses only forbid compile-time reordering, not run-time. That's normally sufficient for MMIO accesses to uncacheable memory. (Uncacheable MMIO accesses normally have stronger memory-ordering semantics than cacheable memory.)

volatile is only the right tool for the job for MMIO access, or for getting well-defined semantics within a single thread (e.g. wrt. a signal handler via volatile sig_atomic_t.) Within a single thread, you're only reloading your own stores, so the CPU has to preserve the illusion of your instructions running in program order, regardless of what memory reordering is visible from other cores observing order of global visibility of its stores.


Using volatile for inter-thread communication is not recommended post C++11 (and is in fact data-race UB in ISO C++). But in practice it mostly works, and is similar to rolling your own atomic load/store with std::memory_order_relaxed, because there are no run-time ordering guarantees. There's also no portable guarantee of atomicity with volatile, although some compilers like GCC do choose to implement a volatile access up to register width as a single store instruction, even in cases where they'd otherwise store two separate halves of a non-volatile variable (e.g. for uint64_t on AArch64 when storing certain constants). Since Linux kernel code uses volatile to roll its own atomic load/store, this presumably supports that use-case.

(Being like relaxed is true even on x86 where the hardware / asm model is program-order + a store-buffer with store forwarding. There's no C++ ordering guarantee at all wrt. non-volatile accesses, so compile-time reordering is allowed to break what would otherwise be release/acquire. BTW, this is presumably where MSVC's old-style volatile semantics came from, which did actually guarantee release/acquire semantics, in the bad old days before C++11 provided a standard way to get that. MSVC used to only target x86, and presumably didn't do compile-time reordering across volatile accesses. Fun fact: if you compile with modern MSVC with /volatile:ms, it will use barriers around volatile accesses when targeting ARM.)


Also semi-related: Who's afraid of a big bad optimizing compiler? - without volatile, just using compiler barriers to force memory access, you can get some surprising shenanigans if rolling your own atomics, like the Linux kernel still does.

Peter Cordes
2

volatile accesses only forbid compile-time reordering, not run-time.

It is therefore required but not sufficient to guarantee a fixed order.

If the volatile variable is in normal memory, then you shouldn't be using volatile at all, but rather std::atomic or std::mutex to make the data safe for threads. Without threads, any reordering in the CPU isn't observable.

If the volatile is for MMIO registers then you also have to set up your page tables to mark them as strictly ordered device memory. That prevents the CPU from reordering them.

Note: exact flags depend on the ARM/ARM64 version and page table format you are using.

PS: On a Raspberry Pi 1 you also need barriers whenever you switch between peripherals, as the bus they are connected to will reorder reads between peripherals without telling the CPU, and you get bad data.

Goswin von Brederlow