
This is a follow up question to

How to demonstrate Java instruction reordering problems?

There are many articles and blogs referring to Java and JVM instruction reordering which may lead to counter-intuitive results in user operations.

When I asked for a demonstration of Java instruction reordering causing unexpected results, several comments were made to the effect that a more general area of concern is memory reordering, and that it would be difficult to demonstrate on an x86 CPU.

Is instruction reordering just a part of a bigger issue of memory reordering, compiler optimizations and memory models? Are these issues really unique to the Java compiler and the JVM? Are they specific to certain CPU types?

Gonen I
  • Well, no. Your question makes it seem like it's a Java-only problem, but race conditions are possible in every language, and depending on the optimizations used by the compiler, they can surface if you did not pay attention while coding. And then the CPU architecture comes along, but that can still be chalked up to "the compiler messed it up" (because there's a different compiler for every CPU architecture). – Shark Oct 14 '21 at 10:26
  • Very rarely does the compiler break your code; it just optimizes what you wrote (caveat: C++ optimizations can actually break code), so if the compiler reorders your code wrongly, you did not put in the necessary synchronization mechanisms in the first place. – Shark Oct 14 '21 at 10:27
  • @Shark: If optimizations "break your code", it was already broken and just happens-to-work in some cases, e.g. with debug builds that store / reload everything to memory between statements. (Java doesn't have an equivalent to un-optimized builds, so I guess Java programmers never get the wrong idea that their code works in the first place in as many cases. Of course as an answer on the querent's linked previous question indirectly shows, you could by chance get release/acquire synchronization from lack of compile-time reordering on x86, and have it break on ARM / everything else.) – Peter Cordes Oct 14 '21 at 13:27
  • @GonenI, Unfortunately, the topics of instruction reordering and memory (in) visibility between threads, are everywhere, down to below the instruction level. The best _you_ can do is first to assume the worst and then get it confirmed by [the memory model](https://en.wikipedia.org/wiki/Memory_model_(programming)) of your language. Then, use the dedicated constructs of your language that _do_ give guarantees, after reading their documentation well. All else is implementation detail that your compiler can handle for you, to deliver excellent performance! – Emmef Oct 14 '21 at 13:28
  • @PeterCordes while i agree, you have to admit there's a significant difference between `-O1`, `-O2` and `-O3`. to put it simply, some optimizations can break code which *normally* works fine. But i do agree, when optimizations break your code - fix the code, not the makefile. – Shark Oct 14 '21 at 16:34
  • 1
    @Shark: In C there's no such thing as "normally works fine". Modern compilers aggressively optimize based on the assumption of no undefined behaviour, so for correctness you can't usefully think in terms of the assembly language equivalent, e.g. for signed overflow detection: you need to avoid causing it in the first place. If your code is broken with `-O3` on some compiler, it could just as easily be broken with `-O1` on another compiler. (Only `-O0` is special for memory-ordering stuff because of not keeping values in registers across statements, and that's not anything you'd "normally" do.) – Peter Cordes Oct 14 '21 at 20:11
  • @PeterCordes we could debate further (by e.g. taking embedded compilers into account) but we agree on this - when optimizations start breaking something, they're not the root cause, and the root cause should be fixed. Valgrind and other stuff should be used to identify potential caveats and the like. – Shark Oct 15 '21 at 09:51
  • @Shark: yup, agreed. BTW, many people (e.g. \@supercat) are unhappy that modern C has evolved in this direction, to be full of unexpected landmines placed by compiler developers because the ISO language standard gives them license to break things. In early compilers, things Just Worked unless you actually were compiling for a machine where, for example, the memory model wasn't flat, causing potential weirdness when subtracting pointers to different objects. But now we can't assume that's safe (without casting to `uintptr_t`) even if we only care about flat memory models, not portability. – Peter Cordes Oct 15 '21 at 10:04
  • 1
    @Shark: So you really have to know ISO C and C++, not just write things that "should obviously work", to write safe code for modern C and C++ compilers, respectively. The whole situation basically sucks, although it does let compilers make good asm for code that's written safely. – Peter Cordes Oct 15 '21 at 10:06

1 Answer


Memory reordering is possible without compile-time reordering of operations in source vs. asm. The order of memory operations (loads and stores) to coherent shared cache (i.e. memory) done by a CPU running a thread is also separate from the order it executes those instructions in.

Executing a load means accessing cache (or the store buffer), but executing a store in a modern CPU is separate from its value actually becoming visible to other cores (commit from the store buffer to L1d cache). Executing a store is really just writing the address and data into the store buffer; commit isn't allowed until after the store has retired, and thus is known to be non-speculative, i.e. definitely happening.

Describing memory reordering as "instruction reordering" is misleading. You can get memory reordering even on a CPU that does in-order execution of asm instructions (as long as it has some mechanisms to find memory-level parallelism and let memory operations complete out of order in some ways), even if asm instruction order matches source order. Thus that term wrongly implies that merely having plain load and store instructions in the right order (in asm) would be useful for anything related to memory order; it isn't, at least on non-x86 CPUs. It's also weird because "instructions" have effects on registers (at least loads do, and on some ISAs with post-increment addressing modes, stores can, too).

It's convenient to talk about something like StoreLoad reordering as x = 1 "happening" after a tmp = y load, but the thing to talk about is when the effects happen (for loads) or are visible to other cores (for stores) in relation to other operations by this thread. But when writing Java or C++ source code, it makes little sense to care whether that happened at compile time or run-time, or how that source turned into one or more instructions. Also, Java source doesn't have instructions, it has statements.

Perhaps the term could make sense to describe compile-time reordering between bytecode instructions in a .class vs. the JIT-compiler-generated native machine code, but if so it's a misuse to apply it to memory reordering in general, rather than just compile/JIT-time reordering excluding run-time reordering. It's not super helpful to highlight just compile-time reordering, unless you have signal handlers (like POSIX) or an equivalent that runs asynchronously in the context of an existing thread.


This effect is not unique to Java at all. (Although I hope this weird use of "instruction reordering" terminology is!) It's very much the same as in C++ (and I think C# and Rust for example; probably most other languages that want to compile efficiently by default, and require special stuff in the source to specify when you want your memory operations ordered wrt. each other and promptly visible to other threads). https://preshing.com/20120625/memory-ordering-at-compile-time/
C++ defines even less than Java about access to non-atomic<> variables without synchronization to ensure that there's never a write in parallel with anything else (undefined behaviour¹).
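To make the data-race point concrete, here is a small sketch of my own (class and field names are mine, not from the answer): two threads doing plain unsynchronized increments of a shared field. In Java this is a data race with a well-defined-but-useless outcome (updates are silently lost); the equivalent C++ code would be outright undefined behaviour.

```java
// Sketch: plain, unsynchronized read-modify-write from two threads.
// The final count can be anywhere from 2 up to 2 * ITERS -- no
// exception, the lost updates are just silently dropped.
public class LostUpdates {
    static int counter = 0;          // plain field: not volatile, not atomic
    static final int ITERS = 100_000;

    static int race() throws InterruptedException {
        counter = 0;
        Runnable bump = () -> {
            for (int i = 0; i < ITERS; i++) counter++;  // load, add, store: not atomic
        };
        Thread a = new Thread(bump), b = new Thread(bump);
        a.start(); b.start();
        a.join();  b.join();        // join() gives happens-before for reading counter
        return counter;
    }

    public static void main(String[] args) throws InterruptedException {
        int n = race();
        // On a multi-core machine n is usually well below the maximum.
        System.out.println("counter = " + n + " (max possible " + 2 * ITERS + ")");
    }
}
```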

And memory reordering is even present in assembly language, where by definition there's no reordering between source and machine code. All SMP CPUs except a few ancient ones (like 80386) also do memory reordering at run-time, so lack of instruction reordering doesn't gain you anything, especially on machines with a "weak" memory model (most modern CPUs other than x86): https://preshing.com/20120930/weak-vs-strong-memory-models/ - x86 is "strongly ordered", but not SC: it's program order plus a store buffer with store forwarding. So if you want to actually demo the breakage from insufficient ordering in Java on x86, it's either going to be compile-time reordering, or lack of sequential consistency via StoreLoad reordering or store-buffer effects. Other unsafe code, like the accepted answer on your previous question, that might happen to work on x86 will fail on weakly-ordered CPUs like ARM.
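The classic StoreLoad litmus test can be sketched in Java like this (class and field names are my own; a thread-per-iteration demo has a tiny race window, so a harness like jcstress is the serious way to actually observe the forbidden-looking outcome):

```java
// StoreLoad litmus test sketch. Thread A: x = 1; r1 = y.
// Thread B: y = 1; r2 = x. With plain (non-volatile) fields, the
// outcome r1 == 0 && r2 == 0 is allowed: on x86 it can happen at
// run-time via the store buffer, and on any ISA the JIT may also
// reorder the plain accesses at compile time.
public class StoreLoad {
    static int x, y, r1, r2;   // plain fields: no ordering guarantees

    // Runs one iteration; returns true if both threads read the
    // other's variable as still 0.
    static boolean iteration() throws InterruptedException {
        x = 0; y = 0;
        Thread a = new Thread(() -> { x = 1; r1 = y; });
        Thread b = new Thread(() -> { y = 1; r2 = x; });
        a.start(); b.start();
        a.join();  b.join();   // join() makes r1/r2 safely readable here
        return r1 == 0 && r2 == 0;
    }

    public static void main(String[] args) throws InterruptedException {
        int reordered = 0, iters = 10_000;
        for (int i = 0; i < iters; i++)
            if (iteration()) reordered++;
        // Thread startup cost makes the window small, so this count may
        // well be 0; any nonzero value demonstrates the reordering.
        System.out.println(reordered + " / " + iters + " runs saw r1 == r2 == 0");
    }
}
```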

(Fun fact: modern x86 CPUs aggressively execute loads out of order, but check to make sure they were "allowed" to do that according to x86's strongly-ordered memory model, i.e. that the cache line they loaded from is still readable, otherwise roll back the CPU state to before that: machine_clears.memory_ordering perf event. So they maintain the illusion of obeying the strong x86 memory-ordering rules. Other ISAs have weaker orders and can just aggressively execute loads out of order without later checks.)

Some CPU memory models even allow different threads to disagree about the order of stores done by two other threads. The C++ memory model allows that too, which is why extra barriers on PowerPC are needed only for sequential consistency (atomic with memory_order_seq_cst, like Java volatile), not for acquire/release or weaker orders.


Footnote 1: C++ UB means not just an unpredictable value loaded, but that the ISO C++ standard has nothing to say about what can/can't happen in the whole program at any time before or after UB is encountered. In practice for memory ordering, the consequences are often predictable (for experts who are used to looking at compiler-generated asm) depending on the target machine and optimization level, e.g. hoisting loads out of loops breaking spin-wait loops that fail to use atomic. But of course you're totally at the mercy of whatever the compiler happens to do when your program contains UB, not at all something you can rely on.


Caches are coherent, despite common misconceptions

However, all real-world systems that Java or C++ run multiple threads across do have coherent caches; seeing stale data indefinitely in a loop is a result of compilers keeping values in registers (which are thread-private), not of CPU caches not being visible to each other. This is what makes C++ volatile work in practice for multithreading (but don't actually do that because C++11 std::atomic made it obsolete).

Effects like never seeing a flag variable change are due to compilers optimizing global variables into registers, not instruction reordering or CPU caching. You could say the compiler is "caching" a value in a register, but you can choose other wording that's less likely to confuse people who don't already understand thread-private registers vs. coherent caches.
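A minimal sketch of that flag effect (names are mine): with a plain boolean the JIT is allowed to hoist the load out of the loop, so the spinner can run forever; the version below uses volatile, which forces a fresh load each iteration, and the broken variant differs only by dropping that keyword.

```java
// With a plain `boolean running`, the JIT may hoist the load out of
// the loop (the value gets "cached" in a register), so the spinning
// thread can loop forever. volatile forbids that hoisting; the
// coherent CPU caches were never the problem.
public class FlagDemo {
    static volatile boolean running = true;  // remove volatile and the spinner may never stop

    static boolean stopsWithinMillis(long timeout) throws InterruptedException {
        running = true;
        Thread spinner = new Thread(() -> {
            while (running) { }   // volatile read: re-loaded every iteration
        });
        spinner.start();
        Thread.sleep(50);         // give the loop time to get JIT-compiled and hot
        running = false;          // volatile write: promptly visible to the spinner
        spinner.join(timeout);
        return !spinner.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("spinner stopped: " + stopsWithinMillis(5000));
    }
}
```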


Footnote 2: When comparing Java and C++, also note that C++ volatile doesn't guarantee anything about memory ordering, and in fact in ISO C++ it's undefined behaviour for multiple threads to be writing the same object at the same time even with volatile. Use std::memory_order_relaxed if you want inter-thread visibility without ordering wrt. surrounding code.

(Java volatile is like C++ std::atomic<T> with the default std::memory_order_seq_cst, and AFAIK Java provides no way to relax that to do more efficient atomic stores, even though most algorithms only need acquire/release semantics for their pure-loads and pure-stores, which x86 can do for free. Draining the store buffer for sequential consistency costs extra. Not much compared to inter-thread latency, but significant for per-thread throughput, and a big deal if the same thread is doing a bunch of stuff to the same data without contention from other threads.)
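(As the later comments point out, Java 9+ does expose weaker access modes through the VarHandle API. A minimal release/acquire message-passing sketch, with class and field names of my own choosing:)

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// setRelease / getAcquire give release/acquire ordering, which is
// cheaper than a volatile (seq_cst) store: no store-buffer drain
// needed on x86. If the acquire load sees flag == 1, the plain
// payload store is guaranteed visible.
public class AcqRel {
    static int payload;          // plain field, published via the flag
    static int result;           // where the consumer deposits what it read
    static int flag;             // accessed only through the VarHandle below
    static final VarHandle FLAG;
    static {
        try {
            FLAG = MethodHandles.lookup()
                    .findStaticVarHandle(AcqRel.class, "flag", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    static int handoff() throws InterruptedException {
        payload = 0; flag = 0; result = -1;
        Thread producer = new Thread(() -> {
            payload = 42;         // plain store, ordered before the release store
            FLAG.setRelease(1);   // release store: publishes payload
        });
        Thread consumer = new Thread(() -> {
            while ((int) FLAG.getAcquire() == 0) Thread.onSpinWait();
            result = payload;     // acquire load saw 1, so 42 is guaranteed visible
        });
        consumer.start(); producer.start();
        producer.join();  consumer.join();  // join() lets main safely read result
        return result;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("consumer read payload = " + handoff());  // 42
    }
}
```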

Peter Cordes
  • `C++ UB means not just an unpredictable value loaded, but that the ISO C++ standard has nothing to say about what can/can't happen in the whole program at any time before or after UB is encountered.` . . . . . "and that compiler authors can do whatever they want in such a case". UB is UB, and some of them will behave differently when compiled with different compilers. – Shark Oct 14 '21 at 16:39
  • 1
    @Shark: Exactly. That gives compilers license to optimize based on the assumption that it won't happen, like [Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?](https://stackoverflow.com/q/47510783) when vectorizing even on x86 where unaligned is safe in asm. And also [Does the C++ standard allow for an uninitialized bool to crash a program?](https://stackoverflow.com/q/54120862). Or in this case, it's what lets compilers ignore what other threads might see in memory for non-atomic objects if they looked any time they're not allowed to, because of data race UB. – Peter Cordes Oct 14 '21 at 20:15
  • I just felt it was necessary to say that (well, add) that UB doesn't behave uniformly across different platforms AND compilers. It's not standardized so everyone does what they think is best in that case. That little tiny detail can sometimes really get you :D – Shark Oct 15 '21 at 09:43
  • 1
    @Shark: I think I found the part of my answer you were objecting to: right *after* what you quoted, where I suggested that UB effects were often predictable for experts. I reworded that some. Yes, of course it's not uniform across platforms or compilers. Although the optimizations that affect data-race UB (e.g. whether a store or loads happens or not) are often fairly straightforward and done by any sane compiler (hosting loads of loop-invariant things), if the code isn't too complex. – Peter Cordes Oct 15 '21 at 09:55
  • Unlike most other cases of UB where you can understand after the fact by looking at asm, but might not have guessed exactly how it would break on a given compiler / target / optimization-level combo. And I'm definitely not endorsing writing programs that contain UB based on understanding what's going to happen, I really just meant that (as someone who looks at asm output from gcc/clang all the time), I can anticipate how something would break. Or if I want to demonstrate why something is unsafe, I can usually come up with C that compiles to asm that manifests the problem. – Peter Cordes Oct 15 '21 at 09:58
  • Wasn't objecting at all, was just adding that it can/will work/manifest differently across different compilers. Yes, most compilers will behave uniformly in such that they will be emitting code that will be broken. But the way it's broken will differ from compiler to compiler. Thats all i meant to add so the readers don't get the idea that *that* part is uniform across different compilers. "Here be dragons" and all. And then it's worth mentioning the difference between **undefined behaviour** and **unspecified behaviour** as well :D – Shark Oct 15 '21 at 10:10
  • @Shark: Right, not objecting, but thinking needed further clarification that my answer wasn't already providing. Just out of curiousity, did you have an example in mind of a data-race UB effect that breaks in *different* ways on different compilers without happening to work on any of them? The example I cited in my answer (of a `while(keep_running){}` [loop being infinite](https://stackoverflow.com/q/58516052/224132)) breaks the same way on every compiler. The insidious thing about UB is that it's not *guaranteed* to break, and in practice often doesn't. – Peter Cordes Oct 15 '21 at 10:15
  • Not off the top of my head, but there's some code in one of my answers, about reading the whole file... with the method `slurp()` which "seems to work" but ends up smashing the stack sometimes, and breaks differently on Mac/Windows alone. Without adding different compilers into the mix :D Maybe `int a = 0; int b = 15 % a` or `signed int x = -1234; x >> 1` or `char ch = 1234` or maybe something stupid like `double d[1024][1024][1024][1024]` these should all be unspecified behaviours which are clearly broken just by looking at them. – Shark Oct 15 '21 at 10:29
  • @Shark: My answer was talking specifically about data-race UB. Yes, certainly out-of-bounds access to arrays to step on stack memory can break in different ways, even when it does break everywhere. The way it breaks (overwriting other stuff on the stack) is even more obvious than memory-order stuff, but the exact symptoms of course depend on what gets overwritten. Note that `char ch = 1234;` produces an implementation-defined *value* (or signal), but not undefined *behaviour*; out-of-bounds narrowing integer conversions aren't UB. – Peter Cordes Oct 15 '21 at 10:31
  • and that method i mentioned from [this answer](https://stackoverflow.com/questions/3747086/reading-the-whole-text-file-into-a-char-array-in-c/36547932#36547932) - is one of those that "work normally but break sometimes". no, it's not good code, but it just smashes the stack every once in a while :D as for data races, nope, i don't have any off the top of my head. been burned plenty of times to not do things as they should be done. – Shark Oct 15 '21 at 10:35
  • A small note since this topic is about Java: in Java, it is not allowed to run into UB. A read must see a value written by a write: either a write that is ordered before it by the happens-before order, or a write it is in a data race with. – pveentjer Feb 19 '22 at 04:52
  • 2
    Java does provide more relaxed access modes. So we have acquire loads and release stores. And we also have opaque load/store (comparable but not the same as memory_order_relaxed). These relaxed modes can be accessed through Unsafe or through the VarHandle API. But, unfortunately, the relaxed access modes are not part of the JMM, so officially they are classified as data races. – pveentjer Feb 19 '22 at 04:55
  • It is also possible to use fences directly. Again through the VarHandle API and Unsafe. – pveentjer Feb 19 '22 at 04:56