
If I understood correctly, a stall/freeze in a pipeline causes a clock cycle to go to waste. If a few cycles (out of billions in a second) are wasted, it may not be a big deal or even measurable in terms of performance. But I was curious which operations cause a pipeline bubble. Would a load barrier result in halting the pipeline, because data needs to be fetched from memory or the L3 cache, or is a bubble only created because the current instruction has a dependency on the result of the previous one? Something like:

int score = randGen.nextInt(6);
int pay = score * 100; // dependent on the result of the previous instruction

If my assumption above that a memory barrier causes a bubble is incorrect, does that mean that incorrect branch prediction, false sharing, and load/store barriers leave CPUs worse off (fetching data from memory and flushing load and store buffers)? (A slightly expanded version of the example above follows.)
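
For concreteness, a minimal sketch of that contrast (the extra variables are invented purely for illustration): the first pair of statements forms a dependency chain, while the second pair is independent work that an out-of-order CPU can overlap with a slow load.

import java.util.Random;

public class DependencyDemo {
    public static void main(String[] args) {
        Random randGen = new Random();

        int score = randGen.nextInt(6); // a call involving loads and some math
        int pay = score * 100;          // depends on score: waits only if score isn't ready yet

        int hours = 7 * 24;             // independent of score: an out-of-order CPU
        int rate = hours + 1;           // can execute these while a slow load completes

        System.out.println(pay + " " + hours + " " + rate);
    }
}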

Abidi
  • That's not a memory *barrier* (ordering of this core's access to its L1d cache), that's just a dependency on a load. Err, on a function return value, which probably involves a load and some math, and a store. Unless your randGen is sharing a single generator across cores and having to do locking. Re: out-of-order exec basics, see [Modern Microprocessors A 90-Minute Guide!](https://www.lighterra.com/papers/modernmicroprocessors/). And for a deep dive (with nice diagrams) into a typical modern-ish CPU, see https://www.realworldtech.com/haswell-cpu/ and https://agner.org/optimize/ – Peter Cordes Jun 05 '22 at 03:52
  • @PeterCordes My bad, the above code example was meant to be about a pipeline stall: the pipeline executing the second line would require the result from the first one, hence a pipeline stall? – Abidi Jun 05 '22 at 06:22
  • Even an in-order pipeline would only stall if the result wasn't ready. A random number generator doesn't have a huge amount of state, and if you use it often it tends to stay hot in cache. And if you don't use it often, the occasional cache miss when you do use it hopefully doesn't stall for very long. The first rule of thumb with caches is that caches work, and work well most of the time :P – Peter Cordes Jun 05 '22 at 06:26
  • But yeah, if the PRNG did a load that missed in cache all the way to DRAM, that would likely be too much latency even for out-of-order exec to hide. But the whole pipeline wouldn't stall right away (unless like I said it was an in-order CPU, like the efficiency cores on some phones). The CPU will keep decoding later instructions, looking for independent work it can get started on, until it can eventually retire the cache-miss load and whatever instruction was dependent on the result and make room in its out-of-order instruction window (ROB) to see some new instructions. – Peter Cordes Jun 05 '22 at 06:29
  • @PeterCordes Makes sense. Is there support (in terms of instructions) from Intel hardware to see what's going on in these modules while a program executes, if not live then maybe a summary at the end of a program? For example, when and where the branch predictor failed, how many times memory barriers were executed, and what kind of impact (in terms of nuking load/store buffers and pipelines) they caused? I understand that it may not be the easiest thing to link assembly instructions to Java code. – Abidi Jun 05 '22 at 06:50
  • There are hardware performance counters that you can program to count events *while* a program executes. Under Linux, `perf stat java -jar foo.jar` will profile the whole process, for events including `branches` and `branch-misses`. `perf record` / `perf report` can statistically sample, recording a machine-code address every 10k events for example, so instructions with a lot of hits caused a lot of problems. – Peter Cordes Jun 05 '22 at 06:56
  • (But `perf report` won't be able to find the JITed machine code created on the fly by the JVM, so you'd need some JVM-aware tools to make this work. `perf report` works for ahead-of-time compiled programs. And even then it sometimes does a bad job without debug symbols of letting you browse disassembly.) Anyway, I wouldn't be surprised if there is some Java tooling for using perf counters, perhaps as part of Intel's VTune, and maybe open source stuff built on the same Linux APIs as `perf`. See https://www.brendangregg.com/Slides/JavaOne2016_JavaFlameGraphs.pdf – Peter Cordes Jun 05 '22 at 06:57
  • Ah good, https://www.brendangregg.com/perf.html#JIT_Symbols has some mention of Java and JIT engines, so there is some tooling for it. Not sure what to recommend for an intro to `perf`. To really make full use of it, you'd want to understand how your CPU works, to know what bottlenecks to look for, and what impact they might be having (and what perf events to count to test a theory). I guess the basic events like branch misses are not that hard to grasp. There's lots you can probably google about Intel PMU (performance monitoring unit) events and related things. – Peter Cordes Jun 05 '22 at 07:02
  • @PeterCordes A follow-up question to this: can a CPU flush its write buffer to cache/memory in the absence of memory-barrier instructions? Let's say a program writes to three variables a, b and c respectively, the CPU reorders the first two writes, and decides to flush its write buffer soon after writing to variable b. This means the value in b has been exposed to other CPUs but a and c are still local to that CPU? Unless out-of-order execution is allowed but not pushing to cache/memory? – Abidi Jul 02 '22 at 22:10
  • A CPU *always* tries to drain the store buffer as quickly as possible (one store at a time), so it has room to absorb a burst of stores without stalling. You only need a memory barrier if you want to make other operations wait for that to happen. Java `volatile` implies ordering, though, so it will use "release" stores which can only commit from the store buffer in program order. (Otherwise yeah, most ISAs other than x86 allow stores to commit out of order, which can happen in practice on cache miss. https://preshing.com/20120710/memory-barriers-are-like-source-control-operations) – Peter Cordes Jul 02 '22 at 22:16
  • See also [Does hardware memory barrier make visibility of atomic operations faster in addition to providing necessary guarantees?](https://stackoverflow.com/a/61591912) - no direct effect. – Peter Cordes Jul 02 '22 at 22:18
  • It means in the absence of a memory barrier, a CPU could drain a store buffer out of order. In my case above, variable b could be visible to other CPUs before a and c? Scary! – Abidi Jul 03 '22 at 07:07
  • Yes, on a weakly-ordered ISA like AArch64, sure. How is that scary, though? If you're writing in a high-level language, not asm, you need to use some language feature to get any guarantee of the order your memory operations will become visible to other threads; if you don't, optimization could result in compile-time reordering, with asm program order being different from high-level source program order. Especially if there are multiple operations on any of the variables, a compiler might only actually store the final one. – Peter Cordes Jul 03 '22 at 07:14
  • Non-atomic / non-volatile variables in Java give very little guarantee about inter-thread visibility, but `volatile` operations can order others wrt. themselves, allowing acquire/release sync. In C++, it's fully undefined behaviour to have a data race on a non-`atomic` variable. In hand-written asm, loads and stores are like C++ `memory_order_relaxed` atomics unless you use special instructions (e.g. AArch64 `ldapr` acquire load without sequential-consistency) or barriers. Or you're on an ISA like x86 that guarantees program-order memory operations, except for the store buffer. – Peter Cordes Jul 03 '22 at 07:18
  • So yeah, it's scary in asm, especially PowerPC where shenanigans like IRIW reordering are possible. [Will two atomic writes to different locations in different threads always be seen in the same order by other threads?](https://stackoverflow.com/a/50679223). But if you're using Java, you either have pretty strong guarantees or you have zilch, you can't even weaken it to an equivalent of C++ `memory_order_acq_rel`, only `seq_cst` as far as I know. So your code is either totally unsafe and probably broken, or portable. Or happens-to-work on most ISAs, but with possibility of failure. – Peter Cordes Jul 03 '22 at 07:22
  • I was trying to build IPC via memory-mapped files; there are 2 MMFs in my case. When the producer process finishes writing to the second file, I write to a volatile variable, which means data from both writes gets pushed to both MMFs. My concern was: what if the CPU decides to execute the writes before the volatile out of order and pushes them to memory out of order as well? This would confuse/break the consumer process. What I need is an ordering guarantee (not necessarily visibility; "half volatile") between the 2 writes before the write to the volatile variable. I guess VarHandle.releaseFence should do the trick (see the sketch after these comments). – Abidi Jul 03 '22 at 10:14
  • Terminology: Executing a store instruction just puts the data in the store buffer. It's *commit* from store-buffer to L1d that makes it globally visible, which happens after retirement (so it's known to be non-speculative). But yeah, stores can commit out of order on ISAs that allow that, if you don't write code that stops that from happening. – Peter Cordes Jul 03 '22 at 10:35
  • *I write to a volatile variable, which means data from both writes gets pushed to both mmfs.* Not really, although on many ISAs other than AArch64, a `volatile` store does make this thread stall until earlier stores have committed (full memory barrier for seq_cst semantics). It means any reader *which reads that volatile and sees that value* is also guaranteed to see previous non-volatile stores. If your design relies on a volatile write to turn multiple previous stores into a single transaction, or become visible in program order, your design is broken. – Peter Cordes Jul 03 '22 at 10:40
  • Sounds like the store to the second MMF is what actually needs to be `volatile`. Or having release semantics some other way, like yes a release fence after other stores, before it. Of course your consumer process also needs to be using `volatile` loads, or otherwise getting acquire semantics, e.g. with an acquire barrier after a load, if ordering matters to it. You don't need to do anything to "push" stores out; a plain store / release-store will still become visible promptly. Unless Java makes it impossible to guarantee that a non-`volatile` store isn't optimized away to be done later. – Peter Cordes Jul 03 '22 at 10:44
  • Definitely not using volatile as an atomic-transaction-provider; only so that whatever is written before it gets visible to another process. Java NIO doesn't seem to provide a volatile feature, so I had to write to an independent volatile variable in the writer process after it writes to the MMF, and read from an independent volatile in the reader process before it reads from the MMF. I've used volatiles/memory barriers for inter-thread communication but never for inter-process communication. Looks like ordering and visibility guarantees are also available between two independent processes. – Abidi Jul 03 '22 at 11:07
  • But like I said, everything quickly becomes visible to another process anyway, volatile write or no, unless used in a loop or something. I guess writing a private `volatile` could be a clunky way to guard against that, used only for the "compiler barrier" aspect (like GNU C `asm("" ::: "memory")`, no instructions) of making sure previous stores actually executed in the asm, and thus will become globally visible within several tens of nanoseconds. One would hope there'd be something cheaper that didn't involve a useless (in this case) full memory barrier against run-time reordering. – Peter Cordes Jul 03 '22 at 11:40
  • I'd guess `volatile` having a compiler-barrier effect is a happens-to-work thing not fully guaranteed by the language standard, since that `volatile` read before your actual reads isn't actually syncing with any writer. But sure, if that's what you have to do to get more of a correctness guarantee in a language like Java where you can't just do an atomic-load on an arbitrary pointer. (Like you could in C++ or assembly.) – Peter Cordes Jul 03 '22 at 11:42
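
A minimal sketch of the release/acquire-fence pattern discussed in these comments, assuming one mapped file instead of two, a made-up path, and fixed offsets (flag at 0, payload at 8). Note the caveat raised above: `MappedByteBuffer` accesses are plain (racy) under the Java memory model, and a JIT could in principle hoist the flag load out of the polling loop, so treat this as an illustration of the pattern rather than a hard guarantee.

import java.io.IOException;
import java.lang.invoke.VarHandle;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmfIpcSketch {
    static MappedByteBuffer map() throws IOException {
        // Hypothetical file name; both processes map the same file.
        // The mapping stays valid after the channel is closed.
        try (FileChannel ch = FileChannel.open(Path.of("/tmp/ipc.dat"),
                StandardOpenOption.CREATE, StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            return ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
        }
    }

    // Producer: write the payload, then publish it behind a "ready" flag.
    static void produce(MappedByteBuffer buf) {
        buf.putLong(8, 42L);          // payload (plain store into the mapping)
        VarHandle.releaseFence();     // keep the payload store before the flag store
        buf.putInt(0, 1);             // flag the consumer polls
    }

    // Consumer: wait for the flag, then read the payload.
    static long consume(MappedByteBuffer buf) {
        while (buf.getInt(0) == 0) {  // plain load; re-read until the flag appears
            Thread.onSpinWait();
        }
        VarHandle.acquireFence();     // keep the flag load before the payload load
        return buf.getLong(8);
    }
}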

1 Answer


If I understood correctly, a stall/freeze in a pipeline would cause a clock cycle to go waste.

More or less correct. But it doesn't really matter for the rest of my answer.

Can a high-level language (Java) developer avoid CPU pipeline bubbles?

No. The Java developer is two levels of abstraction away from the (erm) problem. (Java compiles to bytecodes, and bytecodes compile to native code. The native code instruction sequence is what will determine if there are pipeline stalls, etc.)

However, there is some good news. The HotSpot JIT compiler that compiles the bytecodes to native code includes an optimizer which should take care of optimizing the instruction sequence. It will probably do a better job of this than you could do yourself by hand. And it doesn't consume hours / days of programmer time to do this. (Which will make your boss happy!)
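
If you'd rather verify that than take it on faith, HotSpot can show you its compilation decisions. A minimal sketch (the class and method names are invented; `-XX:+PrintCompilation` is a standard HotSpot flag, while `-XX:+PrintAssembly` additionally requires the hsdis disassembler plugin to be installed):

public class JitDemo {
    // Run with:  java -XX:+PrintCompilation JitDemo
    // To see the generated native code (needs hsdis):
    //   java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly JitDemo
    static int pay(int score) {
        return score * 100;          // HotSpot will inline and optimize this
    }

    public static void main(String[] args) {
        long sum = 0;
        for (int i = 0; i < 1_000_000; i++) {
            sum += pay(i % 6);       // hot loop: triggers JIT compilation
        }
        System.out.println(sum);
    }
}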

My advice: don't worry about it.

But if you have performance requirements that mean that you need to worry about it, my advice would be:

  • consider recoding key parts of your application in C, and using the best (commercial) optimizing C compiler that you can afford, OR
  • consider recoding in assembly language.

But bear in mind that neither of those will guarantee that your application is faster, after all the time and effort.

Stephen C
  • Got it!! Regarding the last part of my post (probably I'm being dumb): does it mean the performance penalty for an incorrect branch prediction, a memory barrier, and false sharing is far higher than stalling a pipeline? – Abidi Jun 04 '22 at 15:15
  • I don't know. Not my area of expertise. But it is irrelevant to your headline question. – Stephen C Jun 05 '22 at 02:06
  • @Abidi: All of those things *do* stall (part of) the pipeline. False sharing on some ISAs could "just" cause cache misses, which take so long that the CPU runs out of independent work to do given its limited lookahead window ([ROB size](https://blog.stuffedcow.net/2013/05/measuring-rob-capacity/)) so eventually stall the pipeline. On x86 it can be even worse: to maintain the LoadLoad ordering guarantees, Intel CPUs will nuke (not just stall) the pipeline, wiping out in-flight instructions, if they detect memory-order mis-speculation from doing a load early then having another core invalidate – Peter Cordes Jun 05 '22 at 03:48
  • And of course, these things are potentially different for different ISAs, etc ... – Stephen C Jun 05 '22 at 03:54
  • @Abidi: Modern OoO exec CPUs are remarkably robust to the mediocre assembly generated by JIT compilers in JVMs and especially JavaScript engines in web browsers. And even by ahead-of-time C and C++ compilers, which aren't always great either. Compilers don't have to work hard to schedule instructions because CPUs will do it at run-time. See [Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs](https://stackoverflow.com/q/37361145) – Peter Cordes Jun 05 '22 at 03:55
  • Data dependencies that exist in the source (especially for floating-point math) can lead to latency bottlenecks, though. Usually compilers don't introduce extra latency in the critical path of a loop-carried dependency (at least good ahead-of-time compilers, but not always JITs). So if latency bottlenecks were a problem, you can often see that from the source code. With FP stuff, the compiler isn't allowed to rearrange computation in ways that create different temporaries and thus rounding, but integer math is associative, which allows more optimization, like `sum1 += a[i];` `sum2 += a[i+1]` (sketched after these comments). – Peter Cordes Jun 05 '22 at 03:59
  • Prior to Java 17, the compiler had more latitude for reordering FP operations ... unless you used `strictfp`. In Java 17 and later, all floating point is implicitly FP-strict. See https://openjdk.java.net/jeps/306 – Stephen C Jun 05 '22 at 04:23
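
A sketch of the `sum1` / `sum2` idea from the comment above (method names invented for illustration): the single-accumulator loop is one long loop-carried dependency chain, while two accumulators give the CPU two independent add chains it can overlap. Whether HotSpot performs this reassociation on its own will vary, so measure before relying on it.

public class AccumulatorDemo {
    static long sumSingle(int[] a) {
        long sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i];              // every add waits for the previous one
        }
        return sum;
    }

    static long sumPair(int[] a) {
        long sum1 = 0, sum2 = 0;
        int i = 0;
        for (; i + 1 < a.length; i += 2) {
            sum1 += a[i];             // two independent dependency chains:
            sum2 += a[i + 1];         // these adds can execute in parallel
        }
        if (i < a.length) {
            sum1 += a[i];             // odd-length tail
        }
        return sum1 + sum2;
    }
}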