
I was reading about OoOE (out-of-order execution) and read about how we can solve false dependencies (by register renaming).

But my question is, how can we solve a true dependency (RAW, read after write)?

For example:

R1=R2+R3 #1
R1=R4+R5 #2
R9=R1 #3

Renaming won't be helpful here in case the CPU chooses to run #2 before #1.

John

1 Answer


There is no way to really avoid them; that's why RAW hazards are called true dependencies. Instructions have to wait for their inputs to be ready before they can execute. (With OoO exec, CPUs will normally dispatch the oldest-ready-first instructions / uops to execution units, for example on Intel CPUs.)
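
To make "oldest-ready-first" concrete, here's a toy simulation of that dispatch policy in C. It's only a sketch of the idea, not how any real scheduler is built: the one-uop-per-cycle limit, the 3-cycle latency for #2, and the extra independent instruction #4 are all made up so the simulation has something to overlap.

#include <stdio.h>
#include <stdbool.h>

#define NONE -1

struct uop {
    const char *text;
    int dep0, dep1;      /* indices of producer uops, or NONE */
    int latency;         /* cycles from dispatch until the result is ready */
    int ready_cycle;     /* cycle the result becomes available; -1 = not dispatched yet */
};

static bool input_ready(const struct uop *u, int dep, int cycle) {
    return dep == NONE || (u[dep].ready_cycle != -1 && u[dep].ready_cycle <= cycle);
}

int main(void) {
    /* The question's example, plus a made-up independent instruction #4.
     * #2 is given a pretend 3-cycle latency just to make the waiting visible. */
    struct uop u[] = {
        { "#1  R1 = R2 + R3", NONE, NONE, 1, -1 },
        { "#2  R1 = R4 + R5", NONE, NONE, 3, -1 },
        { "#3  R9 = R1",      1,    NONE, 1, -1 },  /* reads the R1 written by #2: true RAW */
        { "#4  R6 = R7 + R8", NONE, NONE, 1, -1 },  /* independent, invented for the example */
    };
    int n = sizeof u / sizeof u[0];
    int done = 0;

    for (int cycle = 0; done < n && cycle < 100; cycle++) {
        for (int i = 0; i < n; i++) {                 /* scan oldest-first */
            if (u[i].ready_cycle != -1)
                continue;                             /* already dispatched */
            if (input_ready(u, u[i].dep0, cycle) && input_ready(u, u[i].dep1, cycle)) {
                u[i].ready_cycle = cycle + u[i].latency;
                printf("cycle %d: dispatch %s (result ready in cycle %d)\n",
                       cycle, u[i].text, u[i].ready_cycle);
                done++;
                break;                                /* only 1 dispatch per cycle here */
            }
        }
    }
    return 0;
}

Run, this prints #1, #2 and the made-up #4 dispatching in cycles 0 to 2, while #3 only dispatches in cycle 4 once #2's result is ready: independent work goes around the true dependency, but nothing makes #3 start earlier.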

True dependencies aren't something you "solve" in the sense of making them go away; they're the essence of computation, the way multiple computations on the same numbers are glued together to form an algorithm. Other hazards (WAR and WAW) are just implementation details, from reusing the same architectural register for something different.
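
As a concrete illustration of that distinction, here's the question's example written as C, with a guess at how a renamer might assign physical registers shown in the comments (the physical-register names p10 / p11 are invented for illustration):

int example(int r2, int r3, int r4, int r5) {
    int r1, r9;
    r1 = r2 + r3;   /* #1: architectural R1 renamed to, say, p10                        */
    r1 = r4 + r5;   /* #2: architectural R1 renamed to p11, so the WAW hazard with #1   */
                    /*     is gone and #2 can run before or alongside #1                */
    r9 = r1;        /* #3: reads p11, a true (RAW) dependency on #2 that renaming       */
    return r9;      /*     can't remove; #3 still has to wait for #2's result           */
}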

Sometimes you can structure an algorithm to have shorter dependency chains, but once things are already nailed down into machine code, the CPU pretty much just has to respect them, with at best out-of-order exec to overlap independent dep chains.


For loads, there's value prediction in theory, but AFAIK no real CPU does it. Mispredictions are expensive, just like for branches, so you'd only want to consider it for high-latency stuff like loads that miss in cache; otherwise the gains won't outweigh the costs. (Even then it isn't done, because the gains don't outweigh the costs even for loads, including the power / area cost of building a predictor.) As Paul Clayton points out in a comment, branch prediction is itself a form of value prediction (predicting the condition a branch was based on).

The more instructions you can keep in flight at once with OoO exec, the more you stand to lose from mispredicts. But real CPUs do predict / speculate for memory disambiguation (whether a load is reloading an earlier store to a not-yet-known address or not), and, on CPUs like x86 with strongly-ordered memory models, they speculate that early loads will turn out to have been allowed; plus the well-known case of control dependencies (branch prediction + speculative execution).

The only thing that helps directly with dependency chains is keeping instruction latencies low. In your example, for instance, #3 is just a register copy, which modern x86 CPUs can do with zero latency (mov-elimination), so another instruction depending on R9 wouldn't have to wait an extra cycle beyond #2 producing a result: the copy is handled during register renaming instead of by an execution unit reading an input and producing an output the normal way.

Obviously, bypass forwarding from the outputs of execution units to the inputs of the same or other units is essential to keep latency low, same as in an in-order classic RISC pipeline.

Or, more conventionally, by improving execution units: the AMD Bulldozer family had 2-cycle latency for most SIMD integer instructions, but that improved to 1 cycle in AMD's next design, Zen. (Scalar integer stuff like add was always 1 cycle on any sane high-performance CPU.)


OoO exec with a large enough window size lets you overlap multiple dep chains in parallel (as in this experiment), and of course software should aim to have enough instruction-level parallelism (ILP) for the CPU to find and exploit. (See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) for an example of doing that by summing into multiple accumulators for a dot product.)
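
Here's a minimal sketch of that multiple-accumulator idea (not the code from the linked Q&A, just an illustration; it assumes n is a multiple of 4 to keep a cleanup loop out of the way):

#include <stddef.h>

/* One accumulator: every addition into sum extends a single loop-carried
 * dependency chain, so the loop runs at one element per FP-add latency.   */
float dot1(const float *a, const float *b, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Four accumulators: four independent chains the OoO core can overlap, so
 * execution-unit throughput becomes the limit instead of add latency.
 * Assumes n % 4 == 0 for brevity.                                         */
float dot4(const float *a, const float *b, size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (size_t i = 0; i < n; i += 4) {
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}

Note this changes the order of the FP additions, so the result can differ slightly from the sequential sum; that's why compilers won't make this transformation on their own without something like -ffast-math.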

This is also useful for in-order CPUs if done statically by a compiler, where techniques like "software pipelining" are a big deal for overlapping execution of multiple loop iterations, because the HW isn't finding that parallelism for you. Or for OoO exec CPUs with a limited window size, for loops with long but not loop-carried dependency chains within each iteration.
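
Here's a hand-rolled sketch of that software-pipelining shape in C (expensive() is a made-up stand-in for some latency-bound computation; real software pipelining is normally done by a compiler on the machine code, this just shows the transformation):

#include <stddef.h>

/* Made-up placeholder for work with non-trivial latency. */
static long expensive(long x) { return x * x + 1; }

/* Naive: each iteration loads p[i] and immediately uses it, so the load's
 * latency sits on the critical path of every iteration.                   */
long sum_naive(const long *p, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += expensive(p[i]);
    return total;
}

/* Software-pipelined: the load for iteration i+1 is issued while the value
 * from iteration i is still being worked on, with a prologue and epilogue
 * for the first and last elements.                                         */
long sum_pipelined(const long *p, size_t n) {
    if (n == 0) return 0;
    long total = 0;
    long cur = p[0];                 /* prologue: first load                */
    for (size_t i = 1; i < n; i++) {
        long next = p[i];            /* start the next load early           */
        total += expensive(cur);     /* work on the previously loaded value */
        cur = next;
    }
    total += expensive(cur);         /* epilogue: last element              */
    return total;
}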


As long as you're bottlenecked on something other than latency / dependency chains, true dependencies aren't the problem. For example, a front-end bottleneck (ideally maxed out at the pipeline width), and/or all relevant back-end execution units being busy every cycle, means you couldn't get more work through the pipeline even if it were all independent.

Of course, in a lot of code there are enough dependencies, including through memory, to not reach that ideal situation.

Simultaneous Multithreading (SMT) can help to keep the back-end fed with work to do, increasing throughput by having the front-end of one physical core read multiple instruction streams, acting as multiple logical cores. This effectively creates ILP out of thread-level parallelism, which is useful if software can scale efficiently to more threads, exposing enough TLP to keep all the logical cores busy.


Peter Cordes
  • Sorry, I still didn't get an answer: how does the pipeline handle a true dependency? Does it stall or what? – John Feb 07 '22 at 08:58
  • @John: An out-of-order exec pipeline only stalls the front-end if the ROB or RS are full. If any independent work is in the back-end, it can execute (usually in oldest-ready-first order). Dependent instructions have to wait for their inputs to be ready, of course, but that only fully stalls the pipeline on an in-order machine; you were asking about OoO exec. Bypass forwarding (https://en.wikipedia.org/wiki/Classic_RISC_pipeline#Solution_A._Bypassing) is part of keeping latency low, including in a classic RISC style in-order pipeline. – Peter Cordes Feb 07 '22 at 10:50
  • In general, see **[Modern Microprocessors A 90-Minute Guide!](http://www.lighterra.com/papers/modernmicroprocessors/)** – Peter Cordes Feb 07 '22 at 10:51
  • @John: For Intel x86 CPUs specifically, [How are x86 uops scheduled, exactly?](https://stackoverflow.com/q/40681331) gets into detail about how OoO exec handles uops that are/aren't ready to execute, although the example in the question doesn't have a long dep chain. – Peter Cordes Feb 07 '22 at 10:57
  • Value prediction is also theoretically possible. (Branch direction and target prediction could be considered simple value prediction. Predicate prediction would have similar overheads. Memory model speculation might be viewed as value prediction. In taxonomy I am a lumper.) – Paul A. Clayton Feb 07 '22 at 13:36
  • @PaulA.Clayton: Indeed, I did already mention value prediction here. It kind of got buried after I expanded the opening section to discuss how true dependencies aren't something you can just side-step. – Peter Cordes Feb 07 '22 at 13:39
  • Oops. I should have read more carefully. – Paul A. Clayton Feb 07 '22 at 13:56