
On the one hand, Wikipedia describes the steps of out-of-order execution:

  1. Instruction fetch.
  2. Instruction dispatch to an instruction queue (also called instruction buffer or reservation stations).
  3. The instruction waits in the queue until its input operands are available. The instruction is then allowed to leave the queue before earlier, older instructions.
  4. The instruction is issued to the appropriate functional unit and executed by that unit.
  5. The results are queued.
  6. Only after all older instructions have their results written back to the register file, then this result is written back to the register file. This is called the graduation or retire stage.

Similar information can be found in the "Computer Organization and Design" book:

To make programs behave as if they were running on a simple in-order pipeline, the instruction fetch and decode unit is required to issue instructions in order, which allows dependences to be tracked, and the commit unit is required to write results to registers and memory in program fetch order. This conservative mode is called in-order commit... Today, all dynamically scheduled pipelines use in-order commit.

So, as far as I understand, even if instructions are executed out of order, the results of their execution are preserved in the reorder buffer and then committed to memory/registers in a deterministic order.

On the other hand, it is well known that modern CPUs can reorder memory operations for performance reasons (for example, two adjacent independent load instructions can be reordered). Wikipedia writes about it here.

Could you please shed some light on this discrepancy?

  • The word "commit" is actually a bit fuzzy. If you take its definition literally, then there's almost no room for any sort of OOE. You don't need to wait for an instruction to "commit" before you can use its result. I'm unclear on exactly how it works internally. And it almost certainly is very intertwined with speculation recovery from branch prediction and memory disambiguation. – Mysticial Sep 23 '16 at 22:10
  • At the very least, each instruction will have multiple "commit"-like phases: 1) When the output is ready to use for another instruction. 2) When the instruction is no longer in speculation. 3) When the instruction is removed from the reorder buffer. Your example with loads isn't restricted to loads, but pretty much *any* instruction that writes to a register. – Mysticial Sep 23 '16 at 22:11
  • @Mysticial: I'm pretty sure "commit" is being used here as a synonym for "retire". It can only happen when an instruction is done executing, and when it's known to be non-speculative (i.e. when all preceding instructions are retired without faulting). – Peter Cordes Sep 23 '16 at 22:36
  • BTW, there has been some research into out-of-order retirement, while still having precise exceptions by using checkpoints to roll back when exceptions are detected. e.g. this paper about Kilo-Instruction processors http://csl.cornell.edu/~martinez/doc/taco04.pdf is interesting. (Kilo-instruction as in out-of-order reordering window equivalent to a ROB-size of 1k, allowing that big a window for hiding cache-miss latency without actually being impractical to build). @Mysticial – Peter Cordes Sep 23 '16 at 22:40

1 Answer


TL:DR: memory ordering is not the same thing as out-of-order execution. Memory reordering happens even on in-order pipelined CPUs.

In-order commit is necessary1 for precise exceptions that can roll back to exactly the instruction that faulted, without any instructions after it having already retired. The cardinal rule of out-of-order execution is: don't break single-threaded code. If you allowed out-of-order commit (retirement) without any other mechanism, you could have a page fault happen while some later instructions had already executed, and/or while some earlier instructions hadn't executed yet. This would make restarting execution after handling the page fault impossible in the normal way.

(In-order issue/rename and dependency-tracking takes care of correct execution in the normal case of no exceptions.)

Memory ordering is all about what other cores see. Also notice that what you quoted is only talking about committing results to the register file, not to memory.

(Footnote 1: Kilo-instruction Processors: Overcoming the Memory Wall is a theoretical paper about checkpointing state to allow rollback to a consistent machine state at some point before an exception, allowing much larger out-of-order windows without a gigantic ROB of that size. AFAIK, no mainstream commercial designs have used that, but it shows that there are in theory approaches other than strictly in-order retirement to building a usable CPU.

Apple's M1 reportedly has a significantly larger out-of-order window than its x86 contemporaries, but I haven't seen any definite info that it uses anything other than a very large ROB.)


Since each core's private L1 cache is coherent with all the other data caches in the system, memory ordering is a question of when instructions read or write cache. This is separate from when they retire from the out-of-order core.

Loads become globally visible when they read their data from cache. This is more or less when they "execute", and definitely way before they retire (aka commit).

Stores become globally visible when their data is committed to cache. This has to wait until they're known to be non-speculative, i.e. that no exceptions or interrupts will cause a roll-back that has to "undo" the store. So a store can commit to L1 cache as early as when it retires from the out-of-order core.

But even in-order CPUs use a store queue or store buffer to hide the latency of stores that miss in L1 cache. The out-of-order machinery doesn't need to keep tracking a store once it's known that it will definitely happen, so a store insn/uop can retire even before it commits to L1 cache. The store buffer holds onto it until L1 cache is ready to accept it. i.e. when it owns the cache line (Exclusive or Modified state of the MESI cache coherency protocol), and the memory-ordering rules allow the store to become globally visible now.

See also my answer on Write Allocate / Fetch on Write Cache Policy

As I understand it, a store's data is added to the store queue when it "executes" in the out-of-order core, and that's what a store execution unit does. (Store-address writing the address, and store-data writing the data into the store-buffer entry reserved for it at allocation/rename time, so either of those parts can execute first on CPUs where those parts are scheduled separately, e.g. Intel.)

Loads have to probe the store queue so that they see recently-stored data.


For an ISA like x86, with strong ordering, the store queue has to preserve the memory-ordering semantics of the ISA. i.e. stores can't reorder with other stores, and stores can't become globally visible before earlier loads. (LoadStore reordering isn't allowed (nor is StoreStore or LoadLoad), only StoreLoad reordering).

David Kanter's article on how TSX (transactional memory) could be implemented in different ways than what Haswell does provides some insight into the Memory Order Buffer, and how it's a separate structure from the ReOrder Buffer (ROB) that tracks instruction/uop reordering. He starts by describing how things currently work, before getting into how it could be modified to track a transaction that can commit or abort as a group.

    _In-order commit makes the current core's own code see itself as running in-order_; Are you referring to current _thread's_ own code? In which case a thread should see its own code in-order even if instructions retired OoO (respecting RAW/WAR/WAW hazards). Two threads running on same core would not observe reordering by virtue of in-order commit (plus precise interrupts / context switch) – Daniel Nitzan Jan 04 '21 at 20:58
    @DanielNitzan: True, that's misleading phrasing. In-order *issue* takes care of correctness for the no-exceptions case. The actual problem with out-of-order commit would be that you'd lose precise exceptions: e.g. seeing a page-fault happen before some earlier instructions execute, or after some later instructions execute. (Which would make restarting execution after a page fault impossible, which is why nobody builds normal CPUs this way...) – Peter Cordes Jan 04 '21 at 21:14
  • @DanielNitzan: Updated, thanks for catching that mistake in this older answer. – Peter Cordes Jan 04 '21 at 21:31