
I'm trying to understand how the "fetch" phase of the CPU pipeline interacts with memory.

Let's say I have these instructions:

```
4:  bb 01 00 00 00          mov    $1,%ebx
9:  bb 02 00 00 00          mov    $2,%ebx
e:  b3 03                   mov    $3,%bl
```

What happens if CPU1 writes 00 48 c7 c3 04 00 00 00 to memory address 8 (i.e. 64-bit aligned) while CPU2 is executing these same instructions? The instruction stream would atomically change from 2 instructions to 1 like this:

```
4:  bb 01 00 00 00          mov    $1,%ebx
9:  48 c7 c3 04 00 00 00    mov    $4,%rbx
```

Since CPU1 is writing to the same memory that CPU2 is reading from, there's contention. Would the write cause the CPU2 pipeline to stall while it refreshes its L1 cache? Suppose CPU2 has just completed the "fetch" phase for mov $2: would that fetch be discarded in order to re-fetch the updated memory?
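(To make the writer's side concrete, here's a minimal C sketch of what CPU1 could do, assuming the code page is already mapped writable and executable and that `patch_site` points at the 8-byte-aligned address 8 from the listing; the function name is made up.)

```c
#include <stdint.h>

/* Hypothetical: patch_site == address 8, which is 8-byte aligned. */
void patch(uint64_t *patch_site)
{
    /* New bytes 00 48 c7 c3 04 00 00 00 in address order; as a
     * little-endian u64 the lowest address is the least significant
     * byte, so the constant reads "backwards". */
    uint64_t new_bytes = 0x00000004c3c74800ull;

    /* A single aligned 8-byte store, so no core can observe a
     * half-written instruction (release ordering is a conservative pick). */
    __atomic_store_n(patch_site, new_bytes, __ATOMIC_RELEASE);
}
```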

Additionally there's the issue of atomicity when changing 2 instructions into 1.

I found a quite old document that mentions "The instruction fetch unit fetches one 32-byte cache line in each clock cycle from the instruction cache memory", which I think can be interpreted to mean that each instruction gets a fresh copy of the cache line from L1, even when consecutive instructions share a cache line. But I don't know if/how this applies to modern CPUs.

If the above is correct, that would mean that after fetching mov $2 into the pipeline, it's possible the next fetch would get the updated bytes at address e and try to execute 00 00 (add %al,(%rax)), which would probably fail.

But if the fetch of mov $2 brings mov $3 into an "instruction cache", would it make sense to think that the next fetch would just get the instruction from that cache (and return mov $3) without re-querying L1? This would effectively make the fetch of these 2 instructions atomic, as long as they share a cache line.

So which is it? Basically there are too many unknowns and too much I can only speculate about, so I'd really appreciate a cycle-by-cycle breakdown of how 2 fetch phases of the pipeline interact with (changes in) the memory they access.

Daniel
  • This is all implementation-dependent. Different processors deal with the situation differently. – Raymond Chen Jun 15 '21 at 15:00
  • For a core modifying *its own* code, see: [Observing stale instruction fetching on x86 with self-modifying code](https://stackoverflow.com/q/17395557) - that's different (and harder) because out-of-order exec of the store has to be sorted out from code-fetch of earlier vs. later instructions in program order. i.e. the moment at which the store must become visible is fixed, unlike with another core where it just happens when it happens. – Peter Cordes Jun 15 '21 at 16:23

2 Answers


It varies between implementations, but generally, this is managed by the cache coherency protocol of the multiprocessor. In simplest terms, what happens is that when CPU1 writes to a memory location, that location will be invalidated in every other cache in the system. So that write will invalidate the line in CPU2's instruction cache as well as any (partially) decoded instructions in CPU2's uop cache (if it has such a thing). So when CPU2 goes to fetch/execute the next instruction, all those caches will miss and it will stall while things are refetched. Depending on the cache coherency protocol, that may involve waiting for the write to get to memory, or may fetch the modified data directly from CPU1's dcache, or things might go via some shared cache.
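As an illustration, here's a deliberately oversimplified toy model of that invalidation step, assuming a plain MESI-style protocol (real protocols such as MESIF or MOESI differ in the details, and all names here are made up):

```c
#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } LineState;

/* What happens to every *other* core's copy of a line when one core
 * issues a Read-For-Ownership so it can write: the copy is invalidated
 * (a MODIFIED copy would be written back or forwarded first). */
static LineState on_remote_rfo(LineState s)
{
    (void)s;
    return INVALID;
}

int main(void)
{
    /* CPU2's I-cache holds the line (shared) while fetching from it. */
    LineState cpu2_icache_line = SHARED;

    /* CPU1's store triggers an RFO; CPU2's copy goes INVALID, so
     * CPU2's next fetch from that line misses and must re-request it. */
    cpu2_icache_line = on_remote_rfo(cpu2_icache_line);
    printf("CPU2 I-cache line is now %s\n",
           cpu2_icache_line == INVALID ? "INVALID" : "still valid");
    return 0;
}
```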

Chris Dodd
  • Indeed. But unlike [Observing stale instruction fetching on x86 with self-modifying code](https://stackoverflow.com/q/17395557), it *doesn't* have to invalidate already-fetched instructions in the pipeline (no pipeline nuke). I-fetch happens in-order, so seeing it or not is just a matter of fetching before or after this core had its copy of the cache line invalidated. Note that x86 has coherent I-cache, but some other ISAs don't. At least on the core doing the stores, the I-cache needs to be invalidated (and maybe D-cache written back to a shared outer level) so fetch can see it. – Peter Cordes Jun 15 '21 at 16:27
  • Re: cache-to-cache transfers: a more common mechanism is write-back to a level of cache shared by both cores. That's L3 on modern Intel / AMD CPUs. Cache-to-cache transfers are also a thing, e.g. between CCXs on Zen, or between sockets on multi-socket systems (in both cases, between L3 caches). Modern multi-core CPUs certainly avoid write-back to DRAM for data shared between cores; inter-core latency is too important for a round trip to DRAM. It's theoretically possible in a low-performance design, though. – Peter Cordes Jun 15 '21 at 16:31

As Chris said, an RFO (Read For Ownership) can invalidate an I-cache line at any time.

Depending on how superscalar fetch-groups line up, the cache line can be invalidated after fetching the 5-byte mov at 9: but before fetching the next instruction at e:.

When fetch eventually happens (this core gets a shared copy of the cache line again), RIP = e and it will fetch the last 2 bytes of the mov $4,%rbx. Cross-modifying code needs to make sure that no other core is executing in the middle of where it wants to write one long instruction.

In this case, you'd get 00 00 (add %al,(%rax)).
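A quick sanity check of that byte arithmetic (a self-contained sketch; the array simply mirrors the patched listing above):

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    const uint8_t patched[16] = {
        0x00, 0x00, 0x00, 0x00,                   /* 0x0..0x3: not shown above */
        0xbb, 0x01, 0x00, 0x00, 0x00,             /* 0x4: mov $1,%ebx */
        0x48, 0xc7, 0xc3, 0x04, 0x00, 0x00, 0x00, /* 0x9: mov $4,%rbx */
    };

    /* Resuming fetch at RIP = 0xe lands 5 bytes into the 7-byte mov;
     * the two remaining bytes are 00 00 = add %al,(%rax). */
    printf("bytes at 0xe: %02x %02x\n", patched[0xe], patched[0xf]);
    return 0;
}
```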

Also note that the writing CPU needs to make sure the modification is atomic, e.g. with an 8-byte store (Intel P6 and later CPUs guarantee that stores up to 8 bytes at any alignment within 1 cache line are atomic; AMD doesn't), or lock cmpxchg or lock cmpxchg16b. Otherwise it's possible for a reader to see partially updated instructions. You can consider instruction-fetch to be doing atomic 16-byte loads or something like that.
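For illustration, here's a hedged sketch of the writer's side using GCC/Clang's `__atomic_compare_exchange_n` builtin (which compiles to lock cmpxchg for 8-byte operands on x86-64); `patch_atomically` and its parameters are hypothetical names:

```c
#include <stdbool.h>
#include <stdint.h>

bool patch_atomically(uint64_t *site, uint64_t old_bytes, uint64_t new_bytes)
{
    /* The 8 bytes must not straddle a cache-line boundary; requiring
     * 8-byte alignment is the simplest way to guarantee that. */
    if ((uintptr_t)site % 8 != 0)
        return false;

    /* lock cmpxchg: patch only if the old instruction bytes are still
     * in place, and do it as one atomic 8-byte write. */
    return __atomic_compare_exchange_n(site, &old_bytes, new_bytes,
                                       /*weak=*/false,
                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}
```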


"The instruction fetch unit fetches one 32-byte cache line in each clock cycle from the instruction cache memory" which I think can be interpreted to mean that each instruction gets a fresh copy of the cache line from L1,

No.

That wide fetch block is then decoded into multiple x86 instructions! The point of wide fetch is to pull in multiple instructions at once, not to redo it separately for each instruction. That document seems to be about P6 (Pentium III), although P6 only does 16 bytes of actual fetch at once, into a 32-byte buffer from which the CPU takes a 16-byte window.

P6 is 3-wide superscalar, and every clock cycle can decode up to 16 bytes of machine code containing up to 3 instructions. (But there's a pre-decode stage to find instruction lengths first...)
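As a toy model (not any real microarchitecture), here's a sketch of the idea that one 16-byte window is fetched once and then carved into up to 3 instructions by length pre-decode; `insn_length` is a made-up stand-in for real x86 length decoding:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical length decoder: just enough for the movs above. */
static size_t insn_length(const uint8_t *bytes)
{
    switch (bytes[0]) {
    case 0xbb: return 5;   /* mov $imm32,%ebx */
    case 0xb3: return 2;   /* mov $imm8,%bl   */
    case 0x48: return 7;   /* REX.W mov $imm32,%rbx (for this example) */
    default:   return 1;
    }
}

int main(void)
{
    /* The 16-byte fetch window starting at address 4 in the question;
     * it is read from the I-cache once, not once per instruction. */
    const uint8_t window[16] = { 0xbb, 1, 0, 0, 0,
                                 0xbb, 2, 0, 0, 0,
                                 0xb3, 3 };
    size_t pos = 0;

    /* One "cycle": hand up to 3 instructions from the window to decode. */
    for (int slot = 0; slot < 3 && pos < sizeof window; slot++) {
        size_t len = insn_length(&window[pos]);
        printf("decode slot %d: insn at offset %zu, %zu bytes\n",
               slot, pos, len);
        pos += len;
    }
    return 0;
}
```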

See Agner Fog's microarch guide (https://agner.org/optimize/) for details, with a focus on the details that matter for tuning software performance. Later microarchitectures add queues between pre-decode and decode; see the relevant sections of Agner Fog's guide, and https://realworldtech.com/merom/ (Core 2).

And of course see https://realworldtech.com/sandy-bridge for more modern x86 with a uop cache. Also https://en.wikichip.org/wiki/amd/microarchitectures/zen_2#Core for recent AMD.

For good background before reading any of those, see Modern Microprocessors: A 90-Minute Guide! (https://www.lighterra.com/papers/modernmicroprocessors/).

Peter Cordes
  • Ah, so the fetch stage operates on cache lines, and is decoupled from individual instructions. Unlike a classic RISC pipeline. Now it all makes a lot more sense. Thank you so much for the detailed answer and the wealth of informative links! – Daniel Jun 15 '21 at 17:48
  • @Daniel: A superscalar RISC pipeline would also do wider fetch, and decode that into 2 or 4 instructions. Also note that Intel P6 *doesn't* actually do 32-byte wide fetches, just 16. (Even current Intel only fetches 16 bytes at a time, so it depends on the uop cache to go faster than that, e.g. in regions of code with large average instruction size.) AMD does fetch 32 bytes at a time, IIRC, but they were later to adopt a uop cache. Also, modern x86 has 64-byte wide cache lines. So don't think of it as "whole line" fetch, just "wide fetch", and decode that block or until a branch. – Peter Cordes Jun 15 '21 at 17:55