
Firstly, when the IFU issues a request for 16 bytes, is the interaction with the L1i arranged so that, when the L1i receives an address from the IFU, it will subsequently produce 16 bytes in succession -- or does the IFU have to send the addresses of all 16 bytes, as with a traditional cache access?

To get to the point, assume the IFU is fetching instructions at 16B-aligned boundaries and suddenly the virtual index (and I'm assuming the index is indeed logical virtual and not linear virtual -- not entirely sure; I know that for the L1d the AGU handles the segmentation offsets) misses in the L1i cache.
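For concreteness, here is the index arithmetic I'm assuming (32 KiB, 8-way, 64 B lines -- the figures usually cited for Skylake's L1i; the geometry is my assumption, not something I've verified):

```python
# Assumed Skylake-like L1i geometry: 32 KiB, 8-way, 64-byte lines.
LINE_BYTES = 64
WAYS = 8
SIZE = 32 * 1024
SETS = SIZE // (LINE_BYTES * WAYS)   # 64 sets

def l1i_set_index(addr: int) -> int:
    # Index bits are addr[11:6]; they all fall within the 4 KiB page
    # offset, so the virtual and physical address give the same index.
    return (addr // LINE_BYTES) % SETS

# Two addresses in the same 64-byte line map to the same set:
assert l1i_set_index(0x7f001234) == l1i_set_index(0x7f001230)
```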

What would happen exactly? (Note: example CPU Skylake with a ring bus topology)

Would the front end be shut down once the decoders finish decoding whatever came before the miss, and how would this be done? Secondly, what sort of negotiation / conversation is there between the IFU and the L1i cache? When there is a miss, must the cache inform the IFU so that it stops fetching instructions? Perhaps the cache waits to receive the data from lower down and, as soon as it does, issues it to the IFU -- or does the IFU wait in a spin-lock state and keep attempting the read?

Let's assume that the data it wants is on a DDR4 module and not in the cache subsystem at all -- possible if an erratic program is causing difficulties for the hardware prefetchers. I'd like to get the process clear in my mind.

  • L1I cache miss, ITLB hit.
  • L1I cache controller allocates a line fill buffer
  • L1I cache controller issues a request to the L2 and passes the physical address to it (these operations don't clash with the hardware prefetchers' operations because all cache accesses must be sequential or queued, I'd imagine)
  • L2 miss, passes address to LLC slice
  • LLC slice miss
  • Caching agent sends address to the home agent
  • Home agent detects no cores with the data
  • Home agent sends the address to the memory controller
  • Memory controller converts address to (channel, dimm, rank, IC, chip, bank group, bank, row, column) tuple and does the relevant mapping, interleaving, command generation etc.
  • Now, since it's DDR4, it's going to return 128 bytes; but for simplification, assume it's DDR3, so 64 bytes. The 64 bytes are sent back to the home agent. I assume this is all kept in queue order, so the home agent knows which address the data corresponds to.
  • The home agent sends the data to the caching agent; again, I assume the caching agent keeps some backlog of misses so it knows the data needs to be sent higher.
  • The data is passed to the L2 -- I don't know how the L2 knows it needs to go higher, but there you go.
  • The L2 controller passes the line to the L1i, which, again, somehow knows which line fill buffer to enter the requested cache line into and that it requires an F tag (forwarding).
  • The IFU either picks it up in its spin-lock state, or some negotiation takes place with the IFU.
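Ignoring the timing and the agents, the walk above amounts to a chain of lookups with write-allocate fills on the way back up. A toy sketch of how I picture it (the level names and the 64-byte line size are just my assumptions):

```python
LINE = 64

def line_addr(paddr):
    # Requests below the core are for whole aligned lines.
    return paddr & ~(LINE - 1)

# Each level modelled as a dict of resident lines, checked inner to outer.
hierarchy = [("L1i", {}), ("L2", {}), ("L3 slice", {})]

def fetch(paddr):
    addr = line_addr(paddr)
    path = []
    for name, cache in hierarchy:
        if addr in cache:
            path.append(f"{name} hit")
            break
        path.append(f"{name} miss")
    else:
        path.append("DRAM burst (64 bytes)")
    # Write-allocate: every level that missed gets the line on the way up.
    for name, cache in hierarchy:
        cache.setdefault(addr, b"\0" * LINE)
    return path

assert fetch(0x1000)[-1] == "DRAM burst (64 bytes)"
assert fetch(0x1010) == ["L1i hit"]   # same line, now resident everywhere
```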

If anyone has some more information on this process and can enlighten me further, please let me know.

Lewis Kelsey
  • All split / unaligned load handling is done inside the L1i cache. But I *think* instruction fetch on Intel CPUs is done in aligned 16-byte blocks. There are queues between later stages that allow grouping into unaligned chunks, though, so maybe only L1d has to deal with unaligned 16-byte / 32-byte loads. Outer caches will only see requests for whole lines, so the ring-bus interconnect between cores doesn't matter at all. The DRAM interface is also irrelevant. (I guess you *could* run code from an uncacheable memory region, but you're asking about caches.) – Peter Cordes Nov 05 '18 at 05:45
  • Intel's L1 caches are VIPT: the tags are physical, not virtual. https://www.realworldtech.com/sandy-bridge/7/ and https://www.realworldtech.com/haswell-cpu/6/. The uop-cache is virtually addressed, though. See Agner Fog's microarch guide (https://agner.org/optimize/), and other links in the x86 tag wiki (https://stackoverflow.com/tags/x86/info). – Peter Cordes Nov 05 '18 at 05:52
  • seg:off -> linear virtual translation happens before anything else. For data loads, it costs an extra cycle of latency if the segment base is non-zero. I assume instruction-cache loads are similar: the input to uop-cache checks and L1i + L1iTLB is a linear virtual address. – Peter Cordes Nov 05 '18 at 05:53
  • @PeterCordes Sorry, meant virtual index. I'll correct. – Lewis Kelsey Nov 05 '18 at 05:59

1 Answer


Interesting question once you get past some of the misconceptions (see my comments on the question).

Fetch/decode happens strictly in program order. There's no mechanism for decoding a block from a later cache line while waiting on an L1i miss, not even to populate the uop-cache. My understanding is that the uop-cache is only ever populated with instructions the CPU expects to actually execute along the current path of execution.

(x86's variable-length instructions mean that you need to know an instruction boundary before you can even start decoding. This could be possible if branch-prediction says the cache-miss instruction block will branch somewhere in another cache line, but current hardware isn't built that way. There's nowhere to put the decoded instructions where the CPU could come back and fill in the gap.)


There's hardware prefetching into L1i (which I assume does take advantage of branch prediction to know where to branch next even if the current fetch is blocked on a cache miss), so code-fetch can generate multiple outstanding loads in parallel to keep the memory pipeline better occupied.

But yes, an L1i miss creates a bubble in the pipeline which lasts until data arrives from L2. Every core has its own private per-core L2 which takes care of sending requests off-core if it misses in L2. WikiChip shows the data path between L2 and L1i is 64 bytes wide in Skylake-SP.

https://www.realworldtech.com/haswell-cpu/6/ shows L2<->L1d is 64 bytes wide in Haswell and later, but doesn't show as much detail for instruction-fetch. (Which is often not a bottleneck, especially for small to medium-sized loops that hit in the uop cache.)

There are queues between fetch, pre-decode (instruction boundaries) and full decode, which can hide / absorb these bubbles and sometimes stop them from reaching the decoders and actually hurting decode throughput. And there's a larger queue (64 uops on Skylake) that feeds the issue/rename stage, called the IDQ. Instructions are added to the IDQ from the uop cache or from legacy decode. (Or when a microcode-indirect uop for an instruction that takes more than 4 uops reaches the front of the IDQ, issue/rename fetches directly from the microcode sequencer ROM, for instructions like rep movsb or lock cmpxchg.)
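A toy model shows how such a queue absorbs a short fetch bubble. (The widths are illustrative only, not real Skylake numbers: fetch delivers 4 uops per unstalled cycle, decode drains only 2 per cycle, so a backlog builds up in the queue.)

```python
from collections import deque

FETCH_W = 4    # uops delivered per unstalled fetch cycle (illustrative)
DECODE_W = 2   # uops drained per cycle (illustrative)

def run(stall_cycles, total_uops):
    """Count cycles where decode had nothing to do (bubbles)."""
    q = deque()
    fetched = retired = cycle = bubbles = 0
    while retired < total_uops:
        # Decode consumes what earlier cycles left in the queue.
        n = min(DECODE_W, len(q))
        if n == 0:
            bubbles += 1
        for _ in range(n):
            q.popleft()
            retired += 1
        # Fetch appends this cycle's block, unless stalled on a miss.
        if cycle not in stall_cycles and fetched < total_uops:
            k = min(FETCH_W, total_uops - fetched)
            q.extend(range(fetched, fetched + k))
            fetched += k
        cycle += 1
    return bubbles

assert run(set(), 16) == 1            # only the startup bubble
assert run({2}, 16) == 1              # a 1-cycle fetch stall is fully hidden
assert run({2, 3, 4, 5, 6}, 16) > 1   # a long stall drains the queue
```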

But when a stage has no input data, yes it powers down. There's no "spin-lock"; it's not managing exclusive access to a shared resource, it's simply waiting based on a flow-control signal.

This also happens when code fetch hits in the uop cache: the legacy decoders can power down as well. Power saving is one of the benefits of the uop cache, and the loopback buffer in turn saves power by letting the uop cache idle.


L1I cache controller allocates a line fill buffer

L2->L1i uses different buffers than the 10 LFBs that L1d cache / NT stores use. Those 10 are dedicated to the connection between L1d and L2.

The Skylake-SP block diagram on WikiChip shows a 64-byte data path from L2 to L1i, separate from the L2->L1d with its 10 LFBs.

L2 has to manage having multiple readers and writers (the L1 caches, and data to/from L3 via its SuperQueue buffers). @HadiBrais comments that we know the L2 can handle 2 hits per clock cycle, but how many misses per cycle it can handle / generate L3 requests for is less clear.

Hadi also commented: The L2 has one read 64-byte port for the L1i and one bidirectional 64-byte port with the L1d. It also has a read/write port (64-byte in Skylake, 32-byte in Haswell) with the L3 slice it is connected to. When the L2 controller receives a line from the L3, it immediately writes it into the corresponding superqueue entry (or entries).

I haven't checked a primary source for this, but it sounds right to me.


Fetch from DRAM happens with burst transfers of 64 bytes (1 cache line) at once. Not just 16 bytes (128 bits)! It's possible to execute code from an "uncacheable" memory region, but normally you're using WB (write-back) memory regions that are cacheable.

AFAIK, even DDR4 has a 64-byte burst size, not 128 bytes.
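The arithmetic behind the 64-byte figure, assuming a standard 64-bit channel and burst length 8 (BL8, the fixed burst length for both DDR3 and DDR4):

```python
bus_width_bits = 64   # one 64-bit beat transferred per memory-bus edge
burst_length = 8      # BL8: 8 beats per burst
burst_bytes = bus_width_bits // 8 * burst_length
assert burst_bytes == 64   # exactly one cache line per burst
```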

I assume this is all kept in queue order, so the home agent knows what address the data corresponds to.

No, the memory controller can reorder requests for locality within a DRAM page (not the same thing as a virtual-memory page).

Data going back up the memory hierarchy has an address associated with it. It gets cached by L3, and L2, because they have a write-allocate cache policy.

When it arrives at L2, the outstanding request buffer (from L1i) matches the address, so L2 forwards that line to L1i. Which in turn matches the address and wakes up the instruction-fetch logic that was waiting.
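A sketch of that address matching: an outstanding-request buffer keyed by line address, which also coalesces a second miss to the same line so only one request goes down the hierarchy. (The class and method names here are illustrative, not real hardware terminology.)

```python
LINE = 64

class FillBuffers:
    """Toy outstanding-request buffer for one cache level."""
    def __init__(self):
        self.pending = {}   # line address -> list of waiters

    def miss(self, paddr, waiter):
        addr = paddr & ~(LINE - 1)
        first = addr not in self.pending   # only the first miss goes below
        self.pending.setdefault(addr, []).append(waiter)
        return first

    def fill(self, addr, data):
        # An incoming line matches purely by address and wakes every
        # requester queued on that line.
        return [(w, data) for w in self.pending.pop(addr, [])]

fb = FillBuffers()
assert fb.miss(0x1008, "IFU")             # allocate; send a request down
assert not fb.miss(0x1030, "prefetcher")  # same line: coalesced, no new request
assert [w for w, _ in fb.fill(0x1000, b"...")] == ["IFU", "prefetcher"]
```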

@HadiBrais commented: Requests at the L2 need to be tagged with the sender ID. Requests at the L3 need to be tagged with yet another sender ID. The requests at the L1I need not be tagged.

Hadi also discussed the fact that L3 needs to deal with requests from multiple cores per cycle. The ring bus architecture in CPUs before Skylake-SP / SKX meant that at most 3 requests could arrive at a single L3 slice per clock (one in each direction on the ring, and one from the core attached to it). If they were all for the same cache line, it would definitely be advantageous to satisfy them all with a single fetch from this slice, so this might be something that L3 cache slices do.


See also Ulrich Drepper's What Every Programmer Should Know About Memory? for more about cache and especially about DDR DRAM. Wikipedia's SDRAM article also explains how burst transfers of whole cache lines from DRAM work.

I'm not sure whether Intel CPUs actually pass along an offset within a cache line for critical-word-first and early-restart back up the cache hierarchy. I'd guess not, because some of the closer-to-the-core data paths are much wider than 8 bytes, 64 bytes wide in Skylake.

See also Agner Fog's microarch pdf (https://agner.org/optimize/), and other links in the x86 tag wiki.

Peter Cordes
  • It totally makes sense the fetch being in 64 bytes, that solves the scenario where 128 bits are fetched for an address and this leaves a 3/4 empty cache line. Obviously there is some negotiation to fetch 3 more addresses with every address so that the original address ends up being fetched along with its 64 byte aligned chunk. Perhaps the memory controller does this itself in command generation. I am still not sure of the exact nature of the interaction between the caches and how the caches know what data was for what request though, but I did try to make a postulation in the original post. – Lewis Kelsey Nov 05 '18 at 06:55
  • @LewisKelsey: If you read Ulrich Drepper's "What Every Programmer Should Know About Memory?" article, you'll see that one of the major features of SDRAM is that the controller sends a burst-transfer request and the data for the cache line arrives over the next 8 memory-bus cycles, 64 bits at a time. https://en.wikipedia.org/wiki/Synchronous_dynamic_random-access_memory#SDRAM_burst_ordering. The request down the memory hierarchy contains the physical address of the required cache line. (And maybe an offset within the line to allow critical-word-first and early-restart.) – Peter Cordes Nov 05 '18 at 07:00
  • I was aware of that .pdf and it is definitely something I should read at some point to pick out information that I was unaware of, but I just haven't got round to it yet. Anyway, thanks. – Lewis Kelsey Nov 05 '18 at 07:28
  • Okay, I just realised I had a major misapprehension. DIMMs have 64 pins for I/O, which is just glaringly obvious given that the channel width is 64 bits on multiple diagrams but I had fallen under some misapprehension of there being 64 input pins and 8 outputs, largely due to a diagram that threw me off once upon a time, but also due to the fact I thought memory locations in a column store 8 bits each so it made sense in my mind at the time, but actually there are 64 bits to a memory location in a column. Which means 64 bits as opposed to 8 bits are transmitted on a clock edge.. – Lewis Kelsey Nov 05 '18 at 08:06
  • ..hence 64 _bytes_ are transmitted per prefetch x8 burst (and obviously, one cache line is fetched per memory access with DDR3). I have now corrected this in the original post. – Lewis Kelsey Nov 05 '18 at 08:07
  • @LewisKelsey: yup, exactly. I edited my answer a couple minutes ago to address another thing about how data from DRAM goes back up the memory hierarchy: it doesn't have to be in request order. – Peter Cordes Nov 05 '18 at 08:09
  • The WikiChip uarch [diagram](https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(server)) (which is based on documents from Intel) shows the connections between the L1I, L1D, L2, and L3. The L1I LFBs are definitely different from the L1D LFBs, but we don't know their number. The L2 has one unidirectional 64-byte port with the L1I and one bidirectional 64-byte port with the L1D. Although I don't know whether the L2's superqueue can handle two misses per cycle. But we know that the L2 can handle 2 hits per cycle... – Hadi Brais Nov 06 '18 at 02:20
  • ...The L2 also has a bidirectional port (64-byte in Skylake, 32-byte in Haswell) with the L3 slice it is connected to. When the L2 controller receives a line from the L3, it immediately writes it into the corresponding superqueue entry (or entries). Another mystery is whether the superqueue can handle one (or two) misses together with writing an incoming cache line all in the same cycle. Requests at the L2 need to be tagged with the sender ID. Requests at the L3 need to be tagged with yet another sender ID. The requests at the L1I need not be tagged... – Hadi Brais Nov 06 '18 at 02:21
  • ...At the L1D, it depends on who is responsible for coalescing loads to same cache line. The caching agent and home agent are combined into a single unit called CHA in Skylake processors. – Hadi Brais Nov 06 '18 at 02:21
  • My understanding regarding the TLBs is that they don't have their own dedicated ports to the caches; they have to arbitrate for the same ports. – Hadi Brais Nov 06 '18 at 02:26
  • Load request coalescing is also necessary at the L2 and L3 because the L2 might receive at most two requests (from the L1I and L1D) to the same cache line and an L3 slice might receive N requests (which can miss) to the same cache line where N is equal to the number of cores per NUMA node. – Hadi Brais Nov 06 '18 at 02:32
  • The DDR4 max burst length is 8. Although the DDR5 max burst length could be larger, I'm not sure. I don't have access to the DRR5 standard. – Hadi Brais Nov 06 '18 at 02:36
  • @LewisKelsey The total number of bits across all columns in the same bank from all chips in the same rank should be equal to 64 bits. So the number of bits per column is then 64 divided by the number of chips. – Hadi Brais Nov 06 '18 at 02:38
  • @HadiBrais: Thanks for the comments, incorporated / quoted some of them in an update. – Peter Cordes Nov 06 '18 at 02:59
  • https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf Section B.3.5.1-4 and B.3.6 has also answered some of my questions – Lewis Kelsey Nov 06 '18 at 04:46
  • And yeah, distributed CHAs are used on Intel's mesh topology, so their server CPUs – Lewis Kelsey Nov 06 '18 at 05:18
  • https://hal.inria.fr/hal-01403066/document Another resource relating to QHL and GQs – Lewis Kelsey Nov 11 '18 at 07:54