
I was reading about the pros and cons of split vs. unified cache designs in this thread.

Based on my understanding, the primary advantage of the split design is that it lets us place the instruction cache close to the instruction-fetch unit and the data cache close to the memory unit, thereby reducing the latencies of both at the same time. The primary disadvantage is that the combined space of the instruction and data caches may not be utilized efficiently; simulations have shown that a unified cache of the same total size has a higher hit rate.

However, I couldn't find an intuitive answer to the question "Why do L1 caches (at least in most modern processors) follow the split design, while the L2/L3 caches follow the unified design?"


1 Answer


Most of the reason for split L1 is to distribute the necessary read/write ports (and thus bandwidth) across two caches, and to place them physically close to data load/store vs. instruction-fetch parts of the pipeline.

There's also the fact that L1d has to handle byte loads/stores (and on some ISAs, unaligned wider loads/stores). To handle that with maximum efficiency on x86 CPUs (not as an RMW of the containing word(s)), Intel's L1d may use only parity, not ECC. L1i only has to handle fixed-width fetches, often something simple like an aligned 16-byte chunk, and it's always "clean" because it's read-only, so it only needs to detect errors (not correct them) and re-fetch. So it can have less overhead per line of data, like only a couple of parity bits per 8 or 16 bytes.
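To put rough numbers on that overhead difference, here's a back-of-the-envelope sketch in plain C (the granule sizes and function names are my own illustrative choices, not a description of any real Intel array): it counts the check bits a textbook Hamming SECDED code needs per ECC granule, versus one parity bit per 8-byte chunk.

    #include <stdio.h>

    /* Check bits for a textbook Hamming SECDED code over k data bits:
     * the smallest r with 2^r >= k + r + 1, plus one extra bit for
     * double-error detection.  Illustrative arithmetic only. */
    static int secded_check_bits(int k)
    {
        int r = 0;
        while ((1 << r) < k + r + 1)
            r++;
        return r + 1;
    }

    int main(void)
    {
        /* ECC granule sizes in data bits: 1, 4, and 8 bytes (assumed values) */
        int granules[] = { 8, 32, 64 };
        for (int i = 0; i < 3; i++) {
            int k = granules[i];
            int c = secded_check_bits(k);
            printf("%2d-byte ECC granule: %d check bits (%4.1f%% overhead)\n",
                   k / 8, c, 100.0 * c / k);
        }
        /* Parity only: 1 bit per 8-byte chunk, and it can only detect. */
        printf(" 8-byte parity granule: 1 check bit  (%4.1f%% overhead)\n",
               100.0 / 64);
        return 0;
    }

That works out to roughly 62.5% storage overhead for per-byte SECDED versus about 1.6% for parity over 8-byte chunks, which is one way to see why a read-only, parity-protected L1i is cheaper per byte of capacity than an L1d that wants to absorb byte stores without an RMW.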

See Why is the size of L1 cache smaller than that of the L2 cache in most of the processors? re: it being impossible to build one large unified L1 cache with twice the capacity, same latency, and sum total of the bandwidth as a split L1i/d. (At least prohibitively more expensive for power due to size and number of read/write ports, but potentially actually impossible for latency because of physical-distance reasons.)

None of those factors are important for L2 (or exist at all in the case of unaligned / byte stores). Total capacity that can be used for code or data is most useful there, competitively shared based on demand.

It would be very rare for any workload to have lots of L1i and L1d misses in the same clock cycle, because frequent code misses mean the front end stalls, and the back-end will run out of load/store instructions to execute. (Frequent L1i misses are rare, but frequent L1d misses do happen in some normal workloads, e.g. looping over an array that doesn't fit in L1d, or a large hash table or other more scattered access pattern.) Anyway, this means data can get most of the total L2 bandwidth budget under normal conditions, and a unified L2 still only needs 1 read port.
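As a concrete (made-up) example of that second kind of workload, the loop below touches far more data than any L1d holds, so its loads miss L1d constantly while its handful of instructions stay resident in L1i the whole time; the array size and function name are illustrative only.

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Working set: 64 MiB of int64_t, far larger than a typical 32-48 KiB L1d,
     * so most loads miss L1d; the tiny loop body never misses L1i. */
    #define N (8u * 1024u * 1024u)

    static int64_t sum_big_array(const int64_t *a)
    {
        int64_t total = 0;
        for (size_t i = 0; i < N; i++)      /* streaming loads: L1d-miss heavy */
            total += a[i];
        return total;
    }

    int main(void)
    {
        int64_t *a = calloc(N, sizeof *a);  /* 64 MiB, zero-filled */
        if (!a)
            return 1;
        printf("%lld\n", (long long)sum_big_array(a));
        free(a);
        return 0;
    }

While something like this runs, the data side can use essentially all of the L2 bandwidth, which is the point about a unified L2 getting by with a single read port.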

@Hadi's answer that you linked does cover most of these reasons, but I guess it doesn't hurt to write a simplified / summary answer.

Peter Cordes
  • Just noticed this. Good summary. But I'm trying to wrap my head around the part about the byte loads/stores. You can certainly design a unified cache that supports unrestricted addressing. Addressing the L1I is simpler. For example, in Intel processors, all fetches into the instruction byte buffer are 16-byte aligned, so the IFU can omit the lowest 4 bits of the physical address when looking up the IFU memory structures (L1I, victim cache, ISB). This results in slightly less area and power compared to a unified design, but I don't know of anyone who considers this to be a significant saving. – Hadi Brais Mar 19 '21 at 21:43
  • @HadiBrais: Hmm, now that I think about it, if you did have a unified cache with twice the size and the aggregate total of read ports, the instruction-fetch read port could still be simpler. At least for reading, most of the work of handling unaligned-within-line is in hardware that exists once per read port, not once per line of data. And for writing, IDK if there's much saving in addressing. – Peter Cordes Mar 19 '21 at 21:53
  • @HadiBrais: But the point about ECC stands: if you want to be able to update any individual collection of bytes, you either need word-RMW when not writing a full ECC granule, or your ECC granules need to be 1B (high overhead), or you need to use just parity like it's rumoured that Intel does for L1d. That cost scales with the array size, so having half of your L1 cache be I-cache lets that half use more efficient ECC. Perhaps you were separating this from the other machinery of byte / unaligned load/store. – Peter Cordes Mar 19 '21 at 21:54
  • Yeah it's valid (and I have not mentioned this in my answer). The number of data accesses is usually much larger than the number of L1I accesses, so the L1D may require ECC-level protection but parity may be sufficient for the L1I. With the unified design, every entry would require ECC, significantly increasing area and power overhead (and possibly degrading performance) compared to split. Do you know of any real processor that uses ECC for the L1I? I can't seem to remember any. – Hadi Brais Mar 19 '21 at 22:11
  • @HadiBrais: Oh right, I was forgetting that L1i is special because it's never dirty: it can just re-read if it detects an error (see the sketch after these comments). So yeah, normally just parity sounds right. – Peter Cordes Mar 19 '21 at 22:12
  • 1
    It's likely that the L1D uses ECC and not parity in most processors (not just those from Intel). I remember discussing with you a tool on Linux that shows what error detection technique is used in each cache level (but we were not sure where the tool is getting the data from). I couldn't find the discussion (I think it's in the comment section of some related Q/A). Anyway, I remember the tool reporting ECC for the L1D, which is very likely correct. – Hadi Brais Mar 19 '21 at 22:23
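To make the detect-vs-correct point from this comment thread concrete, here's a toy software model (every name and data structure in it is invented for illustration; it is not how any real hardware is specified): because L1i is read-only and never dirty, a detected parity error can be handled by simply re-fetching the clean copy from the next level.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* One 16-byte parity granule of an L1i line (toy model, invented layout). */
    typedef struct {
        uint8_t bytes[16];
        uint8_t parity;   /* 1 bit is enough; a whole byte keeps the C simple */
    } Granule;

    /* Even parity over all 128 bits of the granule. */
    static uint8_t bit_parity(const uint8_t *p, size_t n)
    {
        uint8_t x = 0;
        for (size_t i = 0; i < n; i++)
            x ^= p[i];
        x ^= x >> 4; x ^= x >> 2; x ^= x >> 1;
        return x & 1;
    }

    /* Stand-in for L2, which always has a good copy of the (read-only) code:
     * here, the x86 machine code for "mov eax, 42; ret". */
    static const uint8_t l2_copy[16] = { 0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3 };

    /* L1i is never dirty, so on a parity mismatch we can just drop the granule
     * and re-fetch from L2: detection alone is enough. */
    static void l1i_read(Granule *g)
    {
        if (bit_parity(g->bytes, sizeof g->bytes) != g->parity) {
            puts("L1i parity error: re-fetching clean copy from L2");
            memcpy(g->bytes, l2_copy, sizeof g->bytes);
            g->parity = bit_parity(g->bytes, sizeof g->bytes);
        }
        /* ...hand g->bytes to the decoders... */
    }

    int main(void)
    {
        Granule g;
        memcpy(g.bytes, l2_copy, sizeof g.bytes);
        g.parity = bit_parity(g.bytes, sizeof g.bytes);

        g.bytes[3] ^= 0x10;   /* inject a single-bit flip */
        l1i_read(&g);         /* detects it and silently recovers */
        return 0;
    }

A write-back L1d, by contrast, can hold the only up-to-date copy of a dirty line, so it can't recover by re-fetching; that asymmetry is why the instruction side can get away with cheaper protection.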