
The L1/L2 caches are inclusive in Intel, and the L1/L2 caches are 8-way associative, meaning that eight different cache lines exist in each set. Cache lines are operated on as a whole, meaning that if I want to remove a few bytes from a cache line, the whole cache line is removed, not only the bytes I want to remove. Am I right?

Now, my question is: whenever a cache line of a set is removed/evicted from the cache, either by some other process or by using clflush (manual eviction of a cache line/block), does the system store the evicted data of that cache line somewhere (in any buffer, register, etc.) so that the next time it can load the data from that place to reduce the latency, compared to loading the data from main memory or a higher cache level, OR does it ALWAYS simply invalidate the data in the cache and load it the next time from the next higher level?

Any suggestion or link to an article would be highly appreciated. Thanks in advance.

Peter Cordes
bholanath

2 Answers


L1/L2 are not necessarily inclusive; only the last-level cache is known to be so, which on i7 is the L3. You are right that a cache line is the basic caching unit: you would have to throw out a whole cache line in order to fill in a new one (and likewise, invalidating even a few bytes invalidates that entire line). You can read some more about that here - http://www.tomshardware.com/reviews/Intel-i7-nehalem-cpu,2041-10.html

When a line is removed, the action taken depends on its MESI state (MESI and its derivatives are the protocols for maintaining cache coherency). If the line is modified ("M"), then the data must be written back to the next cache level (where, in case of a miss, it may allocate, or "write through" to the level after that - depending on the policy that cache maintains). Note that once you reach the last-level cache you are guaranteed to hit, since it's inclusive. When evicting a line from the last-level cache, it has to be written to memory. Either way, failing to write back a modified line would result in a loss of coherency, which would most likely result in incorrect execution.

If the line is not modified (Invalid, Exclusive or Shared), then the CPU may silently drop it with no writeback needed, thereby saving bandwidth. By the way, there are also several other states in more complicated cache protocols (like MESIF or MOESI).
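To make the two cases concrete, here is a minimal sketch in C of the eviction decision described above. The state names follow MESI, but `evict_line` and its behavior are purely illustrative; real hardware makes this decision in the cache controller, not in software:

```c
#include <stdio.h>

/* Illustrative MESI states; this is a software sketch of the hardware
   decision, not a real cache-controller interface. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state;

/* Only a Modified line holds the sole up-to-date copy of its data, so
   only it must be written back on eviction; clean lines can be dropped
   silently because a valid copy exists at an outer level or in memory. */
void evict_line(mesi_state s)
{
    switch (s) {
    case MODIFIED:
        puts("write back to the next level (or memory at the LLC), then invalidate");
        break;
    case EXCLUSIVE:
    case SHARED:
        puts("silently drop: an outer level / memory already has the same data");
        break;
    case INVALID:
        puts("nothing to do: the entry holds no valid data");
        break;
    }
}

int main(void)
{
    evict_line(MODIFIED);
    evict_line(SHARED);
}
```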

You can find lots of explanations by googling for "cache coherence protocols". If you prefer a more solid source, you can refer to any CPU architecture or cache design textbook; I personally recommend Hennessy & Patterson's "Computer Architecture: A Quantitative Approach", which has a whole chapter on cache performance, but that's a bit off topic here.

Small update: as of Skylake, some CPUs (the server segment) no longer have an inclusive L3, but rather a non-inclusive one (to support an enlarged L2). This means that clean lines are also likely to get written back when aging out of the L2, since the L3 does not normally hold copies of them.

More details: https://www.anandtech.com/show/11550/the-intel-skylakex-review-core-i9-7900x-i7-7820x-and-i7-7800x-tested/4

Leeor
  • thanks Leeor for answering. I got this link where they say both L2 and L3 are inclusive: http://www.bit-tech.net/hardware/cpus/2009/09/08/intel-core-i5-and-i7-lynnfield-cpu-review/ How can we confirm the inclusive/exclusive property of L1/L2/L3 on our own system? Is there any way from the command line, or do we need to follow the Intel architecture manual? (A `cpuid`-based sketch follows this comment thread.) – bholanath Oct 24 '13 at 11:03
  • *This means that clean lines are also likely to get written back*. Did you mean *dirty* lines are more likely to go straight to DRAM when evicted from L2? I don't think SKX wastes bandwidth writing back clean lines. But anyway, interesting. I would have guessed that L2 evictions would still allocate in L3 instead of bypassing it, so a later read of that data could potentially hit in L3. Not doing that would make L3 a read-only cache, except for dirty lines requested by other cores. – Peter Cordes Sep 25 '18 at 11:07
  • Oh, earlier in your answer, you do mention write-allocate policy. I think L2 and L3 in Intel CPUs are always write-allocate for write-back from inner caches, regardless of inclusivity. (And yes, L2 is not-inclusive not-exclusive, aka NINE. And so is SKX's L3). Presumably SKX has a tag-inclusive structure or some kind of snoop-filter mechanism to avoid broadcasting invalidate requests to all inner caches for every load from DRAM. – Peter Cordes Sep 25 '18 at 11:17
  • @HadiBrais, a clean line doesn't have to be evicted for correctness, so I'm being careful here - some dead-block prediction mechanisms, for example, may predict that some lines can be silently dropped. As for loss of coherence, I stand corrected. It's keeping the M line without blocking other reads that would cause a coherence issue. – Leeor Sep 26 '18 at 05:34
  • I was talking about cache coherence in the sense of invalid MESI states, (e.g., M+E on different cores). I thought the point you made was that dropping modified data doesn't result in an invalid state, just wrong values. Anyway, let's not nitpick, you're welcome to edit as you see fit. – Leeor Sep 26 '18 at 07:43
  • As for the evictions - the quote doesn't make too much sense (if a line is evicted, it's... evicted?), but the article doesn't mention dead block prediction. Non-inclusiveness doesn't require it, it just *allows* that to be done efficiently, because you can silently drop lines from the L2 and not worry about the writeback. – Leeor Sep 26 '18 at 07:44
  • There was a followup question about your update: [Writeback of clean cache lines in Skylake?](https://stackoverflow.com/q/64754316) - if you meant that it allocates space for them in L3 on eviction from L2 (like a victim cache), I don't think that's correct. It would be a valid design, but I don't think SKX does that. – Peter Cordes Nov 10 '20 at 07:42
  • 1
    @PeterCordes, why don't you agree? Note that I didn't say you always allocate evictions, only *likely* to. The actual behavior is likely predictor-driven, as can be inferred from their description in the latest (2020) optimization manual: "Based on the access pattern, size of the code and data accessed, and sharing behavior between cores for a cache block, the last level cache may appear as a victim cache of the mid-level cache" (https://software.intel.com/content/www/us/en/develop/download/intel-64-and-ia-32-architectures-optimization-reference-manual.html) – Leeor Nov 16 '20 at 09:57
  • In other words: They may learn from L2 evictions that show further reuse after being evicted, and later preserve evictions with similar characteristics (access patterns, sharing...) by allocating them in the L3. As to which characteristics - I don't suppose they'll disclose that, but this paper shows a possible implementation of this predictor - https://dl.acm.org/doi/abs/10.1145/2000064.2000075 – Leeor Nov 16 '20 at 10:05
  • Meant to reply sooner. I had forgotten or never noticed that SKX L3 could and would sometimes act as a victim cache. That makes sense given its size relative to L2. – Peter Cordes Dec 16 '20 at 03:47
  • @PeterCordes, I suppose that was the main reason to switch from inclusive to non-inclusive L3. If it's not there to store the lines you *have* in the core, then you have to store the ones you *used* to have (otherwise you're left with lines you're *going* to have, and judging by their stock Intel isn't that good at predicting the future anymore) – Leeor Dec 22 '20 at 13:27
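Regarding the question in the comments about confirming inclusivity on your own system: on Intel CPUs, CPUID leaf 4 (deterministic cache parameters) reports, for each cache level, a "cache inclusiveness" flag in EDX bit 1 (1 = inclusive of lower cache levels), along with the associativity, line size, and set count. Below is a minimal C sketch using GCC/Clang's `<cpuid.h>`; note that leaf 4 is Intel-specific (AMD exposes similar data through leaf 0x8000001D instead):

```c
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    for (unsigned i = 0; ; i++) {
        unsigned eax, ebx, ecx, edx;
        if (!__get_cpuid_count(4, i, &eax, &ebx, &ecx, &edx))
            break;                            /* leaf 4 not supported */
        unsigned type = eax & 0x1f;           /* 0 = no more cache levels */
        if (type == 0)
            break;
        unsigned level = (eax >> 5) & 0x7;
        unsigned line  = (ebx & 0xfff) + 1;          /* line size in bytes  */
        unsigned parts = ((ebx >> 12) & 0x3ff) + 1;  /* physical partitions */
        unsigned ways  = ((ebx >> 22) & 0x3ff) + 1;  /* associativity       */
        unsigned sets  = ecx + 1;
        printf("L%u %-7s: %5u KiB, %2u-way, %u B lines, %s\n",
               level,
               type == 1 ? "data" : type == 2 ? "instr" : "unified",
               ways * parts * line * sets / 1024, ways, line,
               (edx & 2) ? "inclusive of lower levels"
                         : "not inclusive of lower levels");
    }
    return 0;
}
```

One caveat: "not inclusive" here covers both exclusive and NINE designs, which this CPUID bit does not distinguish.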

The L1/L2 caches are inclusive in Intel

With respect to cache inclusivity, Intel x86 processors fall into one of the following categories:

  • There are three levels of caches. The L3 is inclusive of the L2 and L1. The L2 is NINE of the L1 (Not Inclusive, Not Exclusive). This category includes all of the following processors: (1) All client processors that implement the Core microarchitecture up to and including Rocket Lake, except for the Core X and Xeon W processor series designed for the client market segment. This also includes the Xeon W-10000 series for the client segment. (2) All server processors that implement the Core microarchitecture up to and including BDX, and (3) All Xeon E3, Xeon E, and Xeon W-1200 processors.
  • There are two levels of caches. The L2 is NINE of the L1. All Atom processors (including Tremont) belong to this category. All old Intel processors (with two cache levels) also belong here.
  • There are two levels of caches. The L2 is inclusive of the L1D and NINE of the L1I. KNL and KNM processors belong here. The information available for KNC and KNF says that the L2 is inclusive of the L1, although this could be inaccurate and the L2 may be inclusive of only the L1D on these processors too. See below for MCDRAM.
  • There are three levels of caches. The L3 and the L2 are both NINE. This category includes all of the following processors: (1) All Pentium 4 processors with three levels of caches, (2) All generations of Xeon SP processors, (3) Xeon D-2100, Skylake Core X series processors, Skylake Xeon W series processors, which all use the SKX uncore rather than the SKL uncore, and (4) All Tiger Lake processors.
  • Lakefield processors have a three-level cache hierarchy. The 4 Tremont cores share a NINE L2 and the Sunny Cove core has its own NINE L2. All of the 5 cores share an LLC that can be configured as either inclusive or NINE.

Some processors have an L4 cache or a memory-side cache. These caches are NINE. In KNL and KNM, if MCDRAM is fully or partially configured to operate in cache mode, it's modified-inclusive of the L2 (and therefore the L1), meaning that inclusivity only applies to dirty lines in the L2 (in the M coherence state). On CSL processors that support Optane DIMMs, if the PMEM DIMMs are fully or partially configured to operate in cache mode, the DRAM DIMMs work as follows:

The Cascade Lake processor uses a novel cache management scheme using a combination of inclusive and noninclusive DRAM cache to reduce DRAM bandwidth overhead for writes while also eliminating the complexity of managing invalidates to processor caches on the eviction of an inclusive line from DRAM cache.

according to Cascade Lake: Next Generation Intel Xeon Scalable Processor.

The MCDRAM cache in KNL/KNM and DRAM cache in CSL do not fall in any of the three traditional inclusivity categories, namely inclusive, exclusive, and NINE. I think we can describe them as having "hybrid inclusivity."


AMD processors:

  • Zen family: The L2 is inclusive and the L3 is NINE.
  • Bulldozer family: The L2 and the L3 are both NINE.
  • Jaguar and Puma: The L2 is inclusive. There is no L3.
  • K10 and Fusion: The L2 is exclusive. There is no L3.
  • Bobcat: I don't know about the L2. There is no L3.
  • K7 (models 3 and later) and K8: The L2 is exclusive. There is no L3.
  • K7 (models 1 and 2) and older: The L2 is inclusive. There is no L3.

No existing AMD processor has an L4 cache or a memory-side cache beyond the L3.

VIA processors:

  • Nano C and Eden C: I don't know about the L2. There is no L3.
  • All older processors: The L2 is exclusive. There is no L3.

This covers all current VIA processors.


and the L1/L2 caches are 8-way associative, meaning that eight different cache lines exist in each set.

This is true on most Intel processors. The only exception is the NetBurst microarchitecture, where a single L2 way holds two adjacent cache lines, collectively called a sector.

An associativity of 8 is typical, but it's not uncommon to have different associativities. For example, the L1D in Sunny Cove is 12-way associative. See: How does the indexing of the Ice Lake's 48KiB L1 data cache work?
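To make "a set holds 8 lines" concrete, here is a toy index/tag computation in C for a conventional 32 KiB, 8-way cache with 64-byte lines (64 sets). The geometry is illustrative, not tied to any particular CPU:

```c
#include <stdint.h>
#include <stdio.h>

/* Toy geometry: 32 KiB, 8-way, 64-byte lines => 64 sets.
   Purely illustrative; real L1 indexing may differ (see the linked Q&A). */
enum {
    LINE_SIZE  = 64,
    NUM_WAYS   = 8,
    CACHE_SIZE = 32 * 1024,
    NUM_SETS   = CACHE_SIZE / (LINE_SIZE * NUM_WAYS)   /* = 64 */
};

int main(void)
{
    uintptr_t addr = 0x12345678;
    uintptr_t set  = (addr / LINE_SIZE) % NUM_SETS;  /* address bits 6..11 */
    uintptr_t tag  = addr / (LINE_SIZE * NUM_SETS);  /* bits 12 and up     */
    printf("address %#lx maps to set %lu (one of %d sets, %d ways each), tag %#lx\n",
           (unsigned long)addr, (unsigned long)set, NUM_SETS, NUM_WAYS,
           (unsigned long)tag);
}
```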

Cache lines are operated on as a whole, meaning that if I want to remove a few bytes from a cache line, the whole cache line is removed, not only the bytes I want to remove. Am I right?

Right. This is due to a limitation of how coherence state is tracked: there is only one coherence state associated with each cache entry at each cache level, and that single state covers all of the bytes of the line.
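You can observe this whole-line granularity yourself with the `clflush` instruction the question mentions. The rough x86 sketch below (GCC/Clang intrinsics) flushes a line via one byte and then times a load of a *different* byte in the same line; the second load misses because the entire 64-byte line was evicted. The timing method is deliberately crude; a real measurement would need warm-up, serialization, and many iterations:

```c
#include <emmintrin.h>   /* _mm_clflush, _mm_mfence (SSE2) */
#include <x86intrin.h>   /* __rdtscp */
#include <stdio.h>

/* Crude single-shot load timer; good enough to see hit vs. miss. */
static unsigned long long timed_load(volatile char *p)
{
    unsigned aux;
    unsigned long long t0 = __rdtscp(&aux);
    (void)*p;                              /* the load being timed */
    unsigned long long t1 = __rdtscp(&aux);
    return t1 - t0;
}

int main(void)
{
    static volatile char buf[64] __attribute__((aligned(64)));

    buf[0] = 1;                            /* bring the whole line into cache */
    printf("hit : ~%llu cycles\n", timed_load(&buf[40]));

    _mm_clflush((const void *)&buf[0]);    /* flush *via byte 0*... */
    _mm_mfence();                          /* ensure the flush completed */
    printf("miss: ~%llu cycles\n", timed_load(&buf[40]));  /* ...byte 40 misses too */
}
```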

does the system store the evicted data of that cache line somewhere (in any buffer, register, etc.) so that next time it can load the data from that place to reduce the latency

There are several factors that impact this decision: (1) whether the line is dirty, (2) the inclusivity properties of the higher-numbered cache levels, if any, (3) whether the line is predicted to be accessed in the near future, and (4) if I remember correctly, if the memory type of a line changed from cacheable to uncacheable while it's resident in a cache, it'll be evicted and not cached in any other levels irrespective of the previous factors.

So a lazy answer that works for all processors is "maybe."
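To condense these factors into one picture, here is a purely hypothetical decision sketch in C. The struct and function names are made up for illustration; real CPUs resolve this in the cache controller with implementation-specific policies and predictors:

```c
#include <stdio.h>

/* Hypothetical condensation of the four factors listed above; not a
   real hardware interface. */
typedef struct {
    int dirty;              /* (1) line is in the M state                 */
    int outer_is_inclusive; /* (2) the next level already holds a copy    */
    int predicted_reuse;    /* (3) a reuse/dead-block predictor says keep */
    int became_uncacheable; /* (4) memory type changed while cached       */
} evict_ctx;

const char *evict_action(evict_ctx c)
{
    if (c.became_uncacheable) return "evict without reallocating at any level";
    if (c.dirty)              return "write back to the next level";
    if (c.outer_is_inclusive) return "drop silently: a copy already exists outside";
    if (c.predicted_reuse)    return "allocate in the next level (victim-style)";
    return "drop silently";
}

int main(void)
{
    evict_ctx clean_reused = { .dirty = 0, .outer_is_inclusive = 0,
                               .predicted_reuse = 1, .became_uncacheable = 0 };
    puts(evict_action(clean_reused));
}
```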

Hadi Brais
  • Skylake L2 is either 256kiB / 4-way in SKL client (down from 8 in Broadwell) or 1MiB / 16-way in SKX server (used in some high-end i7 models). Interesting, I didn't remember reading SKX had inclusive L2 caches. https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(server) doesn't mention that. https://www.anandtech.com/show/11550/the-intel-skylakex-review-core-i9-7900x-i7-7820x-and-i7-7800x-tested/4 claims Skylake client and server both have inclusive L2 caches (but I wouldn't trust it very much, IIRC there were other inaccuracies in it.) – Peter Cordes Jan 22 '20 at 23:22
  • Intel's current optimization manual (https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf) says nothing about SKX having inclusive L2. [According to Kanter on RWT](https://www.realworldtech.com/haswell-cpu/5/), Haswell has non-inclusive (NINE) L2, so that would be a change for SKX but we see no mention of that in anything I've looked at. – Peter Cordes Jan 23 '20 at 03:20
  • @PeterCordes Good catch, dude! The `cpuid` leaf 4 dumps available from [InstLatx64](https://github.com/InstLatx64/InstLatx64/tree/master/GenuineIntel) show the L2 (and L3) in SKX is non-inclusive (and there is no `cpuid` errata). It was widely reported by non-Intel sources that the L2 is inclusive in these processors. Even numerous research papers mention that the L2 is inclusive. It didn't occur to me that this could be wrong. – Hadi Brais Jan 23 '20 at 08:04
  • 1
    Weird, I wonder where that bit of misinformation originated. It seemed surprising to me since it's not shared. The only reason I could see for making it inclusive would be if the snoop filter was only probabilistic, then inclusive L2 could insulate L1 from some invalidations / write-back requests for lines that core doesn't have. But that wouldn't apply often enough to matter if the snoop filter keeps full track of everything. And L2 doesn't use a large line size. (BTW, https://en.wikipedia.org/wiki/CPU_cache#Exclusive_versus_inclusive lists some possible advantages). – Peter Cordes Jan 23 '20 at 08:27
  • I noticed that Wikichip lists Zen and Zen 2 L2 as "inclusive of L1" (https://en.wikichip.org/wiki/amd/microarchitectures/zen_2#Memory_Hierarchy). If that's real, then maybe that was the source of a mixup with Intel uarches? I don't see what Zen 2 gains from it, though. Could they maybe broadcast invalidates within a CCX instead of having perfect snoop filtering? Or was that somehow related to L1i being 64k / 4-way and thus needing tricks to be VIPT in Zen 1 / Zen+? – Peter Cordes Jan 29 '20 at 23:54
  • @Peter - I'm pretty sure the snoop filter behavior is non-exact on all recent chips, because of the possibility of silent cast-out from the L2. – BeeOnRope Jan 30 '20 at 03:19
  • @BeeOnRope: That wasn't what I had in mind. I was wondering if it was possible to have an even weaker but still useful snoop filter that could maybe only keep track of some fraction of the lines, and might have to fall back to broadcasting an invalidate for lines where it didn't know if they might be present or not. e.g. aren't the snoop filters in Xeon E7 series CPUs (BDW and earlier) that filter snoops between sockets too small to track everything, and only end up tracking some hot remote lines? Or maybe the reason for an inclusive L2 isn't directly to do with snooping at all. – Peter Cordes Jan 30 '20 at 03:34
  • @PeterCordes - yes, that's what I understood, but I was referring to: _The only reason I could see for making it inclusive would be if the snoop filter was only probabilistic, then inclusive L2 could insulate L1 from some invalidations / write-back requests for lines that core doesn't have._ Well, even if the snoop-filter is not probabilistic, it is still not exact due to silent cast outs, and in the case of silent cast outs an inclusive L2 can still insulate L1 from snoops. – BeeOnRope Jan 30 '20 at 05:16
  • I don't know how the snoop filters work in E7, but aren't we searching for a reason for something that doesn't exist? I thought the comments above concluded that Intel chips are generally NINE in their L2, so we don't need to guess at reasons why they have inclusive L2, because they don't? – BeeOnRope Jan 30 '20 at 05:18
  • @BeeOnRope: My last couple comments were about why *Zen / Zen 2* have inclusive L2 (assuming wikichip is correct about that). Yes, Intel L2 caches are NINE. Good point about insulating L1 from some snoops just because of silent dropping of lines; I was assuming that was insignificant but maybe it's worth making L2 inclusive for that, if it's enough larger and associative enough like it is in Zen. – Peter Cordes Jan 30 '20 at 05:28
  • 1
    @PeterCordes There is another potential benefit to making a writeback cache inclusive besides filtering snoops, which is to enable the writeback cache to handle writebacks from a lower-numbered cache efficiently. This is possible because a writeback can never miss in the higher-numbered inclusive cache, so there is no need to handle this case in the design. (Note that this benefit doesn't apply to writethrough caches.) This is precisely why cache-mode MCDRAM in KNL/KNM is modified-inclusive. Regarding Zen/Zen2, the AMD manual does say that the L2 is inclusive. – Hadi Brais Jan 30 '20 at 09:36
  • Oh good point. I know Zen was essentially fully redesigned, but Bulldozer-family had write-through L1d with a small 4k buffer. So perhaps L1d writing to L2 was a sensitive issue or just on their minds because of that. But anyway, it was a problem they had to re-solve for Zen, throwing out whatever BD-family did. – Peter Cordes Jan 30 '20 at 09:43