The L1/L2 cache are inclusive in Intel
Intel x86 processors with respect to cache inclusivity fall into one of the following categories:
- There are three levels of caches. The L3 is inclusive of the L2 and L1. The L2 is NINE of the L1 (Not Inclusive, Not Exclusive). This category includes all of the following processors: (1) All client processors that implement the Core microarchitecture up to and including Rocket Lake, except for the Core X and Xeon W processor series designed for the client market segment. This also includes the Xeon W-10000 series for the client segment. (2) All server processors that implement the Core microarchitecture up to and including BDX, and (3) All Xeon E3, Xeon E, and Xeon W-1200 processors.
- There are two levels of caches. The L2 is NINE of the L1. All Atom processors (including Tremont) belong to this category. All old Intel processors (with two cache levels) also belong here.
- There are two levels of caches. The L2 is inclusive of the L1D and NINE of the L1I. KNL and KNM processors belong here. The information available for KNC and KNF says that the L2 is inclusive of the L1, although this could be inaccurate and the L2 may be only inclusive of the L1D on these processors too. See below for MCDRAM.
- There are three levels of caches. The L3 and the L2 are both NINE. This category includes all of the following processors: (1) All Pentium 4 processors with three levels of caches, (2) All generations of Xeon SP processors, (3) Xeon D-2100, Skylake Core X series processors, Skylake Xeon W series processors, which all use the SKX uncore rather than the SKL uncore, and (4) All Tiger Lake processors.
- Lakefield processors have a three-level cache hierarchy. The 4 Tremont cores share a NINE L2 and the Sunny Cove core has its own NINE L2. All of the 5 cores share an LLC that can be configured as either inclusive or NINE.
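To make the difference between "inclusive" and "NINE" concrete, here is a minimal toy model of a two-level hierarchy. Everything in it is an assumption made for illustration: the names `Level` and `run`, the tiny capacities, full associativity, and plain LRU replacement. It is not how any of the processors above are actually implemented. The only thing it shows is the defining rule: an inclusive L2 back-invalidates the L1 when it evicts a line, while a NINE L2 does not.

```python
from collections import OrderedDict


class Level:
    """Toy fully associative LRU cache that tracks line addresses only."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # oldest entry first

    def touch(self, line):
        """Insert or refresh a line; return the evicted line, or None."""
        if line in self.lines:
            self.lines.move_to_end(line)
            return None
        self.lines[line] = True
        if len(self.lines) > self.capacity:
            victim, _ = self.lines.popitem(last=False)
            return victim
        return None

    def drop(self, line):
        """Remove a line if present (used for back-invalidation)."""
        self.lines.pop(line, None)


def run(accesses, policy, l1_size=2, l2_size=3):
    """Replay line addresses; policy is 'inclusive' or 'nine' (hypothetical names)."""
    if policy not in ("inclusive", "nine"):
        raise ValueError("policy must be 'inclusive' or 'nine'")
    l1, l2 = Level(l1_size), Level(l2_size)
    l1_hits = 0
    for line in accesses:
        if line in l1.lines:
            l1_hits += 1
            l1.touch(line)  # the L2 does not observe L1 hits
            continue
        victim = l2.touch(line)  # L1 miss: look up / fill the L2
        if victim is not None and policy == "inclusive":
            l1.drop(victim)  # inclusive L2 back-invalidates the L1
        l1.touch(line)  # fill the L1; its clean victim is simply dropped
    return l1_hits, set(l1.lines), set(l2.lines)
```

With the trace `[0, 1, 0, 2, 0, 3, 0]`, line 0 stays hot in the L1, so the L2 never sees those hits and eventually picks line 0 as its LRU victim. In the inclusive run that eviction also removes line 0 from the L1 (2 L1 hits in total). In the NINE run line 0 survives in the L1 without a copy in the L2 (3 L1 hits).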
Some processors have an L4 cache or a memory-side cache. These caches are NINE. In KNL and KNM, if MCDRAM is fully or partially configured to operate in cache mode, it's modified-inclusive of the L2 (and therefore the L1), meaning that inclusivity only applies to dirty lines in the L2 (in the M coherence state). On CSL processors that support Optane DIMMs, if the PMEM DIMMs are fully or partially configured to operate in cache mode, the DRAM DIMMs work as follows:
The Cascade Lake processor uses a novel cache management scheme using
a combination of inclusive and noninclusive DRAM cache to reduce DRAM
bandwidth overhead for writes while also eliminating the complexity
of managing invalidates to processor caches on the eviction of an
inclusive line from DRAM cache.
according to Cascade Lake: Next Generation Intel Xeon Scalable Processor.
The MCDRAM cache in KNL/KNM and DRAM cache in CSL do not fall in any of the three traditional inclusivity categories, namely inclusive, exclusive, and NINE. I think we can describe them as having "hybrid inclusivity."
AMD processors:
- Zen family: The L2 is inclusive and the L3 is NINE.
- Bulldozer family: The L2 is NINE and the L3 is NINE.
- Jaguar and Puma: The L2 is inclusive. There is no L3.
- K10 and Fusion: The L2 is exclusive. There is no L3.
- Bobcat: I don't know about the L2. There is no L3.
- K7 (models 3 and later) and K8: The L2 is exclusive. There is no L3.
- K7 (models 1 and 2) and older: The L2 is inclusive. There is no L3.
No existing AMD processor has an L4 cache or a memory-side cache beyond the L3.
VIA processors:
- Nano C and Eden C: I don't know about the L2. There is no L3.
- All older processors: The L2 is exclusive. There is no L3.
This covers all current VIA processors.
and L1 / L2 cache is 8 way associativity, means in a set there are 8
different cache lines exist.
This is true on most Intel processors. The only exception is the NetBurst microarchitecture where a single L2 way holds two adjacent cache lines, collectively called a sector.
An associativity of 8 is typical, but it's not uncommon to have different associativities. For example, the L1D in Sunny Cove is 12-way associative. See "How does the indexing of the Ice Lake's 48KiB L1 data cache work?"
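As a rough illustration of what associativity means for the cache geometry, the sketch below computes the number of sets and the set index of an address. The function names are hypothetical, and two things are assumed rather than taken from the text above: a 64-byte line size and simple modulo indexing on the address bits just above the line offset. Higher cache levels on real processors may hash the index, so treat this as a model of a basic L1-style lookup only.

```python
def cache_geometry(size_bytes, ways, line_bytes=64):
    """Return (sets, offset_bits, index_bits) for a set-associative cache.

    Assumes line_bytes is a power of two (64 is an assumption here).
    """
    sets, remainder = divmod(size_bytes, ways * line_bytes)
    if remainder or sets == 0 or sets & (sets - 1):
        raise ValueError("geometry must give a power-of-two number of sets")
    offset_bits = line_bytes.bit_length() - 1  # log2(line_bytes)
    index_bits = sets.bit_length() - 1         # log2(sets)
    return sets, offset_bits, index_bits


def set_index(addr, size_bytes, ways, line_bytes=64):
    """Set selected by addr under plain modulo indexing (an assumption)."""
    sets, offset_bits, _ = cache_geometry(size_bytes, ways, line_bytes)
    return (addr >> offset_bits) & (sets - 1)


# A 32 KiB 8-way cache and a 48 KiB 12-way cache both end up with 64 sets:
# the extra capacity comes from more ways per set, not from more sets.
geometry_8way = cache_geometry(32 * 1024, 8)
geometry_12way = cache_geometry(48 * 1024, 12)
```

Each set then holds `ways` distinct cache lines, which is what "8-way" or "12-way" refers to.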
The cache lines are operated as a whole, means if I want to remove few
bytes from a cache line, the whole cache line will be removed , not
the only those bytes which I want to remove. Am I right ?
Right, this is due to a limitation in the coherence state associated with each cache entry of each cache level. There is only one state for all of the bytes of a cache line.
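A small sketch of that line granularity: given a byte range, it lists the cache lines that range overlaps, and those whole lines are the units that would be affected. The helper name `lines_touched` and the 64-byte line size are assumptions for illustration.

```python
def lines_touched(addr, nbytes, line_bytes=64):
    """Return the base addresses of every cache line overlapped by
    the byte range [addr, addr + nbytes). line_bytes must be a power of two."""
    if nbytes <= 0:
        return []
    mask = ~(line_bytes - 1)
    first = addr & mask                  # line containing the first byte
    last = (addr + nbytes - 1) & mask    # line containing the last byte
    return list(range(first, last + line_bytes, line_bytes))
```

For example, an 8-byte range starting at `0x1003C` straddles a line boundary, so it maps to two lines (`0x10000` and `0x10040`), and removing those 8 bytes from the cache means removing all 128 bytes of both lines.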
does system store the evicted data of that cache line somewhere (in
any buffer, register etc) so that next time it can load the data from that place to reduce the latency
There are several factors that impact this decision: (1) whether the line is dirty, (2) the inclusivity properties of the higher-numbered cache levels, if any, (3) whether the line is predicted to be accessed in the near future, and (4) if I remember correctly, if the memory type of a line changed from cacheable to uncacheable while it's resident in a cache, it'll be evicted and not cached in any other levels irrespective of the previous factors.
So a lazy answer that works for all processors is "maybe."