
It's well documented that L2 is non-inclusive with respect to L1d, meaning that L2 does not have to contain every line that L1d holds.

Can an L1d miss (read or RFO) that also misses L2 fill the L1d line without filling the corresponding L2 line? Is there any explanation of this in Intel's manuals? Update: there is, in Intel SDM Vol. 3, in the section on memory types.

Or to rephrase the question: does a lookup that misses L2 always cause the line to be filled into L2?

After some digging I found the answer myself. It is a property of the write-back memory type, not of a particular cache level:

Write-back (WB) — Writes and reads to and from system memory are cached. Reads come from cache lines on cache hits; read misses cause cache fills.

Some Name

1 Answer


The answer depends on the cache inclusion policy of the outer caches. We can safely assume that read-allocate happens at every cache level unless otherwise specified (e.g. an exclusive or victim cache).

On Intel, NT prefetch (prefetchnta) can bypass L2 (filling only L1d and, on CPUs with an inclusive L3, a single way of L3), but normal demand loads (and SW prefetches other than prefetchnta) are fetched through L2 and allocate in L2 as well as L1d.

The above applies to most CPUs (those with a NINE L2: not inclusive, not exclusive). But some microarchitectures have an L2 that is exclusive of L1d, and there the answer is no: a line is allocated only in L1d at first, moving to L2 when it's evicted from L1d. AMD has experimented more with exclusive caches than Intel.


AMD has built some CPUs with exclusive and/or victim caches, e.g. Zen's per-CCX L3 is a victim cache for the L2 caches in that complex of 4 cores (https://en.wikichip.org/wiki/amd/microarchitectures/zen#Memory_Hierarchy, https://www.anandtech.com/show/11170/the-amd-zen-and-ryzen-7-review-a-deep-dive-on-1800x-1700x-and-1700/9). Skylake-X / Cascade Lake's non-inclusive L3 is also a victim cache for L2.

In those CPUs, reads don't allocate in L3, only L2 and L1d. (Or L1i for code fetches).

Barcelona (aka K10) has a shared L3, and an L1/L2 that are exclusive of each other (source: David Kanter's excellent writeup). So on K10, yes a line allocated in L1d will definitely not be allocated in L2. The line evicted from L1d to make room for the new line will typically be moved to L2, evicting an older line from L2.

K8 had the same L2 exclusive of L1d, but no shared L3.

Also related: Which cache mapping technique is used in intel core i7 processor?


It is a property of Write-back memory type, not a cache level ... read misses cause cache fills.

Intel's Vol. 3 manual gives only abstract, future-proof guarantees. It guarantees only that the line will be cached somewhere in the cache hierarchy.

For any sane design, that will include L1d, in anticipation of other reads of the same line (immediate spatial locality is very common). But it doesn't have to include L2 or even L3 right away, depending on the design; i.e. "cache fills" doesn't mean fills at all levels.

x86 doesn't guarantee anything on paper about having more than one level of cache. (Or even that there is a cache, except for the parts of the ISA docs about cache-as-RAM mode and stuff like that.) The docs are written assuming a CPU with at least 2 levels because that's been the case since P6 (and P5 with motherboards that provided an L2 cache), but anything like clflush should be read as "assuming there is a cache".

Peter Cordes
  • Yeah, I suppose _read misses cause cache fills_ that I just updated the question with were about demand, not prefetch. Thank you. – Some Name May 17 '20 at 03:05
  • @SomeName: Note that on paper WB-cacheable just means that reads and writes will be cached *somewhere* in the cache hierarchy. For any sane design, that will include L1d in anticipation of other reads of the same line, but it doesn't have to include L2 right away. Again, a sane design that has an L2 will be able to make use of it for a read-only workload, which means that reads have to be able to eventually populate L2, but that could be via eviction of clean lines from L1d into an L2 victim cache. – Peter Cordes May 17 '20 at 03:11
  • @SomeName: I'm not sure a demand RFO would want to allocate in L2. The line is about to be dirtied into Modified state, so having the original in L2 isn't useful. Since L2 is NINE (not inclusive, not exclusive) on Intel CPUs, they'd certainly be allowed not to allocate, and just delay doing anything until eventual eviction / write-back or whatever. – Peter Cordes May 17 '20 at 03:20
  • 1
    @SomeName: Also, updated: AMD K10 definitely skips L2 because it has an L2 exclusive of L1d. – Peter Cordes May 17 '20 at 03:21
  • In this sense, caching data only in the LLC and missing L1d and L2 every time the data is required, for example, would not contradict the definition of the WB-cacheable memory type. But such a design seems useless. – Some Name May 17 '20 at 03:31
  • 1
    @SomeName: Yeah like I said, any sane design will definitely allocate in L1d on a load or store, regardless of how outer levels of cache work. Only [`prefetcht2` or `prefetcht1`](https://www.felixcloutier.com/x86/prefetchh) or something like that would bring data into only L3 or L2+L3 but not L1d. – Peter Cordes May 17 '20 at 03:34
  • 1
    @SomeName: Cheers. Added a new TL:DR line at the top and linked Wikipedia's cache inclusion policy page, which is very relevant to the general idea you're wondering about here. – Peter Cordes May 17 '20 at 04:18