
From my understanding, the constructs that give rise to the high-level concept of "cache locality" are the following:

  1. Translation Lookaside Buffer (TLB) for virtual memory translation. Accessing virtual memory within the same 4096-byte-aligned block (page size) will prevent the OS from needing to descend the hierarchical page table for translation.

  2. Cache lines mean that accessing virtual memory within the same 64-byte-aligned block (cache line size) will prevent the CPU from needing to fetch from RAM for the access.

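For concreteness, here's the arithmetic I have in mind (a minimal sketch, assuming the typical 4096-byte page and 64-byte line; the example address is arbitrary):

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* Arbitrary example address; 4096 and 64 are typical page and
     * line sizes, but not universal. */
    uint64_t a = 0x7ffd1234abcd;
    uint64_t b = a + 48;   /* 48 bytes later */

    printf("same page: %d\n", (int)(a / 4096 == b / 4096));
    printf("same line: %d\n", (int)(a / 64   == b / 64));
    return 0;
}
```
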
I have a few questions:

  1. I have never once seen a quantitative estimate of the cost of the typical page-table descent. Is it actually significant as measured in clock cycles?

  2. I believe the 64 byte cache line refers to L1 cache lines - do L2 / L3 have different sizes? Under what circumstances is memory loaded into L2 / L3?

  3. Are there any additional constructs which give rise to "cache locality" aside from cache lines and the TLB?

gmaggiol

2 Answers


There are a great many additional performance-related features of a cache memory hierarchy that relate to the general topic of "cache locality". In a presentation back in 2007, I came up with 25 different kinds of locality that may need to be considered to understand the performance of a memory-access-bound application! There are two versions of the presentation at http://dx.doi.org/10.13140/RG.2.2.12967.27048 -- with and without speaker notes. Slide 7 provides an (incomplete) list of "memory hierarchy locality domains". This includes locality related to address translation mechanisms, locality related to cache accesses, locality related to DRAM access, and a few other topics.

John D McCalpin

1. On most ISAs (including x86, which I suspect you're thinking of), hardware walks the page tables on a TLB miss, not the OS. The OS just puts the data structures in memory and gives the CPU the physical address of the top-level page directory; see [What happens after a L2 TLB miss?](https://stackoverflow.com/q/32256250). Thus, a page walk can be done speculatively ahead of actually needing the TLB entry, ideally hiding most of the latency.
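
To make the "walk" concrete, here's a sketch of how a 4-level x86-64 walk decomposes a virtual address (the field widths are architectural; what the walker caches internally is up to the microarchitecture):

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch: how a 4-level x86-64 page walk slices up a virtual address.
 * Each 9-bit index selects one 8-byte entry in a 4 KiB table, so a
 * walk from scratch is four dependent memory reads. */
int main(void) {
    uint64_t va = 0x00007f1234567890ULL;

    unsigned pml4 = (va >> 39) & 0x1ff;   /* top-level (PML4) index  */
    unsigned pdpt = (va >> 30) & 0x1ff;   /* PDPT index (1 GiB span) */
    unsigned pd   = (va >> 21) & 0x1ff;   /* PD index   (2 MiB span) */
    unsigned pt   = (va >> 12) & 0x1ff;   /* PT index   (4 KiB page) */
    unsigned off  =  va        & 0xfff;   /* byte offset within page */

    printf("PML4=%u PDPT=%u PD=%u PT=%u offset=0x%x\n",
           pml4, pdpt, pd, pt, off);
    return 0;
}
```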

The actual latency for a load that suffers a TLB miss (but an L1d hit for the data) would tell you something about page-walk latency on whatever microarchitecture you're measuring. I don't have a number in mind for Skylake or anything; the practical cost would also depend on how much caching of the higher levels of the page table is done inside the page-walk hardware. (So that's another source of locality: a page walk within the same 1GiB as another recent page walk could be faster, even without using a 1G or 2M huge/large page so that one TLB entry can cover more address space.)
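
If you want a rough number on your own machine, a pointer-chase with one node per page is the usual trick. A minimal sketch (not a rigorous benchmark: no warm-up or pinning, a biased shuffle, and each hop also pays a data-cache miss, so compare against a chase confined to a few pages to isolate the walk cost):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NPAGES 65536   /* 256 MiB of 4 KiB pages: beyond typical TLB reach */

/* Pointer-chase with one node per page, visited in random order so
 * neither the data prefetchers nor sequential page walks help. */
int main(void) {
    size_t stride = 4096 / sizeof(void *);
    void **buf = malloc((size_t)NPAGES * 4096);
    size_t *order = malloc(NPAGES * sizeof *order);
    if (!buf || !order) return 1;

    for (size_t i = 0; i < NPAGES; i++) order[i] = i;
    srand(1);
    for (size_t i = NPAGES - 1; i > 0; i--) {   /* shuffle page order */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i < NPAGES; i++)         /* link the cycle */
        buf[order[i] * stride] = &buf[order[(i + 1) % NPAGES] * stride];

    struct timespec t0, t1;
    void **p = &buf[order[0] * stride];
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < 4L * NPAGES; i++)
        p = *p;                                 /* serialized dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.2f ns/load (p=%p keeps the loop alive)\n",
           ns / (4.0 * NPAGES), (void *)p);
    return 0;
}
```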

2. Some microarchitectures use larger lines for L2 or L3, but most don't. All modern x86 CPUs use 64B lines everywhere. (But L2 spatial prefetchers on Intel at least try to complete a 128-byte aligned pair of lines.) See also *Line size of L1 and L2 caches*.
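
You don't have to assume 64B, either; on Linux with glibc you can query it at runtime (the `_SC_LEVEL*_LINESIZE` names are a glibc extension, and `sysconf` reports 0 or -1 where the value is unknown):

```c
#include <stdio.h>
#include <unistd.h>

/* Query cache line sizes at runtime.  On current x86 CPUs all three
 * levels report 64. */
int main(void) {
    printf("L1d line: %ld\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    printf("L2  line: %ld\n", sysconf(_SC_LEVEL2_CACHE_LINESIZE));
    printf("L3  line: %ld\n", sysconf(_SC_LEVEL3_CACHE_LINESIZE));
    return 0;
}
```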

> Under what circumstances is memory loaded into L2 / L3?

See also *Is L2 line fill always triggered on lookup?*

It depends on the cache inclusion policy: e.g. an exclusive outer cache won't have a copy of something just loaded into L1d, and neither will a [victim cache](https://en.wikipedia.org/wiki/Victim_cache) (although large L3 victim caches are not fully associative). In the x86 world, Intel hasn't normally used victim caches (*Which cache mapping technique is used in intel core i7 processor?*), but AMD has in some microarchitectures (e.g. Bulldozer-family). POWER has also used L3 victim caches.

> Are there any additional constructs which give rise to "cache locality" aside from cache lines and the TLB?

Yes: DRAM "pages" (the row size) mean that multiple cache misses within one page can let the DRAM controller avoid selecting a different row and just read another column from an already-open row. Changing rows increases DRAM latency beyond the normal cost.
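
As a model (illustrative only: real controllers interleave *physical* addresses across channels, ranks, and banks, and the 8 KiB row size here is a made-up value):

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical: whether two physical addresses land in the same DRAM
 * row, assuming an 8 KiB row and no interleaving.  Real address
 * mappings are controller-specific, so treat this only as a model of
 * why nearby misses are cheaper. */
static int same_dram_row(uint64_t pa, uint64_t pb) {
    const uint64_t ROW = 8192;   /* hypothetical row size */
    return pa / ROW == pb / ROW;
}

int main(void) {
    printf("%d %d\n", same_dram_row(0x1000, 0x1fff),
                      same_dram_row(0x1000, 0x3000));
    return 0;
}
```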

*What Every Programmer Should Know About Memory?* covers DRAM, plus lots of stuff about caches and optimizing for cache locality, and is still highly relevant.

Also, as mentioned above, page walks for nearby pages may be somewhat faster.

A large / hugepage (e.g. 2MiB on x86-64) lets one TLB entry cover the whole 2 MiB.
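
On Linux you can ask for that explicitly; a sketch (MAP_HUGETLB needs hugepages reserved in advance, e.g. via /proc/sys/vm/nr_hugepages, so this falls back to hinting transparent hugepages with madvise):

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 2 * 1024 * 1024;
    /* Try an explicitly hugepage-backed anonymous mapping first. */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        /* Fall back to a normal mapping and *hint* transparent
         * hugepages; the kernel may or may not honor it. */
        p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        madvise(p, len, MADV_HUGEPAGE);
    }
    printf("mapped at %p\n", p);
    munmap(p, len);
    return 0;
}
```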

Sequential reads (or writes) of consecutive cache lines trigger HW prefetchers to pull those lines into L2 (and even L1d) ahead of demand access, reducing miss latency. (Or avoiding misses altogether if the loop takes long enough on ALU work that HW prefetch can keep up with its accesses.)
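
A quick way to see the prefetchers at work (a sketch; the exact ratio varies a lot by CPU, compiler, and flags) is to sum one `long` per line over the same set of lines, sequentially vs. in shuffled order:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define LINES (1 << 22)   /* 4 Mi cache lines = 256 MiB, well beyond L3 */

/* Read one long per 64-byte line in the order given by idx.  Same
 * lines, same work; the sequential order lets HW prefetch stay ahead
 * of demand loads, the shuffled order defeats it. */
static double timed_sum(const long *buf, const size_t *idx, long *sink) {
    struct timespec t0, t1;
    long sum = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < LINES; i++)
        sum += buf[idx[i] * 8];   /* 8 longs = one 64B line apart */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    *sink += sum;                 /* keep the loads from being optimized out */
    return (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
}

int main(void) {
    long *buf = malloc((size_t)LINES * 8 * sizeof(long));
    size_t *idx = malloc((size_t)LINES * sizeof *idx);
    if (!buf || !idx) return 1;
    for (size_t i = 0; i < (size_t)LINES * 8; i++) buf[i] = 1; /* fault pages in */
    for (size_t i = 0; i < LINES; i++) idx[i] = i;

    long sink = 0;
    double seq = timed_sum(buf, idx, &sink);

    srand(1);                               /* shuffle the visit order */
    for (size_t i = LINES - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1), t = idx[i];
        idx[i] = idx[j]; idx[j] = t;
    }
    double rnd = timed_sum(buf, idx, &sink);

    printf("sequential: %.2f ns/line, shuffled: %.2f ns/line (sink=%ld)\n",
           seq / LINES, rnd / LINES, sink);
    return 0;
}
```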

Peter Cordes
  • This is an amazing response filled with information and links, thank you so much! Will definitely parse through all of them :) This is a very high level follow up, but I’ve only taken a shallow undergraduate course in OS, and learned most other stuff through Googling. Would taking an OS or computer architecture course cover these concepts better? – gmaggiol Feb 24 '21 at 08:59
  • @gmaggiol: Yeah, an undergrad computer architecture course will probably cover some of the concepts, including cache inclusion policy for multi-level caches. See also [Modern Microprocessors A 90-Minute Guide!](http://www.lighterra.com/papers/modernmicroprocessors/) for a good intro to the CPU-pipeline side of computer architecture. – Peter Cordes Feb 24 '21 at 09:16
  • Deeper specific examples of how some actual modern CPUs work probably *wouldn't* be covered in most courses, because they're too complex to fully understand all the details; e.g. the https://www.realworldtech.com/sandy-bridge/ deep-dive is still somewhat high level. (But very much worth reading to help give an idea of how the puzzle pieces fit together into a big picture. Especially useful is how David Kanter compares and contrasts Sandybridge against AMD Bulldozer and against earlier Intel (Nehalem); he's done other articles on Core 2, Bulldozer, K10, etc.) – Peter Cordes Feb 24 '21 at 09:19
  • re:"a page-walk within the same 1GiB as another recent page walk could be faster," Is this because the PTE entries are more likely to be in L2? If not why? – Noah Feb 25 '21 at 09:26
  • 1
    @Noah: Because the page-walk hardware can (and I think does) cache higher-level PDEs, so misses in the same sub-tree may be able to shortcut some steps which would otherwise have to wait for the load-use latency of fetching the top-level entry and seeing where it points, etc. Clearing these is part of what `invlpg` has to do. e.g. [Andy Glew (who worked on P6) wrote](https://electronics.stackexchange.com/a/67985/65594) *Another related issue is that interior nodes of the page tables may be cached in still more TLB-like datastructures, e.g. the PDE-cache.* – Peter Cordes Feb 25 '21 at 09:31
  • @PeterCordes where is this cached? I am aware of the PTE, which I believe is populated via memory reads, and the TLB, which has a separate purpose. (Scratch that comment on TLB coherency if you were responding to it; realizing it's only the OS that flushes, which makes sense) – Noah Feb 25 '21 at 09:33
  • @Noah: In the dedicated page walk hardware, like I said; I don't know any more than that. A page-walker is dedicated hardware that responds to requests to read the page tables. It's a black box and can do whatever it wants, within limits of when Intel's / AMD's manuals say you need to use `invlpg`. e.g. fetch data from memory through L1d cache or not, or keep its own cache of what it read recently, even from adjacent PTEs or PDEs. – Peter Cordes Feb 25 '21 at 09:38
  • I see (did you edit your last comment? Didn't see the second half till just now! Sorry!). Are you saying there is probably some TLB-like structure that will cache dereferences of higher-level PDE entries (with faster access than L2 cache)? Seems difficult to pull off, esp. for higher-level entries, because it would require loading so many indexes. – Noah Feb 25 '21 at 09:40
  • @Noah: Andy Glew is saying that, and he worked at Intel on Pentium Pro. I'm just repeating that from him; IDK if I've seen any other source for it. Andy might just be talking in general terms about what CPUs (and CPU designers) are allowed to do in theory. But it sounds sensible: many page-walks have locality. But anyway, IDK why you're mentioning L2 cache; [What happens after a L2 TLB miss?](https://stackoverflow.com/q/32256250) mentions that Intel CPUs read page tables through L1d cache. Are you thinking of some AMD CPUs where page-walk bypasses L1d, only going through L2? – Peter Cordes Feb 25 '21 at 09:47
  • re:"a page-walk within the same 1GiB as another recent page walk could be faster," Yikes! I misread your other post and thought it was L2 for everything! One more thing if you have the time; Regarding the radix tree layout of the PTE. When a new PDE is added does it load all 512 possible destinations (size being < 2^48 just by fact that all unique PDE for a level are not loaded) or is the number of indexes are the next level variable? – Noah Feb 25 '21 at 10:06
  • @Noah: "it" being the page-walker? I'd assume a sane design would only read the one cache line containing the 8-byte entry it needs, and only maybe look at other valid entries in the same line (or 32-byte or 16-byte fetch block from L1d cache, depending on how wide a read port it uses, although maybe doing multiple reads of the line). It would make no sense for it to read the whole 4KiB containing that level of pointers or PTEs, usually just a lot of wasted memory traffic when the CPU may be stalling on this page walk. ([diagram of the radix tree](https://stackoverflow.com/q/46509152)) – Peter Cordes Feb 25 '21 at 10:15
  • 2
    @Noah The paging structure caches existed since the Core microarchitecture. Their architectural impact is discussed in the SDM V3 Section 4.10.2. NHM introduced EPT and there is a cache for EPT entries as well. Intel usually doesn't talk about these caches beyond what's in SDM V3, but the the presence of these caches can be detected by measuring page walk latencies since page walk is triggered when the lookup misses in last level TLB. – Hadi Brais Feb 25 '21 at 18:15
  • @PeterCordes The L3 in Bulldozer isn't truly exclusive despite being described as a victim cache by AMD, but you're right that it gets bypassed on a cacheable L3 miss. – Hadi Brais Feb 25 '21 at 18:20
  • @PeterCordes I see. So then how does a given PTE entry resize? I.e., let's say a page fault on A creates a new mapping with a new A[47:39]. The tables of levels 3, 2, and 1 pointed to from level-4 index A[47:39] only contain C pointers s.t. C < 512 (you're saying 1 <= C <= 8, probably). If we need to add a mapping for B[47:39] s.t. A[47:39] == B[47:39] (and say the level 3 they point to contains C pointers), will it require a reallocation? Does that also mean that at a given level the bits are not used to index but instead match via a scan (i.e. it's unknown if B[38:30] exists where B[47:39] points)? – Noah Feb 25 '21 at 20:58
  • @PeterCordes I think that it is NOT variable size and a full page is loaded for any PF. [Every paging structure is 4096 Bytes in size and comprises a number of individual entries](https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf#page=110) – Noah Feb 25 '21 at 22:07
  • @HadiBrais there is some pretty explicit information on the paging-structure caches in [SDM V3 Section 4.10.3](https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf#page=142). – Noah Feb 25 '21 at 22:26
  • 1
    @Noah: I never claimed anything was variable-size, except the amount of data the CPU might choose to cache during a single page-walk. No idea what you're talking about, of course architecturally everything in the page-tables has a fixed-size fixed-meaning. When we're talking about caching adjacent reads, that would be using a *separate* entry in whatever cache (TLB, or internal page-walker caches of higher levels.) – Peter Cordes Feb 25 '21 at 23:30
  • @PeterCordes I obviously didn't make my question clear. Sorry! – Noah Feb 25 '21 at 23:36
  • 1
    @Noah: more confirmation on the existence of internal page-walker caches in other non-x86 CPUs, e.g. ARM: [What is PDE cache?](https://stackoverflow.com/q/26945448). Happened to find that while looking for something and, and remembered this recent conversation where it had come up. – Peter Cordes Mar 12 '21 at 09:48