Page-Structure Cache perf events

Question

I had always assumed that when linear-address translation encounters a TLB miss, the processor walks the entire paging-structure hierarchy in memory. However, the Intel manual (Vol. 3, Section 4.10.3) defines so-called paging-structure caches, which I had not heard of before.

This is what happens on a TLB miss:

If the processor does not find a relevant TLB entry or PDE-cache entry, it may use the upper bits of the linear address (for 4-level paging, bits 47:30; for 5-level paging, bits 56:30) to select an entry from the PDPTE cache that is associated with the current PCID. It can then use that entry to complete the translation process (locating a PDE, etc.) as if it had traversed the PDPTE, the PML4E, and (for 5-level paging) the PML5E corresponding to the PDPTE-cache entry

and

If the processor does not find a relevant TLB entry, PDE-cache entry, or PDPTE-cache entry, it may use the upper bits of the linear address (for 4-level paging, bits 47:39; for 5-level paging, bits 56:39) to select an entry from the PML4E cache that is associated with the current PCID. It can then use that entry to complete the translation process (locating a PDPTE, etc.) as if it had traversed the corresponding PML4E.

So a TLB miss does not necessarily mean traversing the whole paging-structure hierarchy.
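To make the quoted bit ranges concrete, here is a minimal sketch for 4-level paging that extracts the linear-address fields used to select paging-structure cache entries. The 47:30 and 47:39 ranges come from the quoted text; the 47:21 range for the PDE cache is from the same section of the manual. The function name `paging_cache_tags` is made up for illustration, and this models only the indexing, not how any real CPU sizes or organizes these caches.

```python
def paging_cache_tags(linear_address: int) -> dict:
    """Linear-address bit fields that select paging-structure cache entries
    under 4-level paging (illustrative model only)."""
    def bits(hi: int, lo: int) -> int:
        # Extract bits hi:lo (inclusive) of the linear address.
        return (linear_address >> lo) & ((1 << (hi - lo + 1)) - 1)

    return {
        "pml4e_cache": bits(47, 39),  # 9-bit value
        "pdpte_cache": bits(47, 30),  # 18-bit value
        "pde_cache": bits(47, 21),    # 27-bit value
    }


a = 0x7F12_3456_7000       # arbitrary example address
b = a + 0x1000             # next 4 KiB page
c = a + (1 << 30)          # 1 GiB further on

# a and b share all three tags, so a TLB miss on b can reuse the PDE-cache entry.
# a and c share only the PML4E tag, so a walk for c restarts from the PML4E level.
```

Two addresses with the same `pde_cache` value lie in the same aligned 2 MiB region, which is why a hit there leaves only the final page-table entry to fetch.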

Could you give some examples of perf events that describe paging-structure cache accesses, and explain how to optimize for paging-structure cache usage?

My recollection is that neither AMD nor Intel have provided any significant details about the implementation of their page-structure caches. Even if one reverse-engineers implementation details for one processor model, the implied optimizations for page-structure cache usage may not apply to other models. The good news is that the existing implementations handle page table accesses extremely efficiently (especially if large pages are used). On a Xeon Platinum 8380 (Ice Lake Xeon), random accesses in tables up to ~160 GiB will find most of their PTEs in the L2 cache (using 2MiB pages). — John D McCalpin, Oct 25 '22 at 15:37

score 1 · Accepted Answer · answered Oct 25 '22 at 13:38

AFAIK, Skylake doesn't have any perf events for the details of page walks. There are counters for number of walks completed, and number of cycles with walks active, so I guess you could try to average how long each walk took.

(Skylake and later have two page-miss handlers (PMH), so dtlb_load_misses.walk_pending counts 1 or 2 per cycle depending on how many are active, or 0 if neither is. It might only count walks for demand-load TLB misses, not next-page TLB prefetch. There are similar events for stores and code fetch. Other events such as dtlb_load_misses.walk_active count cycles when one or both page walkers are active.)
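As a rough sketch of the averaging idea: since walk_pending adds up the number of outstanding walks every cycle, dividing it by the number of completed walks gives an approximate average walk duration. The helper names below are hypothetical, the input numbers are invented purely for illustration, and the result inherits whatever the counters do or don't include (e.g. prefetch-triggered walks).

```python
def average_walk_cycles(walk_pending: int, walk_completed: int) -> float:
    """Approximate cycles per page walk.

    walk_pending   : sum over cycles of walks in flight (0, 1 or 2 per cycle)
    walk_completed : number of walks that finished
    """
    if walk_completed == 0:
        return 0.0
    return walk_pending / walk_completed


def walker_busy_fraction(walk_active: int, cycles: int) -> float:
    """Fraction of cycles in which at least one page walker was active."""
    if cycles == 0:
        return 0.0
    return walk_active / cycles


# Invented counter values, e.g. as one might read them from `perf stat` output.
pending, completed, active, cycles = 3_000_000, 100_000, 2_400_000, 50_000_000

avg = average_walk_cycles(pending, completed)    # 30.0 cycles per walk
busy = walker_busy_fraction(active, cycles)      # 0.048 of all cycles
```

A low average suggests walks are mostly satisfied from the paging-structure caches or nearby data caches; it can't tell you which of those it was.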

The main way to take advantage of the page walkers caching higher levels of the page table (and/or L2 / L1d also caching those physical locations) is to have locality on a larger scale: for example, keep the hot pages in your working set within the same aligned 2M or 1G regions, so they all share a common upper part of the radix tree (page tables).

Or within a few groups; you don't need to try to get malloc / mmap to allocate next to your code or the stack.
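One way to reason about this kind of locality is to count how many distinct aligned 2 MiB and 1 GiB regions your hot addresses fall into. This is only a sketch: `regions_touched` is a made-up helper and the address lists are synthetic, not measurements from a real process.

```python
MIB = 1 << 20
GIB = 1 << 30


def regions_touched(addresses, region_size: int) -> int:
    """Number of distinct aligned regions of region_size bytes that contain
    at least one of the given addresses."""
    return len({addr // region_size for addr in addresses})


# 64 hot 4 KiB pages packed together vs. the same number spread 1 GiB apart.
clustered = [0x4000_0000 + i * 4096 for i in range(64)]
scattered = [0x4000_0000 + i * GIB for i in range(64)]

clustered_2m = regions_touched(clustered, 2 * MIB)   # 1
scattered_2m = regions_touched(scattered, 2 * MIB)   # 64
clustered_1g = regions_touched(clustered, GIB)       # 1
scattered_1g = regions_touched(scattered, GIB)       # 64
```

Fewer distinct regions means fewer distinct upper-level entries for the page walker to keep cached.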

That's mostly up to your OS, unless you do one big allocation and carve it up yourself.
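A minimal sketch of the "one big allocation, carve it up yourself" idea, using an anonymous mapping and a bump pointer. The `Arena` class is hypothetical, and in Python it only illustrates the bookkeeping; the same pattern in C or C++ would hand out raw pointers into the mapping. Where the OS places the mapping, and whether it is backed by large pages, is still up to the OS.

```python
import mmap


class Arena:
    """Bump allocator over a single anonymous mapping (illustrative only)."""

    def __init__(self, size: int):
        self._buf = mmap.mmap(-1, size)      # one contiguous virtual range
        self._view = memoryview(self._buf)
        self._size = size
        self.used = 0                        # bytes handed out so far, incl. padding

    def alloc(self, nbytes: int, align: int = 64) -> memoryview:
        """Return a writable view of nbytes; align must be a power of two."""
        start = (self.used + align - 1) & ~(align - 1)
        end = start + nbytes
        if end > self._size:
            raise MemoryError("arena exhausted")
        self.used = end
        return self._view[start:end]


arena = Arena(4 << 20)        # 4 MiB, carved into small objects
first = arena.alloc(100)
second = arena.alloc(100)     # starts at offset 128, right after `first`
first[0] = 0xAB               # views write through to the shared mapping
```

Because every object comes from the same contiguous range, hot objects end up sharing upper-level page-table entries instead of being scattered across the address space.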

Static code/data (at least in a non-PIE Linux executable) starts at absolute address 4MiB by default, which is at the start of a 2M largepage, and near the start of a 1G hugepage. And very near the start of the 2^9 G (512 GiB) level above that. So even if you have a lot of code + data, it's well-positioned. I assume ASLR for PIE executables is more granular (non-PIE static code/data isn't relocated at all), but static code+data is usually pretty small compared to even a 1G level.
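The alignment claims are easy to check with a bit of arithmetic. This sketch just computes how far the 4 MiB base sits from the start of each enclosing aligned region; the variable names are arbitrary.

```python
MIB = 1 << 20
GIB = 1 << 30

base = 4 * MIB   # 0x400000, the default non-PIE base mentioned above

# Offset of the base address from the start of each enclosing aligned region.
offset_in = {
    "2 MiB": base % (2 * MIB),      # 0: exactly at the start of a 2M region
    "1 GiB": base % GIB,            # 4 MiB into a 1G region
    "512 GiB": base % (512 * GIB),  # 4 MiB into a 512G (one PML4E) region
}

# Room left before code + data would spill into the next 1 GiB region.
headroom_1g = GIB - offset_in["1 GiB"]   # 1020 MiB
```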
