
I'm trying to analyze the network performance boost that VMs get when they use hugepages. For this I configured the hypervisor to have several 1G hugepages (36) by changing the grub command line and rebooting, and when launching the VMs I made sure the hugepages were being passed on to them (a sketch of the setup is at the end of the question). On launching 8 VMs (each with 2 x 1G hugepages) and running network throughput tests between them, I found that the throughput was drastically lower than when running without hugepages.

That got me wondering whether it had something to do with the number of hugepages I was using. Is there a limit on the number of 1G hugepages that can be referenced using the TLB, and if so, is it lower than the limit for regular-sized pages? How can I find this information? In this scenario I was using an Ivy Bridge system, and using the cpuid command I saw something like

cache and TLB information (2):
  0x63: data TLB: 1G pages, 4-way, 4 entries
  0x03: data TLB: 4K pages, 4-way, 64 entries
  0x76: instruction TLB: 2M/4M pages, fully, 8 entries
  0xff: cache data is in CPUID 4
  0xb5: instruction TLB: 4K, 8-way, 64 entries
  0xf0: 64 byte prefetching
  0xc1: L2 TLB: 4K/2M pages, 8-way, 1024 entries

Does it mean I can have only 4 1G hugepage mappings in the TLB at any time?
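
For reference, the setup was along these lines (the mount point and exact QEMU flags here are reconstructed from memory, so treat them as illustrative):

    # /etc/default/grub -> reserve 36 x 1G hugepages at boot, then update grub and reboot
    GRUB_CMDLINE_LINUX="default_hugepagesz=1G hugepagesz=1G hugepages=36"

    # hugetlbfs mount for the 1G pool (path illustrative)
    mount -t hugetlbfs -o pagesize=1G none /dev/hugepages-1G

    # backing each VM's 2G of RAM with the 1G pages (flags illustrative)
    qemu-system-x86_64 -m 2G -mem-prealloc -mem-path /dev/hugepages-1G ...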

  • Welcome to Stack Overflow. While your question is set within the scenario of virtualization and involves different CPUs, it is substantially answered by this question: https://stackoverflow.com/questions/40649655/how-is-the-size-of-tlb-in-intels-sandy-bridge-cpu-determined. Effectively, yes, the processor's TLB has dedicated space for the different types of entries, with very limited space for huge pages. – Brian Nov 07 '18 at 20:29
  • Yes, you've found a way to create very poor hugepage locality. Most workloads that do a lot of kernel access to memory have more accesses within the same 1G hugepage. (User-space memory on Linux usually uses 2M hugepages, when it uses anonymous hugepages at all.) In Haswell for example, 2M and 4K TLB entries can go into the 2nd-level TLB victim cache, but apparently 1G entries can't, if https://www.7-cpu.com/cpu/Haswell.html is fully accurate. – Peter Cordes Nov 08 '18 at 08:57

1 Answer


Yes, of course. Having no upper limit on the number of TLB entries would require an unbounded amount of physical space in the CPU die.

Every TLB in every architecture has an upper limit on the number of entries it can hold.

For 1 GiB pages on your x86 CPU this number is lower than you might expect: it is 4.
It was 4 in your Ivy Bridge, and it is still 4 in my Kaby Lake, four generations later.
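
To see the limit in action, here is a minimal sketch (mine, not benchmarked on Ivy Bridge; it assumes the kernel has at least n free 1 GiB hugepages reserved via nr_hugepages) that cycles one load through n distinct 1 GiB pages. The data-cache footprint stays tiny, so with n <= 4 the loads can hit the 4-entry 1 GiB dTLB, while with n > 4 nearly every load should force a page walk:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <time.h>

    #ifndef MAP_HUGE_SHIFT
    #define MAP_HUGE_SHIFT 26
    #endif
    #ifndef MAP_HUGE_1GB
    #define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)  /* log2(1 GiB) = 30 */
    #endif

    #define GIB (1UL << 30)

    int main(int argc, char **argv)
    {
        unsigned long n = (argc > 1) ? strtoul(argv[1], NULL, 0) : 8;
        unsigned long iters = 100000000UL;

        /* Anonymous mapping explicitly backed by 1 GiB hugepages. */
        char *buf = mmap(NULL, n * GIB, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                         -1, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }

        volatile char sink = 0;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        /* One load per iteration, round-robin over the n pages; only a few
           cache lines are ever touched, so TLB behaviour dominates timing. */
        for (unsigned long i = 0; i < iters; i++)
            sink += buf[(i % n) * GIB + (i % 64) * 64];
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("%lu page(s): %.2f s\n", n,
               (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
        return 0;
    }

Comparing the timing for n = 4 against n = 8 should show the cliff directly (on CPUs whose 2nd-level TLB also caches 1 GiB entries, the cliff moves accordingly).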

It's worth noting that 4 entries cover 4 GiB of RAM (4 x 1 GiB), which seems enough to handle networking if used properly.
Finally, TLBs are per-core resources: each core has its own set of TLBs.
If you disable SMT (e.g. Intel Hyper-Threading) or assign both threads of a core to the same VM, the VMs won't be competing for the TLB entries.
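
With libvirt, for instance, that kind of pinning looks something like this (the vCPU numbers and host CPU ids are illustrative; 2 and 6 would be SMT siblings on this particular host):

    <cputune>
      <vcpupin vcpu='0' cpuset='2'/>
      <vcpupin vcpu='1' cpuset='6'/>
    </cputune>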

However, each VM can have at most 4·C 1 GiB hugepage entries cached, where C is the number of cores dedicated to that VM. In your test, for example, the 8 VMs together map 16 distinct 1 GiB pages; if their threads share cores, those translations compete for just 4 entries per core.
The ability of the VM to fully exploit these entries depends on how the host OS, the hypervisor and the guest OS work together, and on the memory layout of the guest application of interest (pages shared across cores have duplicated TLB entries in each core).
It's hard (almost impossible?) to use 1 GiB pages transparently; I'm not sure how the hypervisor and the VM are going to use those pages - I'd say you need specific support for that, but I'm not sure.

As Peter Cordes noted, 1 GiB pages use a single, dedicated TLB level on these CPUs (and in Skylake, apparently, there is also a second-level TLB with 16 entries for 1 GiB pages). A miss in the 1 GiB TLB results in a full page walk, so it's very important that all the software involved keep its hot mappings within those few entries.
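
A quick way to check whether 1 GiB TLB misses are actually hurting is to count page walks with perf; the event names below are the Ivy Bridge/Haswell-era ones, so verify them with perf list on your machine (./your_benchmark is a placeholder):

    perf stat -e dtlb_load_misses.miss_causes_a_walk,dtlb_load_misses.walk_duration ./your_benchmark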

– Margaret Bloom
  • Worth mentioning that at least according to https://www.7-cpu.com/cpu/Haswell.html, the 2nd level TLB victim cache doesn't hold 1G TLB entries in Haswell, so if you have misses they have to come from the page-walker. But Skylake has a 16-entry 2nd-level TLB for 1G pages to back up the 4-entry 1st level TLB. https://www.7-cpu.com/cpu/Skylake.html. – Peter Cordes Nov 08 '18 at 11:08
  • Thanks @PeterCordes, that's nice to know and have in the answer. – Margaret Bloom Nov 08 '18 at 11:19
  • @PeterCordes: FYI, according to this Intel document (https://www.intel.com/content/dam/develop/external/us/en/documents/run-perf-opt-bp-large-code-pages-q1update.pdf), the 16-entry 2nd-level TLB cache for 1G pages was added in the Broadwell generation (see the top of page 4). – Jason R Jan 16 '23 at 19:24