
While browsing I came across a thing called huge pages. The huge pages mechanism makes it possible to map 2M and even 1G pages using entries in the second and third level page tables. As the kernel docs themselves put it:

Usage of huge pages significantly reduces pressure on TLB, improves TLB hit-rate and thus improves overall system performance.

I browsed the kernel source as well, and I didn't see any usage of MAP_HUGETLB when it comes to mmap. In fact, /proc/sys/vm/nr_hugepages is set to 0 by default. Why is that? Does it mean the kernel has no need for huge pages at all? What are some examples of scenarios where huge pages are a must?

For the sake of example:

    /* fd should be -1 for anonymous mappings; the kernel rounds the
       length up to the huge page size when MAP_HUGETLB is used */
    hugepage = mmap(NULL, getpagesize() * 4, PROT_READ | PROT_WRITE,
                    MAP_ANON | MAP_HUGETLB | MAP_PRIVATE, -1, 0);
– noha
  • Normally hugepages in user-space are used via *transparent* hugepages, especially with `madvise(ptr, len, MADV_HUGEPAGE)`. https://www.kernel.org/doc/html/latest/admin-guide/mm/transhuge.html. The kernel itself uses 1G hugepages to direct-map all of physical RAM (https://www.kernel.org/doc/Documentation/x86/x86_64/mm.txt), but the kernel doesn't make `mmap` system calls into itself. – Peter Cordes Jun 06 '22 at 19:03
  • See also [Using mmap and madvise for huge pages](https://stackoverflow.com/q/30470972) for user-space private anonymous mappings. `/proc/meminfo` has stats on in-use `AnonHugePages`. And a correction to my previous comment: it shows `DirectMap4k` and `DirectMap2M` as non-zero, but `DirectMap1G` is zero on my system. So I guess it's only using 2M hugepages (aka largepages) for the direct-map region, not 1G hugepages. – Peter Cordes Jun 06 '22 at 19:06
  • Huge pages get used a lot in things like databases, but for most non-specialized loads they have some severe downsides: they don't play nice with swap (yes, this is still a thing) or demand paging. The kernel can't just create the mapping when the program touches the memory; it has to lock down that much memory _contiguously_ already. That makes managing and defragmenting RAM harder. The best use case is a single user process using them (ex: a database). Multiple processes using them is a problem. – Mgetz Jun 06 '22 at 19:11
  • Notice that `mmap` is a system call provided and implemented in the kernel for application programs. The Linux kernel does not use `mmap`. – Basile Starynkevitch Jun 06 '22 at 19:15
  • @PeterCordes I can see 23G of `DirectMap1G` and 9G of `DirectMap2M` on my (32G RAM) system so it seems like those could be used for the direct mapping of all physical memory as you say. Not sure why only 23 x 1G instead of 32, but still. – Marco Bonelli Jun 07 '22 at 02:21
  • @MarcoBonelli: Ok yeah, I thought Linux normally used 1G hugepages for that. My system also has 32G of physical RAM, x86-64 5.16 (Arch Linux, haven't updated and rebooted for a while). It's being weird with transparent hugepages not working at the moment, either, despite having it set to [madvise] and defrag on defer+madvise, even using a couple attempts including this known-good https://mazzo.li/posts/check-huge-page.html. A couple older processes were using hugepages, but newly-started ones aren't. Anyway, that's probably totally separate, just a recent annoyance. – Peter Cordes Jun 07 '22 at 02:28

2 Answers


The Linux kernel's approach to huge pages is mainly to let system administrators manage them from userspace. This is mostly because, as cool as they might sound, huge pages also have drawbacks: for example, they cannot be swapped to disk. This LWN series on huge pages gives a lot of information on the topic.

By default there are no huge pages reserved, and one can reserve them at boot time through the boot parameters hugepagesz= and hugepages= (specified multiple times for multiple huge page sizes). Huge pages can also be reserved at runtime through /proc/sys/vm/nr_hugepages and /sys/kernel/mm/hugepages/hugepages-*/nr_hugepages. Furthermore, the kernel can "dynamically" allocate surplus huge pages on demand, up to .../nr_overcommit_hugepages beyond the .../nr_hugepages reserve. These numbers are reflected in /proc/meminfo under the various HugePages_XXX stats, which refer to the default huge page size (Hugepagesize).
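For instance, the runtime reservation above boils down to a write to the sysctl file. A minimal sketch (the count of 16 pages is just an illustrative choice, and it needs root):

    /* Minimal sketch: reserve 16 default-size huge pages at runtime,
     * equivalent to `echo 16 > /proc/sys/vm/nr_hugepages` as root. */
    #include <stdio.h>

    int main(void) {
        FILE *f = fopen("/proc/sys/vm/nr_hugepages", "w");
        if (!f) { perror("fopen"); return 1; }
        fprintf(f, "16\n");
        return fclose(f) ? 1 : 0;
    }

Reading the file back afterwards (or checking HugePages_Total in /proc/meminfo) shows how many pages were actually reserved, which can be fewer than requested if physical memory is too fragmented.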

File-backed mappings only support huge pages if the file resides in a hugetlbfs filesystem, and only of the size specified at mount time (mount option pagesize=). The hugeadm command-line tool, among other things, can give info about currently mounted hugetlbfs FSs with --list-all-mounts. One major reason for wanting a hugetlbfs mounted on your system is to enable huge page support in QEMU/libvirt guests.
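To give an idea, a hugetlbfs-backed mapping could look like the sketch below. The /dev/hugepages mount point and the 2M page size are assumptions (that mount is a common systemd default); note that MAP_HUGETLB is not needed here, since the page size comes from the filesystem:

    /* Sketch: map one 2 MiB huge page backed by a file on hugetlbfs.
     * Assumes hugetlbfs is mounted at /dev/hugepages with 2 MiB pages
     * reserved; adjust the path for your system. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define HPAGE_SIZE (2 * 1024 * 1024)

    int main(void) {
        int fd = open("/dev/hugepages/example", O_CREAT | O_RDWR, 0600);
        if (fd < 0) { perror("open"); return 1; }

        void *p = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        ((char *)p)[0] = 42;   /* first touch faults in one 2 MiB page */

        munmap(p, HPAGE_SIZE);
        close(fd);
        unlink("/dev/hugepages/example");
        return 0;
    }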

All of the above covers "voluntary" huge page allocations done with MAP_HUGETLB (or through files on hugetlbfs).


Linux also supports transparent huge pages (THP): the kernel can transparently make normal pages huge (or, vice versa, break existing transparent huge pages back into normal pages) when needed, without MAP_HUGETLB and regardless of nr_hugepages in sysfs.

There are some sysfs knobs to control THPs too, the most notable one being /sys/kernel/mm/transparent_hugepage/enabled: always means that the kernel will try to create THPs even without userspace programs actively suggesting it; madvise means that it will do so only if a userspace program suggests it through madvise(addr, len, MADV_HUGEPAGE); never means THPs are disabled. You'll probably see this set to always by default in modern Linux distros, e.g. recent releases of Debian or Ubuntu.
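As a rough sketch of the madvise route (the sizes here are just illustrative), an ordinary anonymous mapping can be hinted like this:

    /* Sketch: hint that a normal anonymous mapping should be backed by
     * THPs. No MAP_HUGETLB and no reserved huge pages are needed;
     * MADV_HUGEPAGE is only a hint, not a guarantee. */
    #include <stdio.h>
    #include <sys/mman.h>

    #define LEN (32UL * 1024 * 1024)   /* a multiple of 2 MiB */

    int main(void) {
        void *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        if (madvise(p, LEN, MADV_HUGEPAGE))
            perror("madvise");

        /* Fault the memory in; the kernel may now use 2 MiB THPs. */
        for (unsigned long i = 0; i < LEN; i += 4096)
            ((char *)p)[i] = 1;
        return 0;
    }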

As an example, doing mmap(0x123 << 21, 2*1024*1024, 7, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) (where 7 is PROT_READ|PROT_WRITE|PROT_EXEC) with /sys/kernel/mm/transparent_hugepage/enabled set to always should result in a 2M transparent huge page, since the requested mapping is aligned to 2M (notice the absence of MAP_HUGETLB).
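Whether that actually happened can be checked through the AnonHugePages fields of /proc/<pid>/smaps (as also noted in the comments, /proc/meminfo has the system-wide total). A small sketch, not part of the original answer:

    /* Sketch: print how much of each of this process's mappings is
     * currently backed by transparent huge pages. */
    #include <stdio.h>

    int main(void) {
        char line[256];
        unsigned long kb;
        FILE *f = fopen("/proc/self/smaps", "r");
        if (!f) { perror("fopen"); return 1; }
        while (fgets(line, sizeof(line), f))
            if (sscanf(line, "AnonHugePages: %lu kB", &kb) == 1 && kb > 0)
                printf("mapping with %lu kB of THPs\n", kb);
        return fclose(f) ? 1 : 0;
    }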


Does it mean the kernel has no need for huge pages at all? What are some examples of scenarios where huge pages are a must?

In general, you don't really need huge pages of any kind; you can very well live without them. They are just an optimization. Scenarios where they can be useful are, as mentioned by @Mgetz in the comments above, cases where you have a lot of random memory accesses on very large files (common for databases). Minimizing TLB pressure in such cases can result in significant performance improvements.

– Marco Bonelli
  • Another important THP setting is `/sys/kernel/mm/transparent_hugepage/defrag`. A good choice for some general use cases (like a desktop) is, I think, `defer+madvise` or just `madvise`. Definitely not `always`: stopping to defrag physical address space when programs didn't request it with `madvise` is only useful for specific workloads, and harmful for others. (Unless you have `enabled=madvise`, in which case it won't be trying to use THPs for most allocations anyway.) – Peter Cordes Jun 07 '22 at 06:24

One place the kernel uses large pages is when copying the map of kernel pages into the user process map; see pti_clone_kernel_text. It uses PMD-sized entries (2MB) where it can and PTEs (4K) for the rest. For a 10MB kernel, this means the kernel map takes only a small number of entries.

– stark