
I'm trying to understand what "highmem" means, but I've seen it used in two different ways and I want to know whether one or both are correct.

The two definitions I've gleaned are:

  1. Highmem refers to a specific situation on 32-bit systems: the machine could hold more than 4GB of RAM, but 32 bits only allowed the kernel to address 4GB of memory directly, so any memory above 4GB needed Physical Address Extension (PAE) and was called "highmem". When I see this version of high memory discussed, it's usually mentioned that 64-bit systems no longer have this problem; they can address their physical memory fully, so no notion of "highmem" is needed (see 1, 2, 3). My own 64-bit system doesn't show any highmem in /proc/zoneinfo or in free -ml.

  2. Highmem is used to describe the part of the virtual address space that is for user space. This is in contrast with lowmem, the address space that is used for the kernel and is mapped into every user-space program's address space. Another way I've seen this phrased is with the names ZONE_HIGHMEM (highmem) and ZONE_NORMAL (lowmem). For instance, with the 32-bit "3/1" user/kernel memory split, the 3GB used for user space would be considered high memory (see 4, 5, 6, 7).

Is one definition more correct than another?

Are both correct but useful in different situations (i.e. definition 1 referring to physical memory, definition 2 referring to virtual memory)?

wxz
  • For the more-than-4GB physical, if you use PAE at all, you use it for *everything*, not *just* for that memory which a 32-bit CPU / legacy-mode OS can't address any other way. It's a page-table format that has wider PTEs to have room for more physical bits, and room for an exec permission bit separate from read and write. (Hence giving the "NX" feature). In fact, x86-64 [uses the same PTE format as PAE, but with more levels](https://stackoverflow.com/questions/46509152/why-in-x86-64-the-virtual-address-are-4-bits-shorter-than-physical-48-bits-vs) – Peter Cordes Jun 22 '21 at 22:49
  • Related: https://cl4ssic4l.wordpress.com/2011/05/24/linus-torvalds-about-pae/ quotes some Linus Torvalds mails about how much it sucks to have less virtual address space than physical, so you can't have everything mapped at once. (That might also be related to highmem in 32-bit kernels - memory that isn't normally mapped by the kernel? Or not, I forget how that works.) – Peter Cordes Jun 22 '21 at 22:51
  • @PeterCordes, so those two links line up more with definition 1, rather than the more generic, "all user space is high mem" in definition 2. Linus mentioned his disgust for highmem a couple of times in that second link of yours. Interesting. I'm still curious where people got definition 2 from and if it's a valid use of "highmem". – wxz Jun 23 '21 at 14:38
  • Yeah, I'm not familiar with definition 2. I upvoted to see if someone could clear this up. Although you might just be seeing people talk about backing user-space virtual pages with definition-1 highmem, i.e. memory outside the 1GB the kernel can keep permanently mapped. Because that's a good use for it; the kernel only has to access it when the task is the current one (read/write system calls invoke copy_to/from_user with the user-space address, reaching the highmem through the user page table entries), except when swapping out a process that isn't running. – Peter Cordes Jun 23 '21 at 15:00
  • Oh, yeah looking at your links, I think that's all it is. Will post an answer shortly. – Peter Cordes Jun 23 '21 at 15:02
  • @PeterCordes On a related note, why does Linus say "Even before PAE, the practical limit was around 1GB..." Why is the limit 1GB for 32 bit systems and not 4GB (2^32 = 4GB)? (apologies if I should make this a separate question, seems related to my understanding of highmem) – wxz Jun 23 '21 at 15:17
  • @PeterCordes Are you still working on an answer? Thank you in advance – wxz Jun 24 '21 at 16:36
  • On and off :P Eventually got back to it, including pointing out how Linus answered your last question in comments. – Peter Cordes Jun 25 '21 at 04:08

2 Answers


I think your examples of usage 2 are actually (sometimes mangled) descriptions of usage 1, or of its consequences. There is no separate meaning; it's all just things that follow from not having enough kernel virtual address space to keep all of physical memory mapped all the time.

(So with a 3:1 user:kernel split, you only have 1GiB of lowmem, the rest is highmem, even if you don't need to enable PAE paging to see all of it.)

This article, https://cl4ssic4l.wordpress.com/2011/05/24/linus-torvalds-about-pae, quotes a Linus Torvalds rant about how much it sucks to have less virtual address space than physical (which is what PAE gives you), with highmem being the way Linux tries to get some use out of the memory it can't keep mapped.

PAE is a 32-bit x86 extension that switches the CPU to using an alternate page-table format with wider PTEs (the same one adopted by AMD64, including an exec permission bit, and room for up to 52-bit physical addresses, although the initial CPUs to support it only supported 36-bit physical addresses). If you use PAE at all, you use it for all your page tables.
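For illustration, here's a minimal sketch of the wide PTE layout that PAE (and later x86-64) uses. The constant names are made up, but the bit positions (present/write/user in the low bits, NX at bit 63, physical-address bits 12–51) follow the x86 architecture manuals:

```c
/* Sketch only: hypothetical names; bit positions per the x86 manuals. */
#include <stdint.h>

#define PTE_PRESENT   (1ULL << 0)
#define PTE_WRITABLE  (1ULL << 1)
#define PTE_USER      (1ULL << 2)
#define PTE_NX        (1ULL << 63)           /* execute-disable: only exists in the wide (PAE/64-bit) format */
#define PTE_ADDR_MASK 0x000FFFFFFFFFF000ULL  /* physical-address bits 12..51 */

static inline uint64_t make_pte(uint64_t phys_addr, uint64_t flags)
{
    /* A legacy 32-bit PTE has no room for physical bits >= 32 or for NX;
     * the 64-bit entry does, which is the whole point of PAE. */
    return (phys_addr & PTE_ADDR_MASK) | flags;
}
```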


A normal kernel using a high-half-kernel memory layout reserves the upper half of virtual address space for itself, even while user-space is running. To leave user-space more room, 32-bit Linux instead settled on a 3G:1G user:kernel split.

See for example modern x86-64 Linux's virtual memory map (Documentation/x86/x86_64/mm.txt), and note that it includes a 64TB direct mapping of all physical memory (using 1G hugepages), so given a physical address, the kernel can access it by adding that physical address to the start of the direct-mapping region. (kmalloc reserves a range of addresses in this region without actually having to modify the page tables at all, just the bookkeeping.)
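As a rough sketch of what that direct mapping buys you (these are not the real kernel helpers, which are __va()/__pa() plus arch-specific constants, and the base moves around with KASLR): translating between a physical address and its direct-map virtual address is just an addition or subtraction, with no page-table changes.

```c
/* Sketch only: the real helpers are __va()/__pa(); the base shown is the
 * documented x86-64 4-level value with KASLR disabled. */
#include <stdint.h>

#define DIRECT_MAP_BASE 0xffff888000000000ULL   /* start of the all-of-RAM mapping */

static inline void *phys_to_virt_sketch(uint64_t phys)
{
    /* The mapping already exists, so this is pure arithmetic. */
    return (void *)(uintptr_t)(DIRECT_MAP_BASE + phys);
}

static inline uint64_t virt_to_phys_sketch(const void *virt)
{
    return (uint64_t)(uintptr_t)virt - DIRECT_MAP_BASE;
}
```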

The kernel also wants other mappings of the same pages, for vmalloc kernel memory allocations that are virtually contiguous but don't need to be physically contiguous. And of course for the kernel's static code/data, but that's relatively small.

This is the normal/good situation without any highmem, which also applies to 32-bit Linux on systems with significantly less than 1GiB of physical memory. This is why Linus says:

virtual space needs to be bigger than physical space. Not “as big”. Not “smaller”. It needs to be bigger, by a factor of at least two, and that’s quite frankly pushing it, and you’re much better off having a factor of ten or more.

This is why Linus later says "Even before PAE, the practical limit was around 1GB...": with a 3:1 split to leave 3GiB of virtual address space for user-space, only 1GiB of virtual address space is left for the kernel, just enough to map most of that much physical memory. Or, with a 2:2 split, enough to map all of it and still have room for vmalloc stuff.

Hopefully this answer sheds more light on the subject than Linus's amusing "Anybody who doesn’t get that is a moron. End of discussion." (From context, he's actually aiming that insult at the CPU architects who made PAE, not at people learning about OSes, don't worry :P)


So what can the kernel do with highmem? It can use it to hold user-space virtual pages, because the per-process user-space page tables can refer to that memory without a problem.

Many of the times when the kernel has to access that memory are when the task is the current one, using a user pointer. e.g. read/write system calls invoke copy_to/from_user with the user-space address (copying to/from the pagecache for a file read/write), reaching the highmem through the user page table entries.
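To make that concrete, here's a hedged sketch of a driver read() handler; mydev_buf and mydev_len are hypothetical, but copy_to_user() is the real interface, and it reaches the user buffer through the current task's page tables, so the destination page can live in highmem without the kernel ever mapping it itself.

```c
/* Sketch only: hypothetical device buffer, real copy_to_user() interface. */
#include <linux/fs.h>
#include <linux/uaccess.h>

static char mydev_buf[4096];                  /* hypothetical backing data */
static const size_t mydev_len = sizeof(mydev_buf);

static ssize_t mydev_read(struct file *filp, char __user *ubuf,
                          size_t count, loff_t *ppos)
{
    if (*ppos >= mydev_len)
        return 0;                             /* EOF */
    if (count > mydev_len - *ppos)
        count = mydev_len - *ppos;

    /* The user pointer is translated via the *current* task's page tables. */
    if (copy_to_user(ubuf, mydev_buf + *ppos, count))
        return -EFAULT;

    *ppos += count;
    return count;
}
```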

Unless the data isn't hot in the pagecache, in which case the read will block while DMA from disk (or network for NFS or whatever) is queued up. But that just brings file data into the pagecache, and I guess the copy into the user-owned pages happens after a context switch back to the task with the suspended read call.

But what if the kernel wants to swap out some pages from a process that isn't running? DMA works on physical addresses, so it can probably calculate the right physical address, as long as it doesn't need to actually load any of that user-space data.

(But it's usually not that simple, IIRC: DMA devices in 32-bit systems may not support high physical addresses. So the kernel might actually need bounce buffers in lowmem... I concur with Linus: highmem sucked, and using a 64-bit kernel is obviously much better, even if you want to run a pure 32-bit user-space.)

Anything like zswap that compresses pages on the fly, or any driver that does need to copy data using the CPU, would need a virtual mapping of the page it was copying to/from.

Another problem is POSIX async I/O that lets the kernel complete I/O while the process isn't active (and thus its page table isn't in use). Copying from user-space to the pagecache / write buffer can happen right away if there's enough free space, but if not you'd want to let the kernel read pages when convenient. Especially for direct I/O (bypassing pagecache).


Brendan also points out that MMIO (and the VGA aperture) need virtual address space for the kernel to access them; often 128MiB, so your 1GiB of kernel virtual address space is 128MiB of I/O space, 896MiB of lowmem (permanently mapped memory).


The kernel needs lowmem for per-process things, including kernel stacks for every task (aka thread) and the page tables themselves. (The kernel has to be able to read and modify the page tables of any process efficiently.) When Linux moved to 8KiB kernel stacks, that meant it had to find 2 contiguous physical pages for each one (because they're allocated out of the direct-mapped region of address space). Fragmentation of lowmem was apparently a problem for some people unwisely running 32-bit kernels on big servers with tons of threads.
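As a sketch of why that's a fragmentation problem: with 4KiB pages, an 8KiB stack is an order-1 allocation, i.e. two physically contiguous pages carved out of the direct-mapped region. The function names below are made up; __get_free_pages()/free_pages() are the real allocator interface.

```c
/* Sketch only: order-1 allocation = 2^1 contiguous pages = 8 KiB with 4 KiB pages. */
#include <linux/gfp.h>

static unsigned long alloc_stack_sketch(void)
{
    /* Fails if lowmem is too fragmented to provide 2 adjacent free pages. */
    return __get_free_pages(GFP_KERNEL, 1);
}

static void free_stack_sketch(unsigned long addr)
{
    free_pages(addr, 1);
}
```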

Peter Cordes
  • Note that (for a monolithic kernel) kernel space also has to include memory mapped IO for devices; so 1 GiB of kernel space might be "128 MiB for devices, 128 MiB that the kernel needs, and only 512 MiB left for the mapping of physical RAM". When Linus decided to map "all" RAM into kernel space; he was a beginner, computers only had 64 MiB of RAM, the kernel was a temporary stop-gap (until GNU finished Hurd), and nobody cared about security. It wasn't until later (after it was too late to fix because too much other code depended on it) that he started blaming others for his own mistake. – Brendan Jun 25 '21 at 06:25
  • @Brendan: 1024-128 = 896MiB, and IIRC it's normal to have 896MiB of lowmem RAM on a 3:1 split. I hadn't realized that other people considered this a poor design, but I've read that Linux did manage to do more with PAE than most other OSes, or at least than Windows, so maybe not that poor? I don't think anyone would make a serious argument against it being a good thing to have a larger virtual address space than physical. Linus's 10x larger seems excessive, although maybe he's thinking about having plenty for each user-space process to use, as well as the kernel? – Peter Cordes Jun 25 '21 at 06:42
For most cases, physical pages are either free (kernel has no reason to access), already mapped into the current user-space (kernel can access via user-space), or are owned by the kernel (can be mapped into kernel space when allocated, same as user-space). This includes transitions - e.g. "map then access when allocating" or "access then map when freeing". The only case where "map all RAM into kernel space" has a measurable benefit is for copy on write (where kernel needs to access 2 pages at the same time and only one will be already mapped); but that's trivial to solve with... – Brendan Jun 25 '21 at 07:17
  • ... a temporary mapping (ideally, temporarily mapping into a "per CPU" area of kernel space to avoid TLB shootdown), and becomes relatively insignificant. For comparison; "map all RAM into kernel space" is a massive security disaster (especially for monolithic kernels) because any vulnerability means an attacker can access anything in RAM (not just the unimportant stuff in kernel space, but the confidential data in user-space - encryption keys, passwords, etc); has major problems when kernel space isn't big enough; and causes significantly more problems for certain hardware features ... – Brendan Jun 25 '21 at 07:21
  • @Brendan: If you're doing zswap (CPU compression of memory pages), you'd want any task to be able to access dirty pages of *other* tasks, not using their page tables. That sounds like a corner case, but zswap is widely used these days (e.g. some distros set it up out of the box). There's also performance to consider. 1GiB hugepages cover a lot of RAM with few TLB entries so a TLB hit is not rare. But otherwise yeah, if you can DMA into any page, you don't need mappings for most of the pagecache, until/unless someone uses `read` on file data instead of user-space `mmap`... – Peter Cordes Jun 25 '21 at 07:23
... (e.g. the "RAM encryption" feature in recent AMD & Intel CPUs where you can't have the same page mapped as "encrypted" in one place and "unencrypted" in another, and Linux is forced into "either all RAM is encrypted, or no RAM is encrypted" because of that unnecessary extra mapping); and also typically prevents the kernel from using powerful "virtual memory tricks" in kernel space. – Brendan Jun 25 '21 at 07:25
  • @Brendan: I haven't benchmarked `invlpg` and the costs of temporary mappings; maybe it's not much of a problem. Good point about per-CPU areas so the mapping can be just for this core, though. – Peter Cordes Jun 25 '21 at 07:25
  • Hrm - for zswap I'd just have a working buffer in kernel space - "compress from user-space to working buffer" and "decompress from working buffer to user-space". For retrieving pages from swap (because the current process needs them) it'd be fine. For sending pages to swap it might involve either extra virtual address space switches or temporary mapping. – Brendan Jun 25 '21 at 07:37
  • @Brendan: Decompressing some physical page on demand is no problem for writing the result into the process's address space, as you say. But you have to read the compressed data from somewhere, so that page needs to be mapped, too. If you aren't actually cramped for virtual address space, but just choosing not to map everything, then I guess maybe you'd have a region of virtual address space mapped to memory pages you sometimes use as zswap, or something like that. (Or maybe you can do better than re-inventing a partial direct-mapping region; my default thinking is Linux-centric.) – Peter Cordes Jun 25 '21 at 07:45
  • I'd be tempted to suggest that, if the system is under heavy CPU load (and kernel can't use idle time to ensure there's always "enough/some" free physical RAM) and the system is also under heavy memory pressure (needing to do lots of swapping); you're already so far "down Shitz creek without a paddle" that a few extra virtual address space switches and/or temporarily mappings will be the least of your concerns. ;) – Brendan Jun 25 '21 at 07:46
  • D'oh. For "INVLPG of per-CPU temporary mapping", you'd get a few potential cache misses when altering the page tables; then the INVLPG will be fast; then the TLB miss (caused by accessing a recently invalidated TLB entry) will be fast because the data the CPU needs is still in cache. The alternative (likely TLB miss with nothing cached when accessing the page in a "mapping of all RAM") isn't free either. – Brendan Jun 25 '21 at 07:52
  • Thank you so much! I have two follow up questions (let me know if either are better suited as a new post or if there's already a post about it on SO). 1) So any talk of "practical limit is 1GB" in 32 bit systems is arbitrary because the 3:1 split is an arbitrary choice, right? In theory, the true limit for 32 bit would be 4GB. – wxz Jun 25 '21 at 21:38
  • 2) Since the kernel maps the entire physical memory into its address space, are there always at least two ways for the kernel to access a user space page? (i.e. either use the user-space mapping or use the physical address + offset in the kernel address space)? Another way of phrasing it, are there always at least double mappings of the same user-space memory, one in user-space, one in kernel? – wxz Jun 25 '21 at 21:40
  • @wxz: if a user page is in lowmem, then yes, always at least 2 ways for the kernel to access a user-space page. If it was allocated from highmem, the user-space page table might be the only mapping. – Peter Cordes Jun 25 '21 at 22:05
  • @wxz: Re: practical vs. theoretical limits: a 0:4 split is obviously unusable; user-space would not have anywhere it could be mapped. You could plausibly imagine a kernel optimized for running boatloads of tiny user-space processes, leaving only let's say 16MiB of virtual address space for user-space, and the rest for the kernel. Then a kernel designed the way Linux is could have *almost* 4GiB of phys RAM without any of it being highmem. (But you still need some I/O space, and that consumes kernel virt addr space, and a good motherboard won't shadow DRAM with that range of phys addrs.) – Peter Cordes Jun 25 '21 at 22:06
  • @wxz: With a 2:2 split, leaving only 2GiB as the max *virtual* size of a user-space process, including room for any guard pages between mappings, 1GiB of phys RAM would let the kernel map all of it and have lots left over. Even a 3:1 split leaves the kernel with only a small amount of highmem for 1GiB of phys RAM, not a big problem, although it still means everything has to be aware that highmem is possible. (Note that you don't want to cramp user-space: a nearly-full address space makes it slower to find places for an mmap to fit, and fragmentation might mean you can't do a 128MB alloc.) – Peter Cordes Jun 25 '21 at 22:10
  • re: "Unless the data isn't hot in pagecache, then the write will block...". Does this mean with event driven IO you are only reading/writing from items hot in the page cache? Or can they trigger before then? – Noah Jun 27 '21 at 16:38
  • @Noah: Pretty sure I meant to say "read will block", or maybe was thinking about the copy_to_user that would write into user-space. I wasn't considering POSIX async I/O; that's a good example of something that presumably needs to read/write user-space pages while the task isn't `current`. – Peter Cordes Jun 27 '21 at 16:48
  • Semi-related to high-half kernels: Linear Address Masking (LAM 48 or 57)'s control bits to apply it to only-user or only-kernel addresses (https://www.phoronix.com/scan.php?page=news_item&px=Intel-LAM-Linux-Kernel-May-2022) are only useful for kernels where bit #47 or bit #56 are set for kernel, clear for user-space. – Peter Cordes Jun 28 '22 at 01:43

Say, for example, that we have a 32-bit Linux with a 3:1 GB (user:kernel) split of the 4 GB of available virtual address space, and two machines that have 512 MB and 2 GB of physical memory respectively.

For the 512 MB machine, the Linux kernel can directly map the whole of physical memory into its kernel-space lowmem region, starting at the offset PAGE_OFFSET. No problem at all.

But what about the 2 GB machine? We have 2 GB of physical RAM that we want to map into the kernel's virtual lowmem region, but it just can't fit, since the maximum kernel virtual address space is only 1 GB, as we said at the beginning. To solve this, Linux directly maps only a portion of physical RAM into its lowmem region (896 MB on 32-bit x86, though it varies between architectures) and then sets up temporary virtual mappings for the remaining RAM as it's needed. That remainder is the so-called high-memory region.
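Those temporary mappings are what the kmap family of helpers provides. A minimal sketch, using kmap_local_page()/kunmap_local() (the current names; older kernels used kmap()/kmap_atomic()), of touching a page that may live in highmem:

```c
/* Sketch: on a 32-bit kernel a highmem page gets a short-lived mapping here;
 * on 64-bit the helpers just hand back the page's direct-map address. */
#include <linux/highmem.h>
#include <linux/string.h>

static void zero_page_sketch(struct page *page)
{
    void *vaddr = kmap_local_page(page);   /* create/borrow a kernel mapping */
    memset(vaddr, 0, PAGE_SIZE);
    kunmap_local(vaddr);                   /* and drop it again */
}
```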

This sort of hack isn't needed anymore in 64-bit mode, since the kernel can now directly map all of physical memory in the lowmem region (the direct-mapping region is 64 TB on x86-64).

Finally, this region should not be confused with the global variable high_memory, which just gives you the upper bound of the kernel's lowmem region. This means that the difference between high_memory and PAGE_OFFSET is the size of the directly-mapped (lowmem) RAM in bytes, which equals total RAM only on a machine with no highmem.
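A tiny sketch of that relation (lowmem_bytes is a made-up name; high_memory and PAGE_OFFSET are the real kernel symbols):

```c
/* Sketch: size of the directly-mapped (lowmem) region in bytes. */
#include <linux/mm.h>

static unsigned long lowmem_bytes(void)
{
    /* Equals total RAM only on a machine with no highmem. */
    return (unsigned long)high_memory - PAGE_OFFSET;
}
```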

Hope it's clear now.

KMG
  • As I asked on Peter's answer, we could handle the 2GB example completely with no highmem if we had said 2:2 split instead of 3:1, right?. So basically, up to 4GB for 32 bits, having high mem is an arbitrary choice? Then after 4GB highmem is necessary because otherwise the kernel can't access all the physical memory at once. Is that correct? – wxz Jun 25 '21 at 21:46
  • Based on what I learned in the other comments, I think the answer to my question is, a 2:2 split would be close to covering all 2GB of RAM. More like 2.x:1.x to cover the 2GB physical mapping and to include all other kernel memory. – wxz Jun 25 '21 at 22:24
  • @wxz You'd be right if that were the case (a 2:2 split), but the address-space split is decided when the kernel is configured, and the kernel developers settled on 3:1 as a good default for 32-bit x86 systems. I guess a 2:2 split could actually work without a high-memory region on ARM devices, since Linux on 32-bit ARM can choose that split, but I may be wrong. In any case, I don't think a split giving the kernel `up to 4GB` (as you said) would be used on any architecture, since it would hurt user space badly; so in your second example you are wrong. – KMG Jun 25 '21 at 22:32
  • Not that I'm wrong, just that what I'm saying could work in theory if you had no reason to have a user-space, but wouldn't be done in practice. – wxz Jun 25 '21 at 22:38
  • @wxz yes I guess you got it now. – KMG Jun 25 '21 at 22:47