2

For the sake of simplicity I will use a concrete example to explain my question. Note that I'm interested in the underlying concepts.

Consider a 2-level paging scheme i.e. we have one page directory table and multiple page tables. In order to translate a virtual address to a physical one we first use the n most significant bits of our virtual address to index the page directory table. From there we obtain the address of a page table. Using the remaining bits of our virtual address we index this page table to find the physical address of our data.
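The bit-slicing in that walk can be sketched as follows. This is a minimal Python sketch assuming a 32-bit virtual address split 10/10/12 (directory index / table index / page offset); the dict-based `page_directory` and `page_tables` structures are hypothetical stand-ins for the in-memory tables.

```python
# Hypothetical 2-level translation for a 32-bit address with 4 KiB pages:
# top 10 bits index the page directory, next 10 bits index a page table,
# low 12 bits are the offset within the page.

def translate(vaddr, page_directory, page_tables):
    dir_index = (vaddr >> 22) & 0x3FF    # top 10 bits
    table_index = (vaddr >> 12) & 0x3FF  # next 10 bits
    offset = vaddr & 0xFFF               # low 12 bits

    table = page_tables[page_directory[dir_index]]
    frame = table[table_index]           # physical frame number
    return (frame << 12) | offset
```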

Now I was wondering about the following: Does/Should the page directory table store the virtual or the physical addresses of page tables?

If it stores physical addresses of page tables then that means that we cannot page-out any page tables to disk. But in order to use multi-level paging to save storage we need to be able to page-out page tables to disk. So then we use virtual addresses in the page directory table. But this means that after indexing the page directory table we first have to translate the obtained virtual address which requires another page table walk (and in that page table walk we again have to translate a virtual address etc.). So this does not seem to work as well.

Clearly I missed something important. I am interested in hearing about the concepts behind this as well as actual implementations. Or let me know where I can find that information.

Peter Cordes
sebastian
    As you found out, using virtual addresses would lead to recursion. Using physical address **does** allow for "paging out" unneeded tables. In the x86 this is done with the P (Presence) bit. – Margaret Bloom Jan 08 '20 at 09:23
  • So when the presence bit is not set for some particular entry in the page directory table, we know that the corresponding page table is not in physical memory. But how does the MMU know where on disk to look for that page table? – sebastian Jan 08 '20 at 12:29
  • @MargaretBloom: Yes they could in theory, but AFAIK nobody does that. Multi-level page tables can be sparse, and that's plenty of savings. (4k directory + a couple 4k blocks of PTEs instead of one flat 1MiB array of PTEs.) @ araomis: if you did do that, you could use the other bits of the PDE as metadata or something, or you'd need a separate data structure to find the paged-out tables. – Peter Cordes Jan 08 '20 at 15:23

1 Answer

4

Page directories in modern hierarchical paging designs contain physical addresses of page tables. For example, https://wiki.osdev.org/Paging shows the format for 32-bit x86: the high 20 bits of a PDE are a page-aligned physical address, and the low 12 bits hold flags for valid entries. This is basically the same format as the PTEs that point to a final physical page.
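That 20/12 split can be decoded mechanically. A small sketch (flag bit positions follow the 32-bit x86 PDE format described above; the function name is illustrative):

```python
# Decode a 32-bit x86 page-directory entry: high 20 bits are the 4 KiB-aligned
# physical address of a page table, low 12 bits are flags.
PRESENT = 1 << 0    # P bit
WRITABLE = 1 << 1   # R/W bit

def decode_pde(pde):
    return {
        "present": bool(pde & PRESENT),
        "writable": bool(pde & WRITABLE),
        "table_phys_addr": pde & 0xFFFFF000,  # mask off the low 12 flag bits
    }
```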

But when the Present bit is cleared, the rest of the entry can be used to hold whatever the OS wants, e.g. a virtual address or an offset into swap space. The page-walk hardware simply stops after seeing that the entry is not Present, so the page walk fails. (The TLB miss becomes a #PF page fault for the OS to handle somehow, if it wasn't the result of mis-speculation.)
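One hypothetical encoding of such a not-present entry: keep bit 0 (Present) clear so the hardware walk stops, and pack a swap slot number into the remaining bits. The layout here is an illustration, not any particular OS's actual scheme.

```python
# Hypothetical OS-defined encoding for a paged-out page table: Present bit
# (bit 0) stays clear, bits 1..31 hold a swap slot number.

def make_swapped_out_entry(swap_slot):
    assert swap_slot < (1 << 31)   # must fit in the upper 31 bits
    return swap_slot << 1          # bit 0 = 0 -> not Present

def swap_slot_of(entry):
    assert (entry & 1) == 0        # only valid for not-present entries
    return entry >> 1
```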

It's possible to design a paging architecture where some of the entries (but not all) of a page directory contain virtual addresses of page tables. For example, paging on the VAX can be viewed as having two levels, where the three register pairs holding the addresses and lengths of page tables can collectively be considered a page directory. Two of these entries contain virtual addresses and the third contains a physical address. The virtually addressed page tables can be resolved using the physically addressed page table.

The entry that contains a physical address holds translations for the kernel partition of the virtual address space and can only be accessed in kernel mode. The entries that contain virtual addresses are per-process and need to change on every context switch. This is in contrast to the x86 paging design, where each process has its own page directory and kernel mappings have to be replicated in each of them; on the other hand, only a single register (CR3) has to change on a context switch in x86. In addition, the tiny "page directory" in the VAX makes the memory overhead of page tables better than a flat page table, but still not as good as the balanced hierarchical page tables used in x86. Another trade-off is that all virtual addresses in 32-bit x86 have to go through two levels of in-memory translation for 4KB pages, while kernel virtual addresses in the VAX require only a single level of in-memory translation, since the first level is held in a few registers. This is not feasible for the relatively large page directories in x86.

But in order to use multi-level paging to save storage we need to be able to page-out page tables to disk.

In general, making page tables pageable is not very important for reducing the memory overhead of paging structures. Virtual memory allocators should (and do) try to allocate memory as densely as possible starting from the lowest available levels of the page table, leaving many of the entries in higher-level page directories not "present" at all, instead of pointing to a whole table of 1024 entries (4kiB) that all have their Present bit cleared. It may happen, though, over a long period of allocations and deallocations, that the page table becomes sparsely populated at the innermost level but densely populated at the other levels. This is where paging out page tables can be useful, especially on platforms where the total amount of available or remaining physical memory is not ample. On Windows, page tables are pageable at all levels except the top one. Linux page tables are not pageable.

(Or use hugepages: the page directory entry is actually a translation for 4MiB of virtual memory to a contiguous 4MiB of physical memory. So again you avoid 4kiB of page table space but have that virtual memory mapped instead of not-present.)
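The hugepage case changes the translation math: with the PS (page size) bit set, the PDE itself is the final translation. A sketch assuming the classic 32-bit x86 format (PS is bit 7, the top 10 bits of the entry give the 4 MiB-aligned physical base):

```python
# Hugepage translation sketch for 32-bit x86: a PDE with PS set maps a whole
# 4 MiB page, so the low 22 bits of the virtual address are the offset.
PS_BIT = 1 << 7

def translate_huge(vaddr, pde):
    assert pde & PS_BIT            # only valid for 4 MiB-page entries
    phys_base = pde & 0xFFC00000   # 4 MiB-aligned physical base
    return phys_base | (vaddr & 0x3FFFFF)
```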


Paging out the page tables can be done on x86, by clearing the Present bit on a PDE and using the rest of it as metadata to distinguish between "not supposed to be mapped" and "paged out".

On x86 you'd have to have your page-fault handler manually walk the page tables (or look up in some other structure) to find out if it was truly an invalid page fault, or if you reached a not-present page-directory entry that logically should be present, and need to page in part of the page table and let the task retry its memory access.

(And then that could result in another page fault to page in the actual data from 4k page.) Note that hardware page walks are significantly cheaper than page faults, and can happen speculatively "in the background" during out-of-order execution.
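The fault-handler decision described above can be sketched as a classifier. This is hypothetical OS logic, not any real kernel's code; `is_mapped` stands in for whatever separate bookkeeping the OS consults about which regions should logically be mapped.

```python
# Hypothetical page-fault triage: walk the OS's view of the page directory
# to decide what kind of fault this was.
PRESENT = 1 << 0

def classify_fault(vaddr, page_directory, is_mapped):
    dir_index = (vaddr >> 22) & 0x3FF
    pde = page_directory[dir_index]
    if not (pde & PRESENT):
        # Not-present PDE: is this region supposed to be mapped at all?
        return "page_in_page_table" if is_mapped(vaddr) else "invalid_access"
    # PDE is present, so the fault came from the PTE level (or the data page).
    return "page_in_data_page"
```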


Grouping allocations into the same 4M region (or on x86-64, 2M or 1G) as existing allocations avoids needing a new 4k page of PTEs, instead just changing some existing PTEs to Present.

(x86-64 uses 4-level page tables with 9 bits per level: 4*9 + 12 = 48-bit virtual. Classic 32-bit x86 uses 2-level with 10 bits per level: 2*10 + 12 = 32-bit virtual.)


Multi-level page tables were introduced to alleviate the memory overhead of single-level page tables, which grows linearly with the size of the virtual address space (multiplied by the logarithm of the size of the physical address space, since each page table entry stores the physical address of a page). In a single-level design, by definition, the single page table cannot partially exist in physical memory. Even with a multi-level page table, allocations may end up sparsely distributed at the innermost level of the page table. In this case, the total amount of physical memory consumed by the page tables may constitute a large portion of all physical memory consumed by the process (or even the whole system). That's why making page tables pageable can be useful.
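For concreteness, the worked numbers behind that overhead claim for the 32-bit case (4 KiB pages, 4-byte entries):

```python
# A flat (single-level) page table for a 32-bit virtual address space needs
# one entry per 4 KiB page: 2**20 entries * 4 bytes = 4 MiB per process,
# all of which must be resident at once.
ENTRY_SIZE = 4        # bytes per entry
VADDR_BITS = 32
PAGE_SHIFT = 12       # 4 KiB pages

num_entries = 2 ** (VADDR_BITS - PAGE_SHIFT)
flat_table_bytes = num_entries * ENTRY_SIZE
print(flat_table_bytes // (1024 * 1024))   # 4 (MiB)
```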

Kernel mappings that are the same across all processes can share the same physical pages for their page tables, pointed-to by the high half of the PDEs in every page directory. (They'll also have the Global bit set so they're not invalidated when changing CR3, so you might not even need a page-walk for them. But if a page walk is needed to resolve a TLB miss then that same set of PTEs can stay hot in data caches. Does page walk take advantage of shared tables?)

Hadi Brais
Peter Cordes
  • Thanks for the detailed answer! Helped me quite a bit! – sebastian Jan 08 '20 at 16:11
  • @HadiBrais: Thanks for the correction, I was basing this on knowledge of Linux which I'm pretty sure does *not* page the page tables. Linux never pages any kernel memory, only user-space, and I'm pretty sure page tables count as kernel, along with the kernel stack for every thread/task. Yes, sparse allocations can suck, especially if most of it gets paged out or is COW mapped to a few physical pages. (So avoid that when possible). Linux has sometimes optimized for bigger systems at the expense of worst-case performance on small systems, I think. – Peter Cordes Jan 08 '20 at 22:23