
Consider x86's 32-bit paging scheme as a concrete example. In the Intel developer's manual I found the following figure, which describes how 32-bit paging converts a linear address into a physical address.

Paging diagram
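
If I'm reading the figure correctly, the translation it shows boils down to something like the sketch below (my own toy model, assuming the usual 10/10/12-bit split; `fake_phys` and `phys_read32` just stand in for physical memory, which real hardware reads itself, and I'm ignoring the present/permission bits):

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint8_t fake_phys[1 << 20];          /* 1 MiB of pretend physical RAM */

static uint32_t phys_read32(uint32_t paddr) /* stand-in for a physical read */
{
    uint32_t v;
    memcpy(&v, &fake_phys[paddr], 4);
    return v;
}

static uint32_t translate(uint32_t cr3, uint32_t linear)
{
    uint32_t dir_index   = (linear >> 22) & 0x3FFu;  /* bits 31..22: page directory index */
    uint32_t table_index = (linear >> 12) & 0x3FFu;  /* bits 21..12: page table index     */
    uint32_t offset      =  linear        & 0xFFFu;  /* bits 11..0 : offset within page   */

    uint32_t pde = phys_read32((cr3 & ~0xFFFu) + dir_index * 4);    /* stage 1 */
    uint32_t pte = phys_read32((pde & ~0xFFFu) + table_index * 4);  /* stage 2 */

    return (pte & ~0xFFFu) | offset;                                /* stage 3 */
}

int main(void)
{
    /* Directory at 0x1000, one page table at 0x2000, mapping linear
       0x00400000 (directory index 1, table index 0) to physical 0x5000. */
    uint32_t cr3 = 0x1000, pde = 0x2000 | 1, pte = 0x5000 | 1;
    memcpy(&fake_phys[0x1000 + 1 * 4], &pde, 4);
    memcpy(&fake_phys[0x2000 + 0 * 4], &pte, 4);

    printf("0x%08x -> 0x%08x\n", 0x00400123u, translate(cr3, 0x00400123u));
    return 0;
}
```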

I don't understand the advantage of this three-stage process over, for example, using the upper bits of the linear address to index one big page table, and then using the lower 12 bits as the offset within the selected page.

The reason I don't understand the need for the three-stage process is that it surely can't address any more than 2^20 pages, since only 20 bits of the linear address remain once the page offset is excluded. And besides not being able to reach any more pages, I can't imagine that it performs any better.

Peter Cordes
Jacob Garby
  • Without the stages, you would need one giant page table with a million entries, even if you have only a few hundred valid pages. A page table that big would require 4 MB of memory, which is a problem if your computer has only 4 MB to begin with. There would be no memory left for anything else. – Raymond Chen May 01 '20 at 02:39

1 Answer


Keep in mind that this design came from the Intel 80386, released in 1985, and has remained essentially unchanged since then.

A page table entry is 4 bytes. If you need 2^20 page table entries, that's 4 MiB of memory for your page table alone. That might seem reasonable to you today, but by 1985 standards that's outrageous. Memory in those days cost something like $1000 per megabyte, and most people were used to having 640K or less.

Furthermore, if you're going to use multitasking, which was the major advance of the 386, you need a separate page table for every task. So multiply that $4000 by another big number; Unix users were already used to being able to run dozens of processes at a time.

Most processes wouldn't have needed anywhere near 4 GiB of virtual memory: again, hardly anyone had anywhere near that much physical memory, or even disk space. A real memory hog (Emacs, maybe?) might have needed a megabyte or two. With the two-level structure, you only need about as many page table entries as the number of pages of memory you're actually using, plus a bit more for the page directory. Most of the page directory entries won't be needed and can be marked unused, without needing a page table page to point to. So your memory hog now only needs a few KiB for paging overhead.
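
To put rough numbers on that, here's a quick back-of-the-envelope sketch (the 3-region process and the 20-process count are made-up figures for illustration, and I assume each region fits inside one 4 MiB-aligned slice of the address space):

```c
#include <stdio.h>

int main(void)
{
    const long PAGE     = 4096;   /* 4 KiB pages                   */
    const long PTE_SIZE = 4;      /* 32-bit paging: 4-byte entries */

    /* Flat design: one entry for each of the 2^20 possible pages,
       whether or not they are mapped. */
    long flat_bytes = (1L << 20) * PTE_SIZE;               /* 4 MiB */

    /* Two-level design: one 4 KiB page directory, plus one 4 KiB page
       table per region actually in use (each table covers a 4 MiB
       aligned slice of the address space). */
    long regions_in_use  = 3;                              /* e.g. code, heap, stack */
    long two_level_bytes = PAGE + regions_in_use * PAGE;   /* 16 KiB */

    printf("flat page table : %ld KiB per process\n", flat_bytes / 1024);
    printf("two-level tables: %ld KiB per process\n", two_level_bytes / 1024);
    printf("for 20 processes: %ld MiB vs %ld KiB\n",
           20 * flat_bytes / (1024 * 1024), 20 * two_level_bytes / 1024);
    return 0;
}
```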

Sure, it takes a few more cycles to walk the extra level, but it's a reasonable tradeoff for thousands of dollars' worth of memory savings. Anyway, the CPU cached page table entries (in the TLB), so it didn't have to walk them in memory very often.

Nate Eldredge
  • *That might seem reasonable to you today,* - 4 MiB of page tables *per process*? Requiring *physically contiguous* memory? Nope, that's still unreasonable in size, and the physically-contiguous requirement could need defragmentation or a special allocator / free list for hugepages. – Peter Cordes May 01 '20 at 03:09
  • @PeterCordes: Thanks, fixed the 3 -> 2 levels. And sure, 4 MiB is unreasonable, but Kids These Days might not balk. – Nate Eldredge May 01 '20 at 03:12
  • Also worth mentioning that CPUs can (and do) cache page *directory* entries to speed up future page-walks in the same sub-tree. More valuable for x86-64's 4-level structure ([4x 9-bit levels with 8-byte PTEs](https://stackoverflow.com/questions/46509152/why-in-64bit-the-virtual-address-are-4-bits-short-48bit-long-compared-with-the) instead of legacy 32-bit 4-byte PTEs for 2x 10 bit levels). [What is PDE cache?](//stackoverflow.com/q/26945448). And of course a HW page walker can fetch page-table data through L2 or L1d cache and get cache hits, making repeated TLB misses suck slightly less. – Peter Cordes May 01 '20 at 03:18
  • And that hugepages let you use what's normally a PDE as a PTE that covers the whole range of what would normally be a subtree: 4MiB for 32-bit legacy page tables, 1G or 2MiB for PAE / x86-64 page tables. – Peter Cordes May 01 '20 at 03:20
  • I don't think the year 1985 is the thing here; multiple levels are how you achieve *sparsity* of what would otherwise be a very long array. Then and now. A single level would scale up very badly as the address width increases. – Margaret Bloom May 01 '20 at 07:29
  • @MargaretBloom: I think it's fair to say that sparse page tables were an absolute requirement in 1985. In 2020 it would merely be a bad design for a 4 GiB address space (one that any CPU architect would still reject). It could still be used in practice if someone did build a CPU that way, not a total showstopper. Scalability to wider address widths isn't a great argument; you'll need changes anyway for wider physical and/or virtual addresses (e.g. 8-byte PTEs instead of 4), so *if* a flat page table had been good for 386 specifically, Intel might well have done that. – Peter Cordes May 02 '20 at 06:20
  • @MargaretBloom: I just had the horrible idea that if Intel had used a flat page table, OSes could set segment limits to catch accesses to unused high parts of virtual address space, so that part of the page table could be used for something else. This has the obvious downside of limiting virtual address space layout, and of having to use segmentation. But this is totally the kind of engineering around a bad design that could have happened if the architects of 386 hadn't been sane. – Peter Cordes May 02 '20 at 06:23
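
As a small footnote to the comments above about the legacy 2x10-bit split versus x86-64's 4x9-bit split (and hugepages), here is a minimal sketch of just the index extraction; the example addresses are made up:

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Legacy 32-bit paging: 10 + 10 + 12 bits, 4-byte entries. */
    uint32_t lin32 = 0x12345678u;
    printf("32-bit: dir=%u table=%u offset=0x%x\n",
           (lin32 >> 22) & 0x3FFu, (lin32 >> 12) & 0x3FFu, lin32 & 0xFFFu);

    /* x86-64 4-level paging: 9 + 9 + 9 + 9 + 12 bits, 8-byte entries. */
    uint64_t lin64 = 0x00007f1234567890ULL;
    printf("x86-64: pml4=%u pdpt=%u pd=%u pt=%u offset=0x%x\n",
           (unsigned)((lin64 >> 39) & 0x1FF),
           (unsigned)((lin64 >> 30) & 0x1FF),
           (unsigned)((lin64 >> 21) & 0x1FF),
           (unsigned)((lin64 >> 12) & 0x1FF),
           (unsigned)(lin64 & 0xFFF));

    /* A hugepage stops the walk one level early: with the PS bit set, a
       PDE maps a whole 2 MiB (or 4 MiB in legacy 32-bit mode) region
       directly, so the low 21 (or 22) bits become the page offset. */
    return 0;
}
```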