Page directories in modern hierarchical paging designs contain physical addresses of page tables. For example, https://wiki.osdev.org/Paging shows the format for 32-bit x86: the high 20 bits of a PDE are the page-aligned physical address of a page table, and the low 12 bits hold flags for valid entries. That's basically the same format as a PTE that points to a final physical page.
But when the Present bit is cleared, the rest of the entry can be used to hold whatever the OS wants, e.g. a virtual address or an offset into swap space. The page-walk hardware just stops after seeing the entry is not Present, so the page walk fails. (The TLB miss becomes a #PF page fault for the OS to handle somehow, if it wasn't the result of mis-speculation.)
It's possible to design a paging architecture where some of the entries (but not all) of a page directory contain virtual addresses of page tables. For example, paging on the VAX can be viewed as having two levels, where the three register pairs that hold the addresses and lengths of page tables can collectively be considered a page directory. Two of these entries contain virtual addresses and the third contains a physical address. The virtually addressed page tables can be resolved using the physically addressed page table.
The entry that contains a physical address holds translations for the kernel partition of the virtual address space and can only be accessed in kernel mode. The entries that contain virtual addresses are per-process and need to change on every context switch. This is in contrast to the x86 paging design, where each process has its own page directory and kernel mappings have to be replicated in each of them. That said, only a single register (CR3) has to change on a context switch in x86. In addition, the tiny "page directory" in VAX makes the memory overhead of page tables better than a flat page table, but still not as good as the balanced hierarchical page table used in x86. Another trade-off is that all virtual addresses in 32-bit x86 have to go through two levels of in-memory translation for 4KB pages, while kernel virtual addresses in VAX require only a single level of in-memory translation, since the first level is held in a few registers. That is not feasible for the relatively large page directories in x86.
But in order to use multi-level paging to save storage, we need to be able to page out page tables to disk.
In general, making page tables pageable is not very important for reducing the memory overhead of paging structures. Virtual memory allocators should (and do) try to allocate memory as densely as possible starting from the lowest available levels of the page table, with many of the entries in a higher-level page directory not "present" at all, instead of pointing to a whole table of 1024 entries (4kiB) that have their Present bit cleared. It may happen, though, over a long period of allocations and deallocations, that the page table becomes sparsely populated at the innermost level but densely populated at the other levels. This is where paging out page tables can be useful, especially on platforms where the total amount of available or remaining physical memory is not ample. On Windows, page tables are pageable at all levels except the top one. Linux page tables are not pageable.
(Or use hugepages: the page directory entry is actually a translation for 4MiB of virtual memory to a contiguous 4MiB of physical memory. So again you avoid 4kiB of page table space but have that virtual memory mapped instead of not-present.)
Paging out the page tables can be done on x86 by clearing the Present bit in a PDE and using the rest of it as metadata to distinguish "not supposed to be mapped" from "paged out".
On x86 you'd have to have your page-fault handler manually walk the page tables (or look up in some other structure) to find out if it was truly an invalid page fault, or if you reached a not-present page-directory entry that logically should be present, and need to page in part of the page table and let the task retry its memory access.
(And then that could result in another page fault to page in the actual data from 4k page.) Note that hardware page walks are significantly cheaper than page faults, and can happen speculatively "in the background" during out-of-order execution.
Grouping allocations into the same 4M region (or on x86-64, 2M or 1G) as existing allocations avoids needing a new 4k page of PTEs, instead just changing some existing PTEs to Present.
(x86-64 uses 4-level page tables, with 9 bits per level 4*9 + 12 = 48 bit virtual. classic 32-bit x86 uses 2-level with 10 bits per level: 2*10 + 12 = 32 bit virtual.)
Multi-level page tables were introduced to alleviate the memory overhead of single-level page tables, which grows linearly with the size of the virtual address space (multiplied by the logarithm of the size of the physical address space, since each page table entry stores the physical address of a page). In a single-level design, by definition, the single page table cannot partially exist in physical memory. Even with a multi-level page table, allocations may end up sparsely distributed at the innermost level. In that case, the total amount of physical memory consumed by the page table may constitute a large portion of all physical memory consumed by the process (or even the whole system). That's why making page tables pageable can be useful.
Kernel mappings that are the same across all processes can share the same physical pages for their page tables, pointed-to by the high half of the PDEs in every page directory. (They'll also have the Global bit set so they're not invalidated when changing CR3, so you might not even need a page-walk for them. But if a page walk is needed to resolve a TLB miss then that same set of PTEs can stay hot in data caches. Does page walk take advantage of shared tables?)