
I've searched through a lot of resources, but found nothing concrete on the matter:

I know that on some Linux systems the fork() syscall uses copy-on-write (COW): the parent and the child initially share the same physical pages, and the corresponding PTEs are marked read-only. When either process tries to write to such a page, a page fault occurs and the page is copied to a new location, where it can be modified.

However, I cannot understand how the OS reaches the shared PTEs to mark them as read-only. My hypothesis is that when a fork() syscall occurs, the OS performs a "page walk" over the parent's page table and marks the entries read-only, but I can find no confirmation of this, nor any information about the process.

Does anyone know how the pages come to be marked as read only? Will appreciate any help. Thanks!

ITz
  • Linux does walk the VMAs inside the fork syscall implementation: [`do_fork`](https://elixir.bootlin.com/linux/v3.16/source/kernel/fork.c#L1575) -> [`copy_process`](https://elixir.bootlin.com/linux/v3.16/source/kernel/fork.c#L1136) -> [`copy_mm`](https://elixir.bootlin.com/linux/v3.16/source/kernel/fork.c#L865) -> `dup_mm` -> `dup_mmap` ... From there I was unable to find the exact line, so a web search for "fork+COW+dup_mm" gave a hint: https://gist.github.com/cwshu/7d52bc993525c1bb7df1. The real work is done by the `retval = copy_page_range(mm, oldmm, mpnt);` line; check mm/memory.c#L1005 – osgx Feb 24 '20 at 00:02

1 Answer


Linux implements the fork syscall by iterating over all memory ranges (mmaps, stack and heap) of the parent process. Each of these ranges (VMAs, virtual memory areas) is copied by the function copy_page_range (mm/memory.c), which loops over the page table entries:

    /*
     * If it's a COW mapping, write protect it both
     * in the parent and the child
     */
    if (is_cow_mapping(vm_flags)) {
        ptep_set_wrprotect(src_mm, addr, src_pte);
        pte = pte_wrprotect(pte);
    }

where is_cow_mapping is true for private and potentially writable mappings (the flags bitfield is checked for the shared and maywrite bits, and only the maywrite bit may be set):

    #define VM_SHARED   0x00000008
    #define VM_MAYWRITE 0x00000020

    static inline bool is_cow_mapping(vm_flags_t flags)
    {
        return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
    }

PUD, PMD, and PTE are described in books like https://www.kernel.org/doc/gorman/html/understand/understand006.html and in articles like LWN 2005: "Four-level page tables merged".

How fork implementation calls copy_page_range:

  • The fork syscall implementation (sys_fork, defined via SYSCALL_DEFINE0(fork)) is do_fork (kernel/fork.c), which calls
  • copy_process, which calls many copy_* functions, including
  • copy_mm, which calls
  • dup_mm to allocate and fill a new mm struct, where most of the work is done by
  • dup_mmap (still kernel/fork.c), which checks what was mmapped and how. (Here I was unable to find the exact path to the COW implementation, so I used a web search for something like "fork+COW+dup_mm" to get hints like [1] or [2] or [3].) After checking the mmap types, the line retval = copy_page_range(mm, oldmm, mpnt); does the real work.
Marco Bonelli
osgx
  • Nice detective job, although your links are pretty old. In latest kernel versions what happens is `copy_page_range()` → `copy_p4d_range()` → `copy_pud_range()` → `copy_pmd_range()` → `copy_pte_range()` → `copy_one_pte()`. You missed that intermediate [call to `copy_p4d_range()`](https://elixir.bootlin.com/linux/latest/source/mm/memory.c#L1010) from `copy_page_range()`, which doesn't directly call `copy_pud_range()` anymore. – Marco Bonelli Feb 24 '20 at 01:03
  • Thank you both so much! Clearly explains all. One last question though, for clarification: where does the iteration begin from? It's mentioned that iteration starts via `copy_page_range` func. Looking at the code attached, I saw it works on other data structures (such as the TLB) too. Does the iteration on the page tables themselves start from the PTBR (hence, iteration acts as a "page-walk")? And one other thing I noticed from the code - It seems like the above function marks "active" on PTEs. Is it the same function that determines whether a page will be considered active or inactive? – ITz Feb 24 '20 at 11:23
  • @itaz it's explained in the answer, it's `dup_mmap()` that does the job and decides to call `copy_page_range()` – Marco Bonelli Feb 24 '20 at 14:50
  • @itaz, iteration begins in dup_mm -> dup_mmap, and it iterates not over the hardware page tables but over the Linux kernel's software structures that describe the memory regions of the process. So it is not a page walk, it is an mmap walk. dup_mmap [iterates over mmap-ed regions](https://elixir.bootlin.com/linux/v3.16/source/kernel/fork.c#L385), where some mmaps can be shared or file-backed and others are private. Some pages may already be write-protected because of COW. Not sure where you found "active"; copy_one_pte has a branch for non-present pages `(!pte_present(pte))` (e.g. swapped out). – osgx Feb 25 '20 at 10:15