
So calloc() works by asking the OS for some virtual memory. The OS is working in cahoots with the MMU, and cleverly responds with a virtual memory address which actually maps to a copy-on-write, read-only page full of zeroes. When a program tries to write to anywhere in that page, a page fault occurs (because you cannot write to read-only pages), a copy of the page is created, and your program's virtual memory is mapped to this brand new copy of those zeroes.
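As a concrete sketch of that pattern (hypothetical code; sizes chosen arbitrarily): reads from fresh calloc'd pages can all be satisfied by the shared zero page, while the first write to each page takes the fault-and-copy path described above.

#include <stdio.h>
#include <stdlib.h>

#define PG 4096

int main(void) {
    size_t n = 16 * PG;
    unsigned char *buf = calloc(n, 1);      /* fresh pages, guaranteed zeroed */
    if (!buf) return 1;

    unsigned long sum = 0;
    for (size_t i = 0; i < n; i += PG)
        sum += buf[i];                      /* reads: can map to the shared zero page */

    for (volatile unsigned char *p = buf; p < buf + n; p += PG)
        *p = 0xFF;                          /* first write per page: fault, copy, remap */

    printf("sum = %lu\n", sum);             /* prints 0: calloc'd memory starts zeroed */
    free(buf);
}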

Now that Meltdown is a thing, OSes have been patched so that it's no longer possible to speculatively execute across the kernel-user boundary. This means that whenever user code calls kernel code, it effectively causes a pipeline stall. Typically, when the pipeline stalls in a loop, it's devastating for performance, since the CPU ends up wasting time waiting for data, whether from cache or main memory.

Given that, what I want to know is:

  • When a program writes to a never-before-accessed page which was allocated with calloc(), and the remapping to the new CoW page occurs, is this executing kernel code?
  • Is the page fault copy-on-write functionality implemented at the OS level or the MMU level?
  • If I call calloc() to allocate 4GiB of memory, then initialize it with some arbitrary value (say, 0xFF instead of 0x00) in a tight loop, is my (Intel) CPU going to be hitting a speculation boundary every time it writes to a new page?
  • And finally, if this effect is real, is there any case where it's significant to real-world performance?
Tullo_x86
  • First question: Yes, a page fault occurs on first access, which is handled by the kernel. Second question: CoW is implemented at the OS level. Windows uses a bit in the PTE to mark CoW pages and Linux uses a bit in a page descriptor structure maintained by the OS. Third question: I think it depends on the mitigation. Fourth question: Needs measuring. – Hadi Brais May 05 '18 at 23:13
  • Speculative execution across the kernel/user boundary was never possible; Intel CPUs don't rename the privilege level, i.e. kernel/user transitions always required a full pipeline flush. I think you're misunderstanding Meltdown: it's caused purely by speculative execution in user-space and [delayed handling of the privilege checks on TLB hits](https://security.stackexchange.com/questions/177100/why-are-amd-processors-not-less-vulnerable-to-meltdown-and-spectre/177101#177101). (AFAIK, no other uarches rename the privilege level or otherwise speculate into kernel code.) – Peter Cordes May 05 '18 at 23:21
  • @PeterCordes I'm a little confused. I'm wondering whether there is a CPU that can speculatively execute an exception or fault handler (in kernel mode) when an instruction faults but has not yet retired. Perhaps it would only prefetch the instructions (and decode them) but not execute them. But that is not a security issue. – Hadi Brais May 05 '18 at 23:26
  • @HadiBrais: CPUs don't predict page faults, so it doesn't matter in this case anyway; prefetch or decode of the page fault entry point could maybe happen while the pipeline was flushing, but it wouldn't start until the page-faulting instruction tried to retire. A faulting load/store is marked to take effect on retirement, and doesn't re-steer the front-end; the whole key to Meltdown is the lack of action on a faulting load until it reaches retirement. But anyway, maybe for `syscall` it might prefetch the kernel entry point, but it definitely flushes the pipeline before running any kernel insns. – Peter Cordes May 05 '18 at 23:46
  • Note that there is really no copy-on-write behavior in your scenario, where the first access is a write. If your first access is a write, the zero page never comes into it and there is no copying: before the write, the page isn't mapped at all, and the write fault immediately allocates a new private page. Only read faults may result in pages all pointing to the zero page. This doesn't really invalidate your question, only some of your detailed description. – BeeOnRope May 06 '18 at 04:56

2 Answers


Your premise is wrong. Page faults were never pipelined / super-cheap. Meltdown (and Spectre) mitigation does make them more expensive, though, along with system calls and all other user->kernel transitions.


Speculative execution across the kernel/user boundary was never possible; Intel CPUs don't rename the privilege level, i.e. kernel/user transitions always required a full pipeline flush. I think you're misunderstanding Meltdown: it's caused purely by speculative execution in user-space and delayed handling of the privilege checks on TLB hits.

This is universal in CPU design, AFAIK. I'm not aware of any microarchitectures that rename the privilege level or otherwise speculate into kernel code, x86 or otherwise.

The cost added by Meltdown mitigation is that entering the kernel flushes the TLB. (Or on CPUs with TLB process-context ID support, the kernel can use PCIDs to make using separate page-tables for kernel vs. user-space much cheaper).

The kernel entry point (on Linux) becomes a trampoline that swaps page tables and jumps to the real kernel entry point, to avoid exposing the kernel ASLR offset to user-space. But other than that and an extra `mov cr3, reg` on entry and exit from the kernel (setting a new page table), nothing else is changed.

(Spectre mitigation is tricky, too, and required more changes like retpolines... and might also significantly increase the cost of user->kernel->user. IDK about page fault costs.)

@BeeOnRope reports (see comments and his answer for full details) that without Spectre patches, with just the Meltdown patches applied but the nopti boot option used to "disable" them, the cost of a round trip to the kernel on a Skylake CPU (a `syscall` with a bogus RAX, returning -ENOSYS right away) went up from ~100 to ~300 cycles. So that's maybe the cost of the trampoline? And with actual page-table isolation enabled, it went up to ~700 cycles. That's without Spectre mitigation patches at all. (Also, that's the x86-64 `syscall` entry point, not page-fault. They're likely similar, though.)
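For reference, here's a minimal sketch of that kind of round-trip measurement (my assumptions, not BeeOnRope's exact benchmark: x86-64 Linux, glibc's syscall() wrapper, and timing with rdtsc):

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>      /* syscall() */
#include <x86intrin.h>   /* __rdtsc() */

int main(void) {
    const int iters = 100000;
    unsigned long long t0 = __rdtsc();
    for (int i = 0; i < iters; i++)
        syscall(999);    /* bogus syscall number: kernel returns -ENOSYS right away */
    unsigned long long t1 = __rdtsc();
    /* Rough average; note rdtsc counts reference cycles, not core clock cycles. */
    printf("~%llu cycles per kernel round trip\n", (t1 - t0) / iters);
}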


Page fault exceptions:

CPUs don't predict page faults, so they couldn't speculatively execute the handler anyway. Prefetch or decode of the page fault entry point could maybe happen while the pipeline was flushing, but that process wouldn't start until the page-faulting instruction tried to retire. A faulting load/store is marked to take effect on retirement, and doesn't re-steer the front-end; the whole key to Meltdown is the lack of action on a faulting load until it reaches retirement.

Related: When an interrupt occurs, what happens to instructions in the pipeline?

Also: Out-of-order execution vs. speculative execution has some detail about what kind of speculation really causes Meltdown, and how CPUs handle faults.


When a program writes to a never-before-accessed page which was allocated with calloc(), and the remapping to the new CoW page occurs, is this executing kernel code?

Yes, page faults are handled by the kernel's page-fault handler. There's no pure-hardware handling for copy-on-write.

If I call calloc() to allocate 4GiB of memory, then initialize it with some arbitrary value (say, 0xFF instead of 0x00) in a tight loop, is my (Intel) CPU going to be hitting a speculation boundary every time it writes to a new page?

Yes. The kernel doesn't fault-around for zeroed pages (unlike for file-backed mappings when data is hot in the pagecache). So every new page touched causes a pagefault, even for small 4k normal pages. (Thanks to @BeeOnRope for accurate info on this.) With anonymous hugepages, you'll only pagefault once per 2MiB (x86-64), which is tremendously better.

If you want to avoid per-page costs on a Linux system, allocate with mmap(MAP_POPULATE) to prefault all the pages into the HW page table. I'm not sure whether madvise can prefault pages for you, e.g. madvise(MADV_WILLNEED) on an already-mapped region. But madvise(MADV_HUGEPAGE) will encourage the kernel to use anonymous hugepages (and maybe defrag physical memory to free up contiguous 2M blocks to enable that, if you don't have it configured to do that without madvise).
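A sketch of both approaches (assuming Linux; error handling mostly omitted, and transparent-hugepage behavior depends on kernel config):

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

int main(void) {
    size_t sz = 1UL << 30;   /* 1 GiB */

    /* Prefault everything up front: the mmap call itself is expensive,
       but later accesses take no page faults. */
    char *pre = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    if (pre == MAP_FAILED) return 1;

    /* Or opt in to anonymous hugepages: roughly one fault per 2MiB
       instead of one per 4KiB page. */
    char *huge = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (huge == MAP_FAILED) return 1;
    madvise(huge, sz, MADV_HUGEPAGE);

    munmap(pre, sz);
    munmap(huge, sz);
}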

Related: Two TLB-miss per mmap/access/munmap has some perf results on a Linux kernel with KPTI patches.

Peter Cordes
  • BTW, I measured the cost of the Meltdown mitigations (before Spectre mitigations were released) and the cost was significant even when disabled at boot time with `nopti` - IIRC the minimum cost went from just over 100 cycles to about 300. With the mitigation enabled, it was closer to 700 cycles. Looking at the entry code and `perf` reports, the entry point got a bunch more complicated, which I guess accounts for the extra cost. – BeeOnRope May 06 '18 at 04:35
  • @BeeOnRope: Thanks, fixed / updated. That was the x86-64 `syscall` entry point you were looking at, right, not the page-fault handler? [Fastest Linux system call](//stackoverflow.com/q/48913091). I wonder if the extra 200c even with `nopti` was a dependent cache miss for a pointer or something. Any chance it's cheaper now with better Meltdown patches that take advantage of process-context IDs to avoid a full TLB flush? Your early testing might have been a safe-but-slow version of the patches. – Peter Cordes May 06 '18 at 12:40
  • Yes, I was only looking at system call costs, and making the (perhaps invalid) assumption that the Meltdown-related costs would be similar for the page-fault and syscall cases (if the primary cost is the PTI manipulation/CR3 write). I actually ran my `syscall` benchmark again and the [results are here](https://gist.github.com/travisdowns/0f5c0d1139d87e8fa3b853396dd01a9a). The syscall overhead is right around 650 cycles for a non-existent syscall, and about 720 cycles minimum for very trivial syscalls like `getuid`. – BeeOnRope May 06 '18 at 19:00
  • When I disabled KPTI (Meltdown) and Spectre mitigations with `nopti` and `spectre_v2=off`, respectively, the times shot up to more than 1,700 cycles for any syscall. So something is broken, performance-wise, with the boot-time disablement, at least after the Spectre patches (I didn't see this before when I looked at Meltdown only). This is kernel `4.13.0-39-generic`. – BeeOnRope May 06 '18 at 19:02
  • I updated the gist to include results from `4.10.0-42`, which is before any Meltdown/Spectre stuff. The results were as I remember them: as low as 110 cycles of syscall overhead. So we are looking at at least a 5-6x overhead for syscalls with the mitigations on my kernel, and (oddly) a 15x or so penalty on my kernel if you try to disable those mitigations. – BeeOnRope May 06 '18 at 19:14
  • BTW, fault-around doesn't help for write access to new anonymous pages. It won't "fault around" nearby pages when you access such a page. Fault-around is for pages in the page cache which already exist in RAM, so it only needs to add their mapping to the process page table. Anonymous pages never "already exist in RAM": they need to be allocated and zeroed, so fault-around doesn't kick in and you don't see any 16x reduction in page faults. – BeeOnRope May 06 '18 at 20:48
  • Consider that if fault-around _did_ behave in that way (bringing in extra anonymous pages), it would increase the working set of any process that accessed pages in a very sparse manner by 16x, which would be a big regression on some workloads. – BeeOnRope May 06 '18 at 21:00
  • I went ahead and actually tested all three configurations (old kernel w/o mitigation code at all, new kernel with mitigations on, and new with mitigations disabled at boot). The results were consistent with the above: you see about a 450-cycle regression in page fault time, roughly consistent with the absolute difference in syscall time (which was a bit more than 500 cycles), leading to an overall regression of about 14% on the old vs. new kernels. The new kernel with mitigations disabled was much slower than either, which is really weird. Details in my answer. – BeeOnRope May 06 '18 at 21:18
  • @BeeOnRope: The kernel *could* detect the access pattern if it kept any history, the same way HW prefetch notices sequential access. But fault-around would be the wrong term for that, so good point. Keeping a pool of already-zeroed pages used to be a thing, but maybe isn't anymore now that L1d cache is so much faster than RAM. But if it did have extra zeroed physical pages, the kernel could speculatively wire already-zeroed pages, and reclaim them later if they were never dirtied (e.g. when handling a later page fault for the same process, but it would still have to invalidate TLB entries to unmap). – Peter Cordes May 07 '18 at 01:14
  • To answer a question you asked near the end of your post: no, `madvise(MADV_WILLNEED)` doesn't fault in anonymous pages. I haven't found a good way to fault in anon pages ahead of time, but asked [about it here](https://stackoverflow.com/q/56411164/149138). – BeeOnRope Jun 06 '19 at 16:27

Yes, use of calloc()-allocated memory will suffer a performance degradation due to the Meltdown and Spectre patches.

In fact, calloc() isn't special here: malloc(), new, and more generally all allocated memory will probably suffer approximately the same performance impact. Both calloc() and malloc() are ultimately backed by pages returned by the OS (although the allocator will re-use them after they are freed). The only real difference is that when a smart allocator goes down the path of using new pages from the OS (rather than re-using a previously freed allocation), calloc can omit the zeroing because the OS-provided pages are guaranteed to be zero. Other than that, the allocator behavior is largely the same, and the OS-level zeroing behavior is the same (there is usually no option to ask the OS for non-zeroed pages).

So the performance impact applies more broadly than you thought, but it is likely smaller than you suggest, since a page fault already does a lot of work anyway, so you aren't talking about an order-of-magnitude degradation. See Peter's answer on the reasons the impact is likely to be limited. I wrote this answer mostly because the answer to your headline question is still yes, since there is some impact.

To estimate the impact on a malloc-heavy workload, I tried running an allocation- and page-fault-heavy test on a current kernel (4.13.0-39-generic) with the Spectre and Meltdown mitigations, as well as on an older kernel prior to these mitigations.

The test code is very simple:

#include <stdlib.h>
#include <stdio.h>

#define SIZE        (40 * 1024 * 1024)
#define PG_SIZE     4096

int main() {
    char *mem = malloc(SIZE);
    /* Touch one byte per page; each first touch takes a page fault. */
    for (volatile char *p = mem; p < mem + SIZE; p += PG_SIZE) {
        *p = 'z';
    }
    printf("pages touched: %d\npointer value : %p\n", SIZE / PG_SIZE, mem);
}

The results on the newer kernel were about 3,700 cycles per page fault, and on the older kernel without mitigations around 3,300 cycles. The overall regression (presumably) due to the mitigations was about 14%. Note that this is on Skylake hardware (i7-6700HQ), where some of the Spectre mitigations are somewhat cheaper, and where the kernel supports PCID, which makes the KPTI Meltdown mitigations cheaper. The results might be worse on different hardware.

Oddly, the results on the new kernel with Spectre and Meltdown mitigations disabled at boot (using spectre_v2=off nopti) were much worse than either the new kernel default or the old kernel, coming in at about 5050 cycles per page fault, something like a 35% regression over the same kernel with the mitigations enabled. So something is going really wrong, performance-wise when the mitigations are disabled.

Full Results

Here is the full perf stat output for the three runs.
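These were collected with plain perf stat (assuming the test program above was built as ./pagefaults, which matches the command shown in the output headers):

perf stat ./pagefaults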

Old Kernel (4.10.0-42)

pages touched: 10240
pointer value : 0x7f7d2561e010

 Performance counter stats for './pagefaults':

         12.980048      task-clock (msec)         #    0.976 CPUs utilized          
                 0      context-switches          #    0.000 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
            10,286      page-faults               #    0.792 M/sec                  
        33,662,397      cycles                    #    2.593 GHz                    
        27,230,864      instructions              #    0.81  insn per cycle         
         4,535,443      branches                  #  349.417 M/sec                  
            11,760      branch-misses             #    0.26% of all branches        

0.013293417 seconds time elapsed

New Kernel (4.13.0-39)

pages touched: 10240
pointer value : 0x7f306ad69010

 Performance counter stats for './pagefaults':

         14.789615      task-clock (msec)         #    0.966 CPUs utilized          
                 8      context-switches          #    0.541 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
            10,288      page-faults               #    0.696 M/sec                  
        38,318,595      cycles                    #    2.591 GHz                    
        28,796,523      instructions              #    0.75  insn per cycle         
         4,693,944      branches                  #  317.381 M/sec                  
            26,853      branch-misses             #    0.57% of all branches        

       0.015312764 seconds time elapsed

New Kernel (4.13.0-39) spectre_v2=off nopti

pages touched: 10240
pointer value : 0x7ff079ede010

 Performance counter stats for './pagefaults':

         16.690621      task-clock (msec)         #    0.982 CPUs utilized          
                 0      context-switches          #    0.000 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
            10,286      page-faults               #    0.616 M/sec                  
        51,964,080      cycles                    #    3.113 GHz                    
        28,602,441      instructions              #    0.55  insn per cycle         
         4,699,608      branches                  #  281.572 M/sec                  
            25,064      branch-misses             #    0.53% of all branches        

       0.017001581 seconds time elapsed
BeeOnRope
  • For future readers: the test system is a Skylake i7-6xxxHQ, IIRC. – Peter Cordes May 07 '18 at 01:25