
The following code fragment creates a function (`fun`) consisting of a single RET instruction. The loop repeatedly calls the function and overwrites the RET instruction after each return.

#include <sys/mman.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>

typedef void (*foo)();
#define RET (0xC3)

int main(){
    // Allocate an executable page (fd must be -1 for portable MAP_ANONYMOUS)
    char *ins = (char *) mmap(0, 4096, PROT_EXEC|PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    // Just write a RET instruction
    *ins = RET;
    // Make fun point to the function with just the RET instruction
    foo fun = (foo)(ins);
    // Repeat 0xfffffff times
    for(long i = 0; i < 0xfffffff; i++){
        fun();
        *ins = RET;
    }
    return 0;
}

Linux perf on an x86 Broadwell machine reports the following icache and iTLB statistics:

perf stat -e L1-icache-load-misses -e iTLB-load-misses ./a.out

Performance counter stats for './a.out':

   805,516,067      L1-icache-load-misses                                       
         4,857      iTLB-load-misses                                            

  32.052301220 seconds time elapsed

Now, look at the same code without overwriting the RET instruction.

#include <sys/mman.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>

typedef void (*foo)();
#define RET (0xC3)

int main(){
    // Allocate an executable page (fd must be -1 for portable MAP_ANONYMOUS)
    char *ins = (char *) mmap(0, 4096, PROT_EXEC|PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    // Just write a RET instruction
    *ins = RET;
    // Make fun point to the function with just the RET instruction
    foo fun = (foo)(ins);
    // Repeat 0xfffffff times
    for(long i = 0; i < 0xfffffff; i++){
        fun();
        // Commented out: *ins = RET;
    }
    return 0;
}

And here are the perf statistics on the same machine.

perf stat -e L1-icache-load-misses -e iTLB-load-misses ./a.out

Performance counter stats for './a.out':

        11,738      L1-icache-load-misses                                       
           425      iTLB-load-misses                                            

   0.773433500 seconds time elapsed

Notice that overwriting the instruction causes L1-icache-load-misses to grow from 11,738 to 805,516,067 -- roughly a 68,000x increase. iTLB-load-misses also grow, from 425 to 4,857 -- about 11x, far less dramatic than the icache misses. The running time grows from 0.773433500 seconds to 32.052301220 seconds -- a 41x slowdown!

It is unclear why the CPU should incur icache misses when the instruction footprint is so small. The only difference between the two examples is that the instruction is modified. Granted that the L1 iCache and dCache are separate, isn't there a way to install code into the iCache so that the icache misses can be avoided?

Furthermore, why is there a 10x growth in the iTLB misses?

user1205476
  • Stores don't go into the L1i, so when the CPU detects SMC, it invalidates the L1 line, flushes the pipeline, and restarts fetching, causing a miss. At least that's what I believe. The iTLB count may be due to some aliasing-avoidance mechanism, since there is also an identical dTLB entry. But again, I don't know for sure. – Margaret Bloom Sep 16 '18 at 17:23
  • 2
    To learn more about how real Intel CPUs handle self-modifying code (with a pipeline nuke), see [Observing stale instruction fetching on x86 with self-modifying code](https://stackoverflow.com/q/17395557). @MargaretBloom: I tested, and we do get counts for `machine_clears.smc` even with `mfence` + `lfence` after the store on Skylake. I was hoping that would stop speculation into the code in another page until after the store could evict the uop-cache and L1i entries. – Peter Cordes Sep 16 '18 at 20:20

1 Answer


Granted the L1 iCache and dCache are separate, isn't there a way to install code into iCache so that the cache i-cache misses can be avoided?

No.

If you want to modify code - the only path this can go is the following:

  1. Store Data Execution Engine
  2. Store Buffer & Forwarding
  3. L1 Data Cache
  4. Unified L2 Cache
  5. L1 Instruction Cache

Note that you are also missing out on the μOP Cache.

This is illustrated by this diagram[1], which I believe is sufficiently accurate.

I would suspect the iTLB misses could be due to regular TLB flushes. In the unmodified case you are not affected by iTLB misses because your instructions actually come from the μOP Cache.

If they don't, I'm not quite sure. I would think the L1 Instruction Cache is virtually addressed, so there is no need to access the TLB on a hit.

[1]: Unfortunately the image has a very restrictive copyright, so I refrain from highlighting the path or inlining the image.

Zulan
  • 2
    L1i is VIPT like L1d, but the *uop* cache is virtually addressed. An `invlpg` would have to flush the relevant uop-cache lines as well, but I guess just regular LRU eviction of an iTLB entry could leave the uop cache alone. (i.e. the uop cache could be considered part of the translation-caching in the CPU.) So yes, with uop-cache hits, there will be fewer iTLB accesses. – Peter Cordes Sep 16 '18 at 20:17
  • @Peter Cordes, what do you mean by "So yes, with uop-cache hits, there will be fewer iTLB accesses?". Do you mean "uop-cache hints"? if so, how to give hints? – user1205476 Sep 16 '18 at 23:41
  • 1
    Also, notice that the number of icache misses is about 3x the loop trip count: 805,516,067 / 0xfffffff = 3.00078. What explains these 3x icache misses per iteration? – user1205476 Sep 16 '18 at 23:43
  • @user1205476: no, I mean if code-fetch has a hit in the decoded-uop cache, it doesn't need to touch the iTLB or L1i cache. Actually, this loop is few enough uops that it probably fits in the loop buffer. Read https://www.realworldtech.com/sandy-bridge/4/, and Agner Fog's microarch guide (https://agner.org/optimize/) to learn what the uop cache is. – Peter Cordes Sep 17 '18 at 00:45
  • 1
    And BTW, just running for 41x longer means that 41x more interrupt handlers run for the same number of loop iterations, so you might be getting more iTLB misses from interrupt handlers (especially with Meltdown mitigation in the kernel). IDK if that's plausible; I didn't do the numbers for a simple loop that runs that long *without* doing SMC stuff. Or maybe you can profile `iTLB-load-misses:u` to only count ones that happen in user-space? – Peter Cordes Sep 17 '18 at 00:49
  • @user1205476: Interesting observation about ~3 icache misses per loop iteration. I'm not sure how to explain that. If L1i cache is flushed entirely on SMC, then the line(s) holding the caller loop would also be evicted. – Peter Cordes Sep 17 '18 at 00:52
  • 1
    @PeterCordes Yes most ITLB misses occur in kernel mode and there is a significant standard deviation. SMC machine clears do not flush the TLBs and only the L1I gets flushed. The SMC code runs at 345 cycles per iteration (6 instructions) while non-SMC code runs at 5 cycles per iteration (5 instructions). The only reason I didn't post an answer is I'm stuck at interpreting the 3 icache misses per iteration. I expected 2, one for the line containing the loop body and one for the line containing `fun`. But apparently there is a third miss. – Hadi Brais Sep 17 '18 at 02:10
  • @HadiBrais, I do not get your point about 2 cache misses. Why should there be any icache miss for the line containing `fun`? Just the function body should incur an icache miss. – user1205476 Sep 17 '18 at 02:35
  • 1
    @Peter Cordes, you are correct about iTLB misses coming from the interrupts. With `perf stat -e L1-icache-load-misses -e iTLB-load-misses:u ./a.out` I get: 805,517,291 L1-icache-load-misses, 307 iTLB-load-misses, 31.962878826 seconds elapsed. So there are only 307 iTLB-load-misses in user space. – user1205476 Sep 17 '18 at 02:46
  • 1
    @PeterCordes I've changed the code so that both the `fun` function and the whole loop body are in the same cache line, and still getting 3 icache misses per iteration. Very weird. – Hadi Brais Sep 17 '18 at 06:03