
I have run into a weird issue where the CPU believes that I am modifying currently executing code and repeatedly triggers self-modifying code (SMC) machine clears.

My (simplified) program does the following:

  1. Allocate an executable buffer.
  2. Copy a 64-byte payload to some position X in the buffer.
  3. Call payload at position X.
  4. Go back to 2.

...for 100'000'000 iterations.

main.c:

#include <stdint.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>

extern void smc(void *bufferPtr, void *bufferEndPtr);

int main()
{
    const int BUFFER_LENGTH = 4096;
    
    void *bufferPtr = mmap(NULL, BUFFER_LENGTH, PROT_READ | PROT_WRITE | PROT_EXEC, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    void *bufferEndPtr = (char *)bufferPtr + BUFFER_LENGTH;
    printf("Instruction block buffer: %p, %s\n", bufferPtr, strerror(errno));
    
    smc(bufferPtr, bufferEndPtr);
    
    return 0;
}

smc.asm:

[section .text]

align 64
payload:
    ret

%define BUFFER_STEP 64

align 64
[global smc]
; rdi: bufferPtr
; rsi: bufferEndPtr
smc:
    push r10
    push r11
    push r12

    mov rax, 100_000_000
    mov r10, rdi ; r10 points to begin of buffer
    mov r11, rdi ; r11 points to current buffer position
    mov r12, rsi ; r12 points to end of buffer

.loop:
    ; Done?
    dec rax
    je .end
    
    mov rcx, 64
    mov rdi, r11
    lea rsi, [rel payload]
    
    ; Store
    rep movsb
    
    ; Call
    call r11
    
    ; Move buffer pointer
    lea r11, [r11 + BUFFER_STEP]
    cmp r11, r12
    jb .next
    mov r11, r10

.next:
    jmp .loop

.end:
    pop r12
    pop r11
    pop r10
    ret

Compile with:

nasm smc.asm -f elf64 -o smc.o
gcc -c main.c -O2 -o main.o
gcc main.o smc.o -o prog

I measure the program's execution time and the MACHINE_CLEARS.SMC performance counter using

sudo perf stat -e r04c3 ./prog

Results on an Intel Core i7-7567U:

| BUFFER_LENGTH (bytes) | BUFFER_STEP (bytes) | MACHINE_CLEARS.SMC | Execution time (seconds) |
|-----------------------|---------------------|--------------------|--------------------------|
| 1 x 4K                | 0                   | 199'999'982        | 14.53                    |
| 1 x 4K                | 64                  | 199'999'740        | 14.91                    |
| 256 x 4K              | 2048                | 105'550'699        | 7.89                     |
| 256 x 4K              | 4096                | 130'573'069        | 9.83                     |

Although I am shifting the store destination (writing to a different location each time), I still get millions of SMC machine clears, leading to a massive performance penalty.

Adding various fences and/or serializing instructions before/after the store does not yield any considerable improvement. Note that, while the shifting somewhat reduces the number of machine clears, it also leads to a large number of branch target mispredictions at the call instruction.
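
For illustration, a minimal sketch of where such a fence sits in the inner loop, reusing the registers from smc.asm above (mfence shown here; lfence or cpuid would go in the same spot, with cpuid additionally needing the clobbered rax/rbx/rcx/rdx saved around it):

; Store
rep movsb

; Attempted serialization between store and call (no measurable effect)
mfence

; Call
call r11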

When I run the same program with a 4K buffer, a 0-byte step, an mfence after the store, and call payload instead of call r11, it only takes around 1.74 seconds, which is expected given the total number of executed instructions.
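
A minimal sketch of that fast variant's inner loop (assuming the rest of smc.asm is unchanged; the store still writes 64 bytes every iteration, but the call targets the untouched copy of payload in .text instead of the freshly written buffer):

.loop:
dec rax
je .end

mov rcx, 64
mov rdi, r11 ; r11 stays at the buffer start (0-byte step)
lea rsi, [rel payload]
rep movsb

mfence

call payload ; the executed code itself is never modified

jmp .loop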

What is causing this huge number of machine clears, and how can I work around that?

  • Remember CPUs are aggressively out-of-order; they're fetching / decoding far ahead of where they're currently executing, so branch prediction can allow that buffer to be in the pipeline already when stores to it are executing. Stores snooping "code" addresses in or near the pipeline -> SMC stall. See [Observing stale instruction fetching on x86 with self-modifying code](https://stackoverflow.com/q/17395557) for more about the conditions that trigger it. Modern CPUs (Skylake) are pretty good about only SMC stalling on writes within one cache line of code, not a whole page. – Peter Cordes Feb 05 '21 at 11:44
  • How closely does this reflect any real use-case? Why not execute the 64 bytes in place instead of copying them? (Trying to avoid a branch mispredict by having a data dependency?) If you're just looking for an explanation for an artificial microbenchmark, well this seems totally normal to me based on the known mechanism for SMC detection in a heavily pipelined CPU with branch prediction. – Peter Cordes Feb 05 '21 at 11:47
  • @PeterCordes There probably is not yet a "real" use-case; I am playing around with this in an early stage of a research project, where I want to prevent an attacker from using cache side-channels for tracking execution, by copying all executed code to a common location. But this is still very experimental, and I am mostly curious what is causing the observed behavior, as it doesn't really make sense to me. I would have expected that adding serializing instructions after the store should fix that prefetching problem, but it turns out that this is not the case :/ – janw Feb 05 '21 at 11:58
  • Yeah, that's slightly less obvious. But barriers only take effect at the issue/rename stage, because it's the gateway into the first out-of-order part of the CPU (the back-end, the ROB + RS). So mfence doesn't stop later instructions from being fetched and decoded. (Or cause already-fetched instructions to be discarded). It only stops them from issuing into the back-end. – Peter Cordes Feb 05 '21 at 12:03
  • And that's only because MFENCE was over-strengthened in a microcode update; it's not required and `lock or` full barriers don't fully block OoO exec: [Are loads and stores the only instructions that gets reordered?](https://stackoverflow.com/a/50496379) / [Does lock xchg have the same behavior as mfence?](https://stackoverflow.com/q/40409297). To portably get the effect on early CPUs like Haswell, you might need mfence + lfence. Or an official *serializing* instruction like CPUID. (Worth trying CPUID on your Kaby Lake, but I doubt it would discard fetches from before it reached decoders) – Peter Cordes Feb 05 '21 at 12:05
  • Oh, that's interesting, I mostly believed that `mfence` serves as a full barrier, as it works well on performance measurements with `rdtsc`. I will look into that, thanks for the pointers :) However, I was already a bit suspicious, and thus tried `cpuid` - same result, still lots of machine clears. This was what surprised me most, and finally led to this question. I will do some more experiments on newer machines, and see whether they show the same behavior. – janw Feb 05 '21 at 12:16
  • MFENCE is a full barrier *for memory*. That's the standard meaning of the phrase "full barrier" - I think you meant to say that you thought MFENCE was guaranteed to also be an execution barrier like LFENCE, rather than only having that property as an implementation detail on Skylake. Thanks for checking CPUID, I was pretty sure it wouldn't avoid machine clears, but was curious whether a "serializing instruction" could be stronger in this case than draining the ROB + store buffer. It seems not. (Normally for `rdtsc` one uses `lfence` to only drain ROB, not store buffer, unless you want that) – Peter Cordes Feb 05 '21 at 12:26
  • Err yes, that is what I meant, and what I experienced. At some point, IIRC while doing some cache measurements, I moved from `lfence` to `mfence`, and since there did not appear to be a difference, I stayed with that. Interesting to see that this is merely an implementation detail rather than guaranteed behavior. – janw Feb 05 '21 at 12:40
  • MFENCE is slower than LFENCE (especially if there are some queued cache-miss stores you *don't* want to measure in your timed interval) so I'd tend to go with LFENCE. Fun fact: on AMD before Spectre mitigation was a thing, LFENCE wasn't guaranteed to be an execution barrier. Now AMD has a toggle for that, and OSes set it so they can use LFENCE for Spectre mitigation before some branches. – Peter Cordes Feb 05 '21 at 12:49

0 Answers