25

I like examples, so I wrote a bit of self-modifying code in C...

#include <stdio.h>
#include <sys/mman.h> // POSIX; MAP_ANONYMOUS is a common extension

int main(void) {
    unsigned char *c = mmap(NULL, 7, PROT_READ|PROT_WRITE|PROT_EXEC,
                            MAP_PRIVATE|MAP_ANONYMOUS, -1, 0); // get executable memory
    if (c == MAP_FAILED) return 1;
    c[0] = 0b11000111; // mov (x86_64), immediate mode, full-sized (32 bits)
    c[1] = 0b11000000; // to register eax (000), which holds the return value
                       // according to the x86_64 calling convention
    c[6] = 0b11000011; // return
    for (c[2] = 0; c[2] < 30; c[2]++) { // incr immediate data after every run
        // rest of immediate data (c[3:6]) is already zeroed by MAP_ANONYMOUS
        printf("%d ", ((int (*)(void)) c)()); // cast c to func ptr, call it
    }
    putchar('\n');
    return 0;
}

...which works, apparently:

>>> gcc -Wall -Wextra -std=c11 -D_GNU_SOURCE -o test test.c; ./test
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29

But honestly, I didn't expect this to work at all. I expected the instruction containing the immediate at c[2] to be cached upon the first call to c, after which all subsequent calls would ignore the repeated changes made to c (unless I somehow explicitly invalidated the cache). Luckily, my CPU appears to be smarter than that.

I guess the CPU compares RAM (assuming c even resides in RAM) with the instruction cache whenever the instruction pointer makes a large-ish jump (as with the call to the mmapped memory above), and invalidates the cache when they don't match (all of it?), but I'm hoping for more precise information. In particular, I'd like to know whether this behavior can be considered predictable (barring any differences of hardware and OS) and relied upon?
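(If it turns out not to be guaranteed, I suppose the portable escape hatch would be GCC's `__builtin___clear_cache` builtin, which as far as I can tell compiles to nothing on x86 but emits the required cache-maintenance instructions on architectures that need them; something like this, if I read the docs right:)

// hypothetical fallback: tell the compiler the bytes in [c, c+7) were
// modified and must be coherent with instruction fetch before the next call
__builtin___clear_cache((char *) c, (char *) c + 7);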

(I probably should refer to the Intel manual, but that thing is thousands of pages long and I tend to get lost in it...)

Will
  • What environment/compiler do you have where `mmap` and the odd `0b...` binary syntax (not valid C) works? – R.. GitHub STOP HELPING ICE Jun 12 '12 at 01:14
  • `mmap` is pure POSIX, but the `0b...` stuff looked like some legacy DOS compiler thing... I had no idea GCC had it. – R.. GitHub STOP HELPING ICE Jun 12 '12 at 01:54
  • @WillBuddha: mmap is totally absent from both the C11 and GNU standards -- it's part of POSIX, which is a completely independent standard. If your system supports POSIX, it will support mmap regardless of what compiler flags you use. If it doesn't support POSIX, mmap (probably) won't work, regardless of what -std flag you use. – Chris Dodd Jun 12 '12 at 02:19
  • Strictly speaking, a feature test macro (usually specified in the form `-D_POSIX_C_SOURCE=200809L` or `-D_XOPEN_SOURCE=700`) is needed to get POSIX interfaces. – R.. GitHub STOP HELPING ICE Jun 12 '12 at 02:25
  • Similar: http://stackoverflow.com/questions/1756825/how-can-i-do-a-cpu-cache-flush You should work in pure assembly instead of C to better understand the x86 part. – Ciro Santilli OurBigBook.com Nov 08 '15 at 10:36

5 Answers

27

What you are doing is usually referred to as self-modifying code. Intel's platforms (and probably AMD's too) maintain i/d-cache coherency for you, as the manual points out (Manual 3A, System Programming):

11.6 SELF-MODIFYING CODE

A write to a memory location in a code segment that is currently cached in the processor causes the associated cache line (or lines) to be invalidated.

But this assertion is valid only as long as the same linear address is used for modifying and fetching, which is not the case for debuggers and binary loaders, since they don't run in the same address space:

Applications that include self-modifying code use the same linear address for modifying and fetching the instruction. Systems software, such as a debugger, that might possibly modify an instruction using a different linear address than that used to fetch the instruction, will execute a serializing operation, such as a CPUID instruction, before the modified instruction is executed, which will automatically resynchronize the instruction cache and prefetch queue.
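In practice, such a serializing operation can be issued from plain C with a bit of inline assembly. Here is a minimal sketch for GCC on x86-64 (any CPUID leaf will do; the point is the instruction's serializing side effect, plus the clobbers that keep the compiler honest):

static inline void serialize_cpu(void) {
    unsigned int eax = 0, ebx, ecx, edx;
    // cpuid is architecturally serializing: it drains the pipeline and
    // resynchronizes instruction fetch with prior stores
    __asm__ volatile ("cpuid"
                      : "+a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx)
                      :
                      : "memory");
}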

By contrast, serializing operations are explicitly required on many other architectures, such as PowerPC, where the synchronization must be done by hand (E500 Core Manual):

3.3.1.2.1 Self-Modifying Code

When a processor modifies any memory location that can contain an instruction, software must ensure that the instruction cache is made consistent with data memory and that the modifications are made visible to the instruction fetching mechanism. This must be done even if the cache is disabled or if the page is marked caching-inhibited.

It is interesting to note that PowerPC requires a context-synchronizing instruction even when caches are disabled; I suspect this enforces a flush of deeper data-processing units such as the load/store buffers.
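On PowerPC the canonical per-cache-line recipe is dcbst/sync/icbi/isync. A sketch in GCC inline assembly (untested here, so treat it as an illustration of the E500 manual's requirement rather than production code):

static inline void ppc_sync_icache(void *p) {
    __asm__ volatile ("dcbst 0,%0\n\t"  // write the data cache line back to memory
                      "sync\n\t"        // wait for the write-back to complete
                      "icbi 0,%0\n\t"   // invalidate the stale instruction cache line
                      "isync"           // discard any already-fetched instructions
                      :: "r" (p) : "memory");
}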

The code you proposed is unreliable on architectures without snooping or advanced cache-coherency facilities, and is therefore likely to fail there.

Hope this helps.

Benny
  • Related: [Observing stale instruction fetching on x86 with self-modifying code](https://stackoverflow.com/q/17395557) - current x86 CPUs are in practice stronger than the manual guarantees. Even from different virtual mappings of the same physical page, you can't get them to execute a stale instruction after a store. And yes, x86 is rare in having a coherent I-cache (and pipeline); this all evolved to maintain backwards compatibility with existing code in the wild that worked on early non-pipelined x86 CPUs like the 8086 and 386, unlike other ISAs, which were pipelined from the start. – Peter Cordes Oct 12 '22 at 20:12
6

It's pretty simple: a write to an address that's present in one of the instruction cache's lines invalidates that line from the instruction cache. No "synchronization" is involved.

R.. GitHub STOP HELPING ICE
  • Invalidating it from the icache is almost never enough; it could already be somewhere along the pipeline. If your system preserves a relatively strict memory ordering, you'll also need a deep flush to clear any old copy of that code line, and any younger dependent calculation (basically everything). – Leeor Jun 25 '13 at 21:07
  • @Leeor: As this question is specifically about x86, I'd like to add that as far as I know, the automatic invalidation of the cache on Intel's processors is accompanied by a deep flush, so SMC just works (though at a high cost to performance). – Nathan Fellman Aug 22 '13 at 19:30
  • It would be more correct to say "it triggers a Self-Modifying-Code machine nuke (aka pipeline flush)". Intel CPUs have a perf counter event for that. (Something like `machine_nuke.smc`, IIRC). Also, I think I recall reading that a `call` or `jmp` instruction like the OP's code contains is essential for guaranteed detection of SMC. A store that modifies the next instruction after itself might not have an immediate effect on some CPUs. – Peter Cordes Jul 11 '16 at 00:36
5

By the way, many x86 processors (that I worked on) snoop not only the instruction cache but also the pipeline and the instruction window, i.e. the instructions that are currently in flight. So self-modifying code will take effect by the very next instruction. But you are encouraged to use a serializing instruction like CPUID to ensure that your newly written code will be executed.

Krazy Glew
4

The CPU handles cache invalidation automatically; you don't have to do anything manually. Software can't reasonably predict what will or will not be in the CPU cache at any point in time, so it's up to the hardware to take care of this. When the CPU sees that you have modified data, it updates its various caches accordingly.

bta
  • It isn't necessarily fully automatic. For other processors, e.g. ARM, you may need to insert a special instruction to invalidate the pipeline/cache. – starblue Jun 12 '12 at 11:35
  • This isn't true for the instruction cache in Intel processors. Writing to the code segment **does not** always invalidate the L1 code cache and iTLB. Special care should be taken when writing self-modifying code. – ugoren Jun 12 '12 at 12:01
  • @ugoren- In this case, though, the code shouldn't be in the i-cache yet because it was freshly created (due to MAP_PRIVATE being copy-on-write) and nothing has ever attempted to execute it. If this was an attempt to modify existing code and not create new code then yes, additional precautions may be necessary. Although for the sake of programmer sanity and portability, I would hope that 'mmap' and the compiler would take care of this for you as much as possible. – bta Jun 12 '12 at 12:19
4

I just reached this page in one of my searches and want to share my knowledge of this area of the Linux kernel!

Your code executes as expected and there are no surprises for me here. The mmap() syscall and the processor's cache-coherency protocol do this trick for you. The flags PROT_READ|PROT_WRITE|PROT_EXEC ask mmap() to set up the iTLB and dTLB of the L1 cache and the TLB of the L2 cache correctly for this physical page. This low-level, architecture-specific kernel code does it differently depending on the processor architecture (x86, AMD, ARM, SPARC, etc.). Any kernel bug here will mess up your program!
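As an aside, the same setup is often done in two steps, so that the page is never writable and executable at the same time. This is not what the OP's code does, just a common variation, using only POSIX mmap()/mprotect():

unsigned char *c = mmap(NULL, 7, PROT_READ|PROT_WRITE,
                        MAP_PRIVATE|MAP_ANONYMOUS, -1, 0); // writable, not executable
/* ... write the machine code into c ... */
mprotect(c, 7, PROT_READ|PROT_EXEC); // swap write for execute before calling it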

This is just for explanation purposes. Assume that your system is not doing much and there are no process switches between "c[0] = 0b11000111;" and the final "putchar('\n');"... Also assume that you have 1K of L1 iCache and 1K of L1 dCache in your processor, plus some L2 cache in the core. (Nowadays these are on the order of a few MBs.)

  1. mmap() sets up your virtual address space and the iTLB1, dTLB1, and TLB2 entries.
  2. "c[0] = 0b11000111;" will actually trap (H/W magic) into kernel code, where your physical address is set up and all the processor's TLBs are loaded by the kernel. Then you are back in user mode, and your processor actually loads 16 bytes (H/W magic, starting at c[0]) into the L1 dCache and the L2 cache. The processor really goes to memory again only when you refer past those bytes (ignore predictive loading for now!). By the time you complete "c[6] = 0b11000011;", your processor has done two burst reads of 16 bytes each on the external bus. There are still no actual writes to physical memory; all writes so far happen within the L1 dCache (H/W magic, the processor knows) and the L2 cache, and the dirty bit is set for the cache line.
  3. "c[2]++;" becomes a STORE instruction in the assembly code, but the processor stores the result only in the L1 dCache and L2; it does not go to physical memory.
  4. Now let's come to the call through the function pointer. The processor again does an instruction fetch, from the L2 cache into the L1 iCache, and so on.
  5. The result of this user-mode program will be the same on any Linux on any processor, thanks to the correct implementation of the low-level mmap() syscall and the cache-coherency protocol!
  6. If you write this code in an embedded-processor environment without an OS's mmap() syscall to assist, you will see the problem you were expecting, because you are using neither the H/W mechanism (TLBs) nor a software mechanism (memory-barrier and cache-maintenance instructions); see the sketch below.
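For instance, on a bare-metal AArch64 system the software mechanism would look roughly like this (a hypothetical helper; the sequence follows the ARMv8 manual's recommended clean/invalidate recipe and is untested here):

static inline void sync_icache_line(void *p) {
    __asm__ volatile ("dc cvau, %0\n\t" // clean the dcache line to the point of unification
                      "dsb ish\n\t"     // wait for the clean to complete
                      "ic ivau, %0\n\t" // invalidate the corresponding icache line
                      "dsb ish\n\t"     // wait for the invalidate to complete
                      "isb"             // resynchronize the fetched instruction stream
                      :: "r" (p) : "memory");
}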
sukumarst
  • What is a *TLB2*? Normally, there aren't separate *TLB* entries for code/data; but I know the x86 is a little weird. The *TLB* is a separate *MMU* cache and not related to the *dcache* or *icache*. – artless noise Jun 19 '13 at 22:39
  • TLB2 => I mean the TLB of the L2 cache. TLBs exist for the MMU and for all levels of caches inside and/or outside of processor cores. The kernel should manage all of these TLBs properly in order to make effective use of the processor H/W. Cache TLBs are used by the processor H/W to take care of the cache-coherency protocol. The MMU TLB is used by the MMU for virtual-to-physical translation when the processor places a virtual address on the bus after missing in all levels of caches (typically L1 and L2, in some cases even L3). – sukumarst Jun 20 '13 at 23:08
  • "TLB exists for MMU and for all levels of Caches inside and/or outside of processor cores" - actually no, not for most CPUs. Not for caches that are physically indexed and physically tagged. Some x86 CPUs may have an L2 TLB, but that does not necessarily have anything to do with the L2 cache. As far as I know no x86 has an L3 TLB. However, one of my favorite implementations puts the L2 TLB and the L2 unified I/D cache in the same physical array - so that you have a single structure. // You may be thinking of GPUs, which often have virtual caches, with TLBs at each level. – Krazy Glew Aug 22 '13 at 18:42