
I have made an experiment in which a new thread executes shellcode containing this simple infinite loop:

NOP
JMP REL8 0xFE (-0x2)

This generates the following shellcode:

0x90, 0xEB, 0xFE

After this infinite loop there are other instructions, ending with a write that sets the destination byte back to -0x2 (making it an infinite loop again) and an absolute jump that sends the thread back to this infinite loop.

Now I was asking myself whether the jump instruction could be executed while the single destination byte is only partially overwritten by the other thread. For example, say the other thread overwrites the jump's destination (0xFE, or 11111110 in binary) with 0x0 (00000000) to release the thread from the infinite loop. Could the jump end up going to, say, 0x1E (00011110) because the destination byte wasn't completely overwritten at that nanosecond? Before asking this question here I ran the experiment myself in a C++ program and let it run for several hours without it ever missing a single jump. If you want to have a look at the code I made for this experiment, I have uploaded it to GitHub.
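
A minimal sketch of the kind of setup I mean (simplified, not the exact code from the GitHub repo; it assumes Linux/x86-64, that `mmap` will hand back a writable + executable page, and it omits error handling):

```cpp
#include <sys/mman.h>
#include <cstring>
#include <cstdio>
#include <thread>
#include <chrono>

int main() {
    // nop ; jmp rel8 -2 (back to the nop) ; ret
    unsigned char code[] = { 0x90, 0xEB, 0xFE, 0xC3 };

    // Writable + executable page (may be refused on hardened systems);
    // error handling omitted.
    void *mem = mmap(nullptr, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    std::memcpy(mem, code, sizeof(code));

    // The new thread enters the nop/jmp infinite loop.
    std::thread spinner(reinterpret_cast<void (*)()>(mem));
    std::this_thread::sleep_for(std::chrono::seconds(1));

    // Overwrite the jmp's rel8 displacement (offset 2) with 0x00 so the
    // jmp falls through to the ret and the thread leaves the loop.
    volatile unsigned char *patch = static_cast<unsigned char *>(mem) + 2;
    *patch = 0x00;

    spinner.join();
    std::puts("spinner left the loop");
    munmap(mem, 4096);
}
```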

According to this experiment, it seems to be impossible for an instruction to be executed while it is only partially overwritten. However, I have very little knowledge of assembly and of processors, which is why I am asking the question here: can anyone confirm my observation, please? Is it indeed impossible to have an instruction executed while it is being partially overwritten by another thread? Does anyone know why for sure?

Thank you very much for your help and knowledge on this; I did not know where to look for such information.

Pierre Ciholas
  • This loop is not very power friendly. You should use the two-byte `PAUSE` instruction (0xF3 0x90) instead of the one-byte `NOP` as a hint to the CPU that it is in a spin-wait loop. – 1201ProgramAlarm Dec 17 '17 at 21:01
  • Why are you using shellcode for this, anyway? I thought you were doing this as part of an exploit, but if not, see the stuff I added to my answer re: performance and why this sounds like a terrible idea vs. spinning on a data load. – Peter Cordes Dec 18 '17 at 04:36

1 Answer


No, byte stores are always atomic on x86, even for cross-modifying code.

See Observing stale instruction fetching on x86 with self-modifying code for some links to Intel's manuals on cross-modifying code, and maybe Reproducing Unexpected Behavior w/Cross-Modifying Code on x86-64 CPUs.

Of course, all the recommendations for writing efficient cross-modifying code (and running code that you just JIT-compiled) involve avoiding stores into pages that other threads are currently executing.


Why are you doing this with "shellcode", anyway? Is this supposed to be part of an exploit? If not, why not just write code in asm like a normal person, with a label on the `jmp` instruction so you can store to it from C by assigning to `extern char jmp_bytes[2]`?
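
A sketch of what that could look like, assuming GCC/Clang on Linux/x86-64 (the `mprotect` call is there because `.text` is normally mapped read-only; error handling mostly omitted):

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <chrono>

// The spin loop lives in a file-scope asm block so the jmp instruction gets
// a label that C++ code can store through.
__asm__(
    ".text\n"
    ".globl spin_loop, jmp_bytes\n"
    "spin_loop:\n"
    "    pause\n"
    "jmp_bytes:\n"
    "    jmp spin_loop\n"       // 2-byte short jmp back to the pause
    "    ret\n"
);

extern "C" char jmp_bytes[2];   // the 2-byte "jmp spin_loop"
extern "C" void spin_loop();

int main() {
    // Make the page containing the jmp writable as well as executable
    // (may be refused on hardened systems).
    long pagesz = sysconf(_SC_PAGESIZE);
    auto page = reinterpret_cast<std::uintptr_t>(jmp_bytes)
                & ~static_cast<std::uintptr_t>(pagesz - 1);
    if (mprotect(reinterpret_cast<void *>(page), pagesz,
                 PROT_READ | PROT_WRITE | PROT_EXEC) != 0)
        return 1;

    std::thread t(spin_loop);                      // spins in pause/jmp
    std::this_thread::sleep_for(std::chrono::seconds(1));

    jmp_bytes[1] = 0x00;   // rel8 = 0: the jmp now falls through to the ret
    t.join();
    std::puts("spinner released");
}
```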

And if this is supposed to be an efficient cross-thread notification mechanism... it isn't. Spinning on a data load and a conditional branch with a `pause` loop would allow a lower-latency exit from the loop than a self-modifying-code machine nuke that flushes the whole pipeline right when you want it to finally be doing something useful instead of wasting CPU time. A machine clear costs at least several times the delay of a simple branch miss.
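
For example, something along these lines (a minimal illustration, not tied to any particular codebase, using a `std::atomic` flag and `_mm_pause`):

```cpp
#include <atomic>
#include <thread>
#include <chrono>
#include <cstdio>
#include <immintrin.h>   // _mm_pause

std::atomic<bool> released{false};

void spin_wait() {
    // Poll a flag in memory; leaving the loop costs only a branch mispredict,
    // not a machine clear.
    while (!released.load(std::memory_order_acquire))
        _mm_pause();     // spin-wait hint to the CPU
    std::puts("released");
}

int main() {
    std::thread t(spin_wait);
    std::this_thread::sleep_for(std::chrono::seconds(1));
    released.store(true, std::memory_order_release);
    t.join();
}
```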

Even better, use an OS-supported condition variable so the thread can sleep instead of heating up your CPU (reducing the thermal headroom for the CPU to turbo above its rated clock speed when there is work to do).
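
For example (a minimal standard-C++ sketch):

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>
#include <chrono>
#include <cstdio>

std::mutex m;
std::condition_variable cv;
bool ready = false;

void waiter() {
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [] { return ready; });   // sleeps in the kernel until notified
    std::puts("woken up");
}

int main() {
    std::thread t(waiter);
    std::this_thread::sleep_for(std::chrono::seconds(1));
    {
        std::lock_guard<std::mutex> lk(m);
        ready = true;
    }
    cv.notify_one();
    t.join();
}
```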


The mechanism used by current CPUs is that if they detect a store near the EIP/RIP or near any instruction in flight in the pipeline, they do a machine clear (perf counter `machine_clears.smc`, aka machine nuke). The CPU doesn't even try to handle it "efficiently", but if you did a non-atomic store (e.g. actually two separate stores, or a store split across a cache-line boundary) the target CPU core could see it in different parts and potentially decode it with some bytes updated and other bytes not. But a single byte is always updated atomically, so tearing within a byte is not possible.
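
(On Linux you can count these with e.g. `perf stat -e machine_clears.smc ./your_program`, assuming an Intel CPU that exposes that event.)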

x86 on paper doesn't guarantee that, but as Andy Glew (one of the architects of Intel's P6 microarchitecture family) says, implementing stronger behaviour than the paper spec can actually be the most efficient way to meet all the required guarantees and run fast. (And/or avoid breaking existing code in widely used software!)

Peter Cordes