
I've been trying to understand the purpose of the 0x40 REX prefix in x64 assembly instructions. For instance, in this function prologue from Kernel32.dll:

enter image description here

As you can see, they encode push rbx as:

40 53      push        rbx 

But using just the 53h opcode (without the prefix) also produces the same result:

enter image description here

According to this site, the layout for the REX prefix is as follows:

enter image description here

So the 40h prefix doesn't seem to do anything. Can someone explain its purpose?

c00000fd
  • Seems like there are 2 questions here: 1) What does it do. 2) Why is it there? What it does (according to the references I'm reading) is nothing. So, why is it there? My first guess was the same as Nathan's: Some type of alignment/filler. But I don't see anything in that code that would benefit from an alignment there. So, here's a theory: Paging thru kernel32.dll, there's LOTS of `nop`s. It's almost like someone is trying to keep certain code at specific addresses. So maybe `rex push rbx` is patched over some code that was 1 byte shorter? – David Wohlferd May 10 '18 at 05:22
  • That's weird, `push rbx` has 64-bit operand size so it should be using `REX.W=1` (0x48) if they're going to pad with a REX prefix at all (not needed because `push` already defaults to 64-bit operand size). I guess that confirms that `REX.W=0` is safely ignored for `push` by all existing CPUs, though, if you found this in `kernel32.dll` on Windows. Oh, and NASM encodes `push r12` as `41 54`, i.e. using `REX.W=0,B=1`. Apparently I need to go update my answer on [How many bytes does the push instruction pushes onto the stack when I don't specify the operand size?](//stackoverflow.com/q/45127) – Peter Cordes May 11 '18 at 04:28
  • The link to the site is broken. – drudru Sep 21 '20 at 20:52
  • @duru, the link is now [X86-64 Instruction Encoding](https://www-user.tu-chemnitz.de/~heha/hsn/chm/x86.chm/x64.htm) – zhenguoli Apr 15 '21 at 04:34

2 Answers


The 04xh bytes (i.e. 040h, 041h, ..., 04fh) are indeed REX bytes. Each bit in the lower nibble has a meaning, as you listed in your question. The value 040h means that REX.W, REX.R, REX.X and REX.B are all 0. That means that adding this byte doesn't do anything to this instruction, because you're not overriding any default REX bits, and it's not an 8-bit instruction with AH/BH/CH/DH as an operand.

Moreover, the X, R and B bits all correspond to some operands. If your instruction doesn't consume these operands, then the corresponding REX bit is ignored.
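As a rough illustration of the point above, here is a minimal Python sketch that splits a REX byte into its four flag bits and decodes the one-byte `push r64` form (opcode `50+r`). The helper names `decode_rex` and `decode_push` are made up for this example, and the decoder deliberately handles nothing beyond that single opcode family.

```python
# 64-bit register names, indexed by the 4-bit register number.
REGS64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
          "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"]


def decode_rex(byte):
    """Split a REX prefix byte (0x40-0x4F) into its W, R, X and B bits."""
    if byte & 0xF0 != 0x40:
        raise ValueError(f"{byte:#04x} is not a REX prefix")
    return {
        "W": (byte >> 3) & 1,  # 1 = 64-bit operand size
        "R": (byte >> 2) & 1,  # extends ModRM.reg
        "X": (byte >> 1) & 1,  # extends SIB.index
        "B": byte & 1,         # extends ModRM.rm, SIB.base or the opcode register field
    }


def decode_push(code):
    """Decode a `push r64` (opcode 50+r), optionally preceded by one REX prefix."""
    code = bytes(code)
    rex_b = 0
    if len(code) == 2 and code[0] & 0xF0 == 0x40:
        rex_b = decode_rex(code[0])["B"]
        code = code[1:]
    if len(code) != 1 or code[0] & 0xF8 != 0x50:
        raise ValueError("not a push r64 encoding")
    # Only REX.B matters here: it supplies the high bit of the register number.
    return "push " + REGS64[(rex_b << 3) | (code[0] & 7)]


print(decode_push(bytes.fromhex("53")))    # push rbx
print(decode_push(bytes.fromhex("4053")))  # push rbx (the 40h prefix changes nothing)
print(decode_push(bytes.fromhex("4154")))  # push r12 (REX.B = 1)
```

With all four bits clear, `40 53` decodes to the same instruction as a bare `53`; only a set REX.B bit (as in `41 54`) changes which register is pushed.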

Peter Cordes
Nathan Fellman
  • Yeah, I know that. So why use it the way they did in the first function that I showed above? – c00000fd May 09 '18 at 18:44
  • Or, was their compiler using `40h` opcode as some sort of an alignment `nop`-type filler? – c00000fd May 09 '18 at 18:50
  • @HansPassant: Hah, interesting. [This (somewhat old) article](https://jpassing.com/2011/05/03/windows-hotpatching-a-walkthrough/) on hotpatching explains the purpose of nop-type instructions at the beginning of functions. Although in my example, the `40 53 push rbx` instruction is not just a dud the way the five `nop`s or `mov edi, edi` given in that article are. It actually has a purpose. It's just one byte longer than it needs to be. Am I missing something? – c00000fd May 10 '18 at 19:05
  • @c00000fd: yes, you're missing something. Microsoft used a redundant REX prefix to make an instruction longer *instead* of using a separate NOP instruction. This makes the code run faster. [What methods can be used to efficiently extend instruction length on modern x86?](//stackoverflow.com/q/48046814). When you hotpatch, you replace some early instructions with a `jmp` to new code, then maybe jump back to the rest of the function. That article you linked with 5x single-byte `nop` instructions is a bad plan; execution could be on the 2nd `nop` when you replace it with a `jmp`. – Peter Cordes May 11 '18 at 04:43
  • @c00000fd: The new code you `jmp` to can contain a `push rbx` if you overwrite that with a `jmp`. Maybe with a 2-byte `jmp short` to another `jmp` in some nearby padding between functions, so you can do it on the fly without worrying about replacing two short instructions with one long instruction, when a process might have RIP = the middle of what will become a `jmp`. – Peter Cordes May 11 '18 at 04:46
  • @Nathan: `0x40` has an effect for byte registers: it's needed to encode `mov al, sil` for example. (and that's why AH/BH/CH/DH aren't encodable in instructions with a REX prefix, so you can't encode `mov ah, sil`) But yes, for `push` and any opcode other than 8-bit operand-size instructions, `0x40` is redundant. – Peter Cordes May 11 '18 at 04:48
  • @PeterCordes: Oh man, good find. Thanks. Indeed the `40h` prefix is needed to convert `mov al, dh` into `mov al, sil`. So `REX.W=0` is not that "useless" after all. And I thought I am the only one who deals with this weird machine code stuff :) – c00000fd May 11 '18 at 05:48
  • Although I have to disagree with your previous statement on hotpatching, specifically the part about parallel execution hitting the 2nd `nop` or in some way creating a race condition. To explain: they do it from the kernel by raising the thread's IRQL to `CLOCK1_LEVEL`, which is way above 2, so it will pretty much stop all task switching. It is also high enough to preempt all interrupts. Before doing that, they also schedule CPU-specific DPCs on all CPUs except the one running that thread, to keep those DPCs busy. This basically turns the patching thread into a single-threaded environment for that short while. – c00000fd May 11 '18 at 05:54
  • @c00000fd: You might be interested in [Tips for golfing in x86/x64 machine code](//codegolf.stackexchange.com/q/132981). To squeeze every last byte out of a program / function, you have to know the machine-code rules :) – Peter Cordes May 11 '18 at 05:54
  • @c00000fd: Oh, I only skimmed. So the kernel checks that no stopped threads in the whole system were stopped between two instructions you're going to overwrite with the hot patch? And if it finds any it single-steps until they're out of the function intro? If you're going to make sure no other threads / cores are executing code while you patch it, then in theory you don't need padding at all; you can just copy a whole number of instructions into the new code you `jmp` to replicate whatever you overwrote. But it can simplify the hotpatching system to have some prologue similarity I guess. – Peter Cordes May 11 '18 at 06:02
  • @PeterCordes: Raising IRQL to that level (28 I think) will prevent anything else on the same CPU core from running. (Since the task scheduler itself runs at IRQL level 2.) The "trick" of keeping DPCs busy is needed to prevent other CPU cores from intervening. As for your second point, yes, technically they don't need any buffer at the beginning of a function. I guess the reason they use it (such as 5 nops) is so that they can undo the hotpatch if something goes wrong. This way there's no need to remember what was there before. It's just my guess. – c00000fd May 11 '18 at 06:06
  • Maybe @HansPassant can confirm? – c00000fd May 11 '18 at 06:07
  • @c00000fd: It's not concurrent execution that I'm worried about. Any sleeping thread could have stopped at any instruction (because of an interrupt -> context switch), and will fetch code from there when it wakes up and starts running again. If that's now the middle of a `jmp`, you're screwed. Concurrent execution isn't even a problem: x86-64 can atomically replace 8 bytes with `xchg`, or 16 bytes with `lock cmpxchg16b`, at any alignment, and instruction caches are coherent. So you don't have to worry about writing the `jmp` opcode + `rel32` separately or something. – Peter Cordes May 11 '18 at 06:20
  • @PeterCordes: Oh, sorry forgot to address that. I read it this morning, so you may want to double check it in the article. My guess is that they have access to the full snapshot of all contexts for the running threads. So they can make sure to "mutex" that specific buffer. Doesn't `ExLockUserBuffer` function do it? – c00000fd May 11 '18 at 06:26
  • @c00000fd: I have no idea what `ExLockUserBuffer` does. I don't do any low-level Windows programming, and googling that function name doesn't find documentation for it in the first page of results. But hotpatching is supported by the kernel, and performance isn't important (hotpatching is very rare, so it's ok if it takes some time). So sure, it could walk the task list and check if any tasks are stopped in an address range mapped to the part of the file you want to modify. And if they are, single-step them until they're not, or return failure. – Peter Cordes May 11 '18 at 06:32
  • @PeterCordes: Listen, I don't know either. It's just my guess after reading that article. Although I do agree with your previous statement that an x86-64 CPU can easily swap 8 bytes atomically. My guess is that Microsoft decided to abandon the hotpatching route (for whatever reason) and instead chose to just nag us to reboot after an update. – c00000fd May 11 '18 at 06:37
  • @c00000fd: I'm pretty sure Windows *does* do hot-patching these days. But I'm also pretty sure we're guessing wrong about how exactly the actual mechanism works / what it requires. :P – Peter Cordes May 11 '18 at 06:42
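The byte-register point raised in the comments above can be sketched in Python. This is a deliberately tiny decoder for the register-to-register form of `mov r/m8, r8` (opcode `88 /r`), which is one valid encoding of these instructions. The function and table names are hypothetical, and memory operands and every other opcode are out of scope.

```python
# Without a REX prefix, byte-register numbers 4-7 mean AH/CH/DH/BH.
LEGACY_BYTE_REGS = ["al", "cl", "dl", "bl", "ah", "ch", "dh", "bh"]
# With any REX prefix (even 0x40), numbers 4-7 mean SPL/BPL/SIL/DIL instead.
REX_BYTE_REGS = ["al", "cl", "dl", "bl", "spl", "bpl", "sil", "dil",
                 "r8b", "r9b", "r10b", "r11b", "r12b", "r13b", "r14b", "r15b"]


def decode_mov_r8_r8(code):
    """Decode register-direct `mov r/m8, r8` (88 /r) with an optional REX prefix."""
    code = bytes(code)
    rex = None
    if code and code[0] & 0xF0 == 0x40:
        rex, code = code[0], code[1:]
    if len(code) != 2 or code[0] != 0x88 or code[1] >> 6 != 0b11:
        raise ValueError("expected 88 /r with a register-direct ModRM byte")
    reg = (code[1] >> 3) & 7  # source register (ModRM.reg)
    rm = code[1] & 7          # destination register (ModRM.rm)
    if rex is None:
        names = LEGACY_BYTE_REGS
    else:
        names = REX_BYTE_REGS
        reg |= ((rex >> 2) & 1) << 3  # REX.R extends ModRM.reg
        rm |= (rex & 1) << 3          # REX.B extends ModRM.rm
    return f"mov {names[rm]}, {names[reg]}"


print(decode_mov_r8_r8(bytes.fromhex("88f0")))    # mov al, dh
print(decode_mov_r8_r8(bytes.fromhex("4088f0")))  # mov al, sil
```

The only difference between the two encodings is the `40` byte, so for 8-bit operands the mere presence of a REX prefix changes which register is selected, even with W, R, X and B all zero.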

I call this a dummy REX prefix, because it does nothing before a push or pop. I wondered whether it is allowed, and your experience shows that it is.

It is there because the people at Microsoft apparently generated the above code that way. I'd speculate that since the prefix is needed for the extra registers, they always generate it and didn't bother to remove it when it is not needed. Another possibility is that lengthening the instruction has a subtle effect on scheduling and/or alignment and can make the code faster. This of course requires detailed knowledge of the particular processor.

I'm working on an optimiser that looks at machine code. Dummy prefixes are helpful because they make the code more uniform; there are fewer cases to consider. Then, as a last step, superfluous prefixes can be removed, among other things.

  • The only performance benefit in this case is as an alternative to a separate long-`nop` instruction to give hotpatching something to replace, which would be even worse. See [comments on the other answer](https://stackoverflow.com/questions/50260055/what-is-the-purpose-of-the-40h-rex-opcode-in-asm-x64#comment87586170_50260124). Making instructions longer doesn't hurt much, but does bloat the I-cache footprint and can mean worse packing into uop cache lines. If the *average* instruction length was less than 2 in a block of 32 bytes of machine code, some padding could be good, but it isn't here. – Peter Cordes Oct 11 '19 at 14:19
  • I'm looking at a decompile of something generated with VS 2008 SP2, and it has 14,621 prologues starting with `40 53 48 83 EC` (`push rbx; sub rsp, x` for ppl other than peter). Though they'll happily use `44 55 53 56 57` ... seems to me that your hotpatching answer is on the mark. Also, there are only ~500 functions with prologues starting with 1 byte instructions, all of which (quick sample) were not actually functions. Total number of functions is ~ 110,000 so ... yeah. It's certainly a bonus for reverse engineers :) – Orwellophile May 17 '21 at 06:01
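A count like the one in the last comment could be approximated with a plain byte search. The Python sketch below is not how that commenter obtained the figure; `count_prologues` and the sample bytes are invented for illustration. A raw scan can also match in the middle of an unrelated instruction, so a real tool would start from known function entry points.

```python
def count_prologues(code, pattern=bytes.fromhex("40534883ec")):
    """Count occurrences of `pattern` (default: 40 53 48 83 EC) in `code`."""
    count = 0
    pos = code.find(pattern)
    while pos != -1:
        count += 1
        pos = code.find(pattern, pos + 1)
    return count


# Made-up sample: one function starting with 40 53 48 83 EC 20 and one
# starting with a bare 53 48 83 EC 20, separated by ret (C3) and int3 (CC).
sample = bytes.fromhex("40534883ec20c3cc534883ec20c3")
print(count_prologues(sample))  # 1
```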