
For instance, PUSH imm32 has the opcode 68h. Is it possible to use another number, for example 69h, to "represent" this instruction (assume this number is not used by any other instruction)?

By "represent", I mean wherever there is a PUSH instruction in the assembly, 69h will appear in the binary executable. When it is eventually being fetched and executed by the CPU, it will be transfer back to 68h.

I understand each opcode is specifically designed according to the CPU circuit, but is it possible to just use another hex number as a surrogate?

Of course I won't make any change to the CPU, and I still want the instruction to execute on the x86 architecture.

Update: why am I asking this question?

You probably know of the return-oriented programming (ROP) attack, which purposefully misinterprets the stream of machine code and takes advantage of the many C3 bytes (that is, ret) scattered through standard libraries. My initial thought was: if we could change the opcode of ret from C3 to some other code, preferably 2 bytes, then ROP would not work. I am not an expert in the architecture field, and I have since found that my idea won't work in reality. Thanks for all your responses.

  • If you want to change what ends up in the binary executable as a result of a particular instruction, you "just" need to modify/rebuild the compiler/assembler. There's source out there for gcc, gas, etc. It's weird, but you could do it. As for "When it is [...] executed by the CPU, it will be transferred back to 68h": who do you think is going to do the "transfer"? (See the sketch after these comments.) The CPU isn't going to "know" that 69 is supposed to be swapped back to 68. – David Wohlferd Jun 15 '18 at 04:53
  • Sure, you could make a modified version of x86 where instructions have different opcodes. You couldn't run code for that architecture on an x86 CPU, though. You can't do anything to make a normal x86 CPU decode differently. – Peter Cordes Jun 15 '18 at 04:53
  • You can modify the binary after loading it into memory, before executing it, but that's usually very error-prone and often requires special ways to prepare the binary areas to be encoded/decoded, to avoid accidental defects; and the instructions are decoded well before execution anyway. Once the CPU reads 0x69 from memory, it will decode 0x69; there's no way to interpose anything on an ordinary x86 PC. – Ped7g Jun 15 '18 at 06:51
  • In theory (if you were Intel or AMD) you could use a microcode update to change how *some* instructions decode, but single-uop instructions like `push imm32` are probably hard-wired in silicon without any indirection through reprogrammable lookup tables in current x86 hardware. – Peter Cordes Jun 15 '18 at 07:49
  • what are you trying to achieve? – fuz Jun 15 '18 at 09:19
  • with or without soldering around inside the CPU? – Tommylee2k Jun 15 '18 at 09:58
  • Thank you all for your comments! --David, "who do you think is going to do the transfer", this is exactly what I want to know. I understand that the instruction pointed to by $EIP will be fetched and fed into the CPU for execution, but what actually does the job: is it hardware or software? --Peter, yes, I understand that eventually the x86 CPU will only recognize 68h. What I am thinking is: is it possible to let 69h represent the instruction when it is loaded into memory, and before it is actually sent to the CPU for execution, we transfer it back to 68h? --Tommy, no change to the CPU, of course – SamTest Jun 15 '18 at 13:07
  • Use @username to notify people when you reply to them. But anyway, it's pure hardware that fetches code pointed to by EIP. There's no mechanism for customizing how that happens, or which opcodes decode to what. That's all hard-wired. That's *how* the hardware runs software. – Peter Cordes Jun 15 '18 at 13:38
  • @PeterCordes, thanks, I am new to this community. If using 69 to represent the instruction is impossible, would it be possible to use 00 68 instead of 68 to represent the instruction? In other words, when 68 is loaded into the instruction register, the most significant byte will be all 0, so there won't be any difference whether I load 00 68 or 68 into the instruction register; is this correct? – SamTest Jun 15 '18 at 13:51
  • No, of course that won't work either. `00` is the opcode for a memory-destination add. You could just try this with `db 0, 0x68, 1, 2, 3, 4` in a `.asm`, assemble it, and then disassemble the result. A disassembler will decode x86 machine code in software the same way the hardware would; that's the whole point. – Peter Cordes Jun 15 '18 at 14:15
  • @PeterCordes I think my question now is: is it possible to force 00 68 to be recognized as an instruction and loaded into the CPU, and when the CPU sees 68 and 00 68, will it behave the same? I know there are some 2-byte opcode instructions; what do they look like? I am searching online right now and trying to find answers myself. I would really appreciate any hints! – SamTest Jun 15 '18 at 17:34
  • No, I already explained why. I tried it, and the disassembler says that decodes as `00 68 00 add BYTE PTR [eax+0x0],ch`. A `0x0` is no less meaningful than a `0x68`. – Peter Cordes Jun 15 '18 at 18:08
  • On the face of it, this question seems ridiculous. What exactly are you trying to do? Are you trying to somehow obfuscate your program's executable code, and somehow have the CPU (or the instruction loader) de-obfuscate it for you? – Jim Mischel Jun 15 '18 at 21:31
  • @JimMischel Do not judge a question by its face :) I do have a legitimate reason to ask it, see my updated question. – SamTest Jun 19 '18 at 13:51
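
To make the comments' point concrete: the only place a 69h-to-68h "transfer" could happen is in software, on the code bytes, after loading and before the CPU ever fetches them. Here is a deliberately naive sketch in C (the `fixup_code` helper is hypothetical, not from any real loader), and it is unsound as written, because a 0x69 byte can just as well sit inside an immediate or ModRM byte of some other instruction; deciding which bytes are opcodes requires a full length-decoder, i.e. a disassembler:

```c
/* Deliberately naive "surrogate opcode" fix-up: rewrite every FAKE byte
 * back to REAL in a code buffer before jumping to it. Unsound as-is:
 * without length-decoding each instruction you cannot tell an opcode
 * byte from an immediate or ModRM byte that merely happens to equal FAKE. */
#include <stddef.h>

#define FAKE 0x69  /* the surrogate byte the question proposes */
#define REAL 0x68  /* the real opcode of push imm32            */

void fixup_code(unsigned char *code, size_t len) {
    for (size_t i = 0; i < len; i++)
        if (code[i] == FAKE)       /* blind swap: also hits data bytes! */
            code[i] = REAL;
}
```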

2 Answers


My initial thought was: if we could change the opcode of ret from C3 to some other code, preferably 2 bytes, then ROP would not work.

No, x86 instruction encodings are fixed, and mostly hard-wired in the silicon of the decoders inside the CPU. (Micro-coded instructions redirect to microcode ROM for the definition of the instruction, but the opcode that's recognized as an instruction is still hard-wired.)

I think even a microcode update from Intel or AMD couldn't change their existing CPUs to not decode C3 as ret. (Although possibly they could make some other multi-byte sequence also decode as a very slow micro-coded ret, but probably only by taking over the encoding for an existing micro-coded instruction.)


A CPU that didn't decode C3 as ret would not be an x86 CPU anymore. Or I guess you could make it a new mode, where instruction encodings were different. It wouldn't be binary-compatible with x86 anymore, though.

It's an interesting idea, though. Single-byte RET on x86 makes it significantly easier to chain gadgets together (https://en.wikipedia.org/wiki/Return-oriented_programming#On_the_x86-architecture). (Or means there are many more gadgets that can be chained, giving you a larger toolbox.)
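
To illustrate (a standard textbook byte sequence, not taken from the question): a c3 that is data in one decoding becomes a ret in another, so jumping into the middle of an innocent instruction yields a usable gadget.

```c
/* The c3 below is just part of the imm32 of a harmless mov, but decoding
 * from bytes[1] instead of bytes[0] yields "pop ebx; ret": a ready-made
 * ROP gadget hiding inside an innocent instruction. */
static const unsigned char bytes[] = {
    0xb8, 0x5b, 0xc3, 0x90, 0x90   /* mov eax, 0x9090c35b */
};
```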

I wouldn't hold my breath waiting for CPU vendors to provide a new mode where ret uses a 2-byte opcode. It would be possible, though (for CPU vendors to make a new design, not for you to hack your existing CPU). By making it a separate mode (like 64-bit long mode vs. 32-bit compat mode under a 64-bit kernel, vs. "legacy mode" with a 32-bit kernel) OSes would still work on such CPUs, and you could mix/match user-space processes under the same kernel, some compiled for x86 and some for new86.

If vendors were going to introduce a new incompatible mode that couldn't run existing binaries, hopefully they'd make other cleanups to the instruction set. e.g. removing the false dependency on FLAGS for variable count shifts by having them always write FLAGS even if the count = 0. Or redoing the opcodes entirely to not spend so much coding space on 1-byte xchg eax, r32, and shorten the encodings for SIMD instructions. But then they couldn't share as many decoder transistors with the regular x86 decoders. And any changes like EFLAGS semantics for shifts could require changes in the back-end, not just the decoders.

They could also make [rsp+disp8/32] addressing modes 1 byte shorter, maybe using a different register as the one that always needs a SIB byte even with no index. (-fomit-frame-pointer is typical now, so it sucks that addressing relative to the stack-pointer costs an extra byte.)

See Agner Fog's Stop the instruction set war blog post for more details about how much of a mess x86 instruction encoding is.


How much change to the CPU circuit design would be required at minimum to make c3 the start of a 2-byte instruction that required the 2nd byte to be 00?

Intel CPUs decode in multiple stages:

  • The instruction-length pre-decoder finds instruction boundaries, placing instruction bytes in a queue (processing up to 16 bytes or 6 instructions, whichever is lower, per cycle). See https://www.realworldtech.com/sandy-bridge/3/ for a block diagram.

  • The decoders grab 4 (or 5 in Skylake) instructions from that queue, and feed them in parallel to the actual decoders. Each one outputs 1 or more uops. (See the next page in David Kanter's SnB writeup).

Some CPUs mark instruction boundaries in the L1i cache, and do this decoding as a line arrives from L2. (AMD did this more recently than Intel, but IIRC Ryzen doesn't, and Intel hasn't in P6 or SnB-family. See Agner Fog's microarch guide.)

The fact that c3 is a one-byte opcode with no following bytes is hard-wired into the instruction-length decoders, so that would have to change.

But then how to handle the 2nd byte? You could either have the decoder that gets c3 xx check that xx == 00 and raise a #UD exception if not (UnDefined instruction, aka illegal instruction).

Or it could decode it as an imm8 operand, and have an execution unit check that the operand was 0.

It's probably easier to have the decoders do this mode-dependent check on the next byte, because they have to decode other insns differently for different modes anyway.
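
As a software model of that decision (purely hypothetical; real pre-decoders examine many bytes in parallel rather than branching like this), the mode-dependent check might look like:

```c
#include <stdbool.h>

/* Hypothetical model of length-decoding a ret in a mode where ret is the
 * two-byte sequence c3 00. Returns the instruction length in bytes, or -1
 * to signal #UD. Callers handle opcodes other than c3 elsewhere. */
int ret_length(const unsigned char *p, bool ret2_mode) {
    if (!ret2_mode)
        return 1;              /* legacy x86: c3 alone is a complete ret */
    /* Note: even this read of p[1] is the page-crossing hazard
     * discussed further down. */
    return (p[1] == 0x00)      /* new mode: c3 must be followed by 00 ... */
           ? 2
           : -1;               /* ... otherwise raise #UD                 */
}
```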


00 isn't "special". The regular decoders probably receive instruction bytes in a wide input that's probably 15 bytes long (max x86 instruction length). But there's no reason to assume they would look at bits/bytes past the instruction length and fault if it wasn't zero-extended. It might be designed that way, but just as likely the handing for 1-byte opcodes like c3 is hard-wired and doesn't have any higher bits ANDed, ORed, or XORed with any of the opcode bits.

An opcode or whole insn isn't an integer that has to be zero-extended. You can't assume that there's anything like an "instruction register".


Making c3 xx not decode as ret for xx!=0 would still break essentially all existing binaries, and still require a new mode if you were making a CPU that could operate that way.

On CPUs that mark instruction boundaries in L1i cache, always treating ret as a 2-byte instruction (not including prefixes) wouldn't work. It's not that rare for the byte right after a ret to be a jump target, or a different function. Jumping to the "middle" of another instruction would force such a CPU to redo the instruction-boundary marking, starting from that point in the cache line, and then you'd have another problem when you ran the ret again.

Also, a c3 in the last byte of a page, followed by an unmapped page, must not page-fault. But that would happen if the instruction-length decoding stage always fetched another byte after c3 before letting it decode. (Running code from uncacheable memory would also make this count as an observable change; UC is the CPU equivalent of volatile.)
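
That corner case is easy to demonstrate on a real CPU. A minimal Linux/x86-64 sketch (it assumes mmap will hand back writable+executable memory, which some hardened systems refuse, and it casts a data pointer to a function pointer, which is fine in practice for a demo like this):

```c
/* A lone c3 (ret) in the very last byte of a mapped page, with nothing
 * mapped after it, must execute without faulting. A decoder that always
 * fetched one byte past c3 would turn this into a spurious page fault. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    long ps = sysconf(_SC_PAGESIZE);
    /* Map two pages so we control what follows, then unmap the second. */
    unsigned char *buf = mmap(NULL, 2 * ps, PROT_READ | PROT_WRITE | PROT_EXEC,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;
    munmap(buf + ps, ps);          /* the byte after the ret is unmapped */

    memset(buf, 0x90, ps);         /* nop sled leading up to ...         */
    buf[ps - 1] = 0xC3;            /* ... a ret in the page's last byte  */

    ((void (*)(void))buf)();       /* runs the sled, then the ret        */
    puts("ret at end of page executed without faulting");
    return 0;
}
```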

I suppose you could maybe have the length-decoding stage tack on a fake 00 byte for the decoders if running in a mode where ret is a single byte. ret is an unconditional jump, but it can fault if [rsp] isn't readable. But I think the exception frame would just have the start address of the instruction, not a length, so it might be OK for the rest of the pipeline to think it was a 2-byte instruction when it was actually only 1.

But it still has to go in the uop-cache somehow, and the uop cache needs to care about insn start/end addresses even for unconditional jumps. For an instruction that spans a 64-byte cache-line boundary, it would need to invalidate the instruction if either line changed.

My understanding is that real-life CPU design is always harder and more complex than you imagine from looking at block diagrams like David Kanter's articles.


And BTW, it's not particularly relevant how small a change in the decoders would be needed. The fact that only a CPU vendor could make this change in a new design makes your idea a total non-starter, outside of instruction-set design ideas. It's slightly more plausible than a complete re-organization of x86 machine code, because it can still share almost all of the decoder transistors with existing modes.

Supporting a whole new mode for this would be significant, requiring changes to the CPU's code segment descriptor (GDT entry) decoding.

It would be a much easier change to create a CPU that always requires c3 to be followed by 00, but then it wouldn't be an x86 and couldn't run the vast majority of code. There's zero chance of Intel or AMD ever selling a CPU like that.

Peter Cordes
  • Thanks for your detailed explanation. What I am thinking is that there doesn't even need to be a complete "redesign" of the CPU's circuit; it could be a slight change to how it recognizes an opcode, which should be the CU's task. For instance, I don't really need to change C3 to AB CD to represent ret; I can simply use C3 00 to represent ret. And when C3 00 is loaded into the instruction register, it still acts as C3, so there's no need to change the wiring of the CPU, just make it recognize C3 00 rather than C3. – SamTest Jun 19 '18 at 16:15
  • @ming: *No*, that's not how decoding works. x86 instructions are variable length, and don't silently eat up surrounding zeros. There's nothing special about `00` vs. any other byte in x86 machine code. There is no "instruction register" like on a RISC with fixed-width instructions. `00 c3` decodes as [an `add` with c3 being the ModRM byte](http://felixcloutier.com/x86/ADD.html). `c3 00` decodes as a ret, leaving the `00` as the start of the next instruction (an `add`). – Peter Cordes Jun 19 '18 at 16:20
  • @ming: But anyway, leaving a padding `00` after `ret` in your code doesn't help at all. `c3` bytes in the middle of other instructions will still execute as `ret`, regardless of the next byte, unless you have a modified CPU that's not binary-compatible with x86. Leaving extra padding after `ret` in your code doesn't stop attackers from finding gadgets. – Peter Cordes Jun 19 '18 at 16:21
  • yes I understand what you said, and that's why I've given up thinking about this. But I feel you have slightly misunderstood what I'm saying. What I'm trying to say is: if we can make the CPU NOT recognize C3 as ret, and instead recognize C3 00 as ret, this would solve the problem, because there are not many C3 00 sequences in the libraries. And by doing this, we would not actually need to change the CPU circuit design, because C3 and C3 00 essentially have the same effect on the transistors. – SamTest Jun 19 '18 at 18:09
  • I understand an instruction execution cycle is fetch-decode-execute; if we could modify the fetch phase so that when it sees C3 it does not consider it a complete instruction and instead tries to load C3 00 into the instruction register, that would be all that's needed. – SamTest Jun 19 '18 at 18:15
  • @ming: The fetch and pre-decode pipeline stages are part of the CPU's circuit design! Also, fetch-decode-execute is over-simplified. I updated my answer with more about what Intel or AMD would have to add to make a CPU support this. But remember you'd need a new mode where it decodes this way, otherwise you can't run existing software. And that would require more HW changes. – Peter Cordes Jun 19 '18 at 19:57
  • Thank you for your explanation! – SamTest Jun 20 '18 at 15:35

In theory yes...

You could use the Undefined Opcode exception (#UD), provided you can find a spare opcode (there aren't too many free spots, though). The exception handler would rewrite the memory location with the proper opcode and re-execute from there.

But that would leave the "good" opcode in that memory location. You could set a single-step interrupt handler to change the opcode stored in memory back to the "fake" one after the "good" opcode has executed, and disable it afterwards so as not to hurt performance.

Additionally, the fake opcode has to be the same size as (or longer than) the proper one; otherwise you would have to protect the following instructions from being corrupted (overwritten by the "good" opcode). If the fake is longer than the true replacement instruction, the extra space could be NOP-padded.

Needless to say, it is cumbersome AF. It would be quite simple under DOS, but for modern OSes it is almost a no-go solution.
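
A user-space analogue of this scheme on Linux/x86-64 might look like the following sketch. It rests on several assumptions: 0x06 (the old push es) is a free one-byte encoding that raises #UD in 64-bit mode; the code page is left writable so the handler can patch it; and the single-step pass that would restore the fake byte afterwards is omitted for brevity.

```c
/* Sketch of the #UD fix-up scheme: a "fake" one-byte opcode traps with
 * SIGILL, the handler patches the real ret (c3) over it, and returning
 * from the handler re-executes the (now valid) instruction. */
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <sys/mman.h>
#include <ucontext.h>
#include <unistd.h>

#define FAKE_RET 0x06              /* old "push es": #UD in 64-bit mode */

static void on_sigill(int sig, siginfo_t *si, void *ctx) {
    ucontext_t *uc = ctx;
    unsigned char *rip = (unsigned char *)uc->uc_mcontext.gregs[REG_RIP];
    if (*rip == FAKE_RET)
        *rip = 0xC3;               /* patch in ret; sigreturn re-executes */
}

int main(void) {
    struct sigaction sa = { .sa_flags = SA_SIGINFO };
    sa.sa_sigaction = on_sigill;
    sigaction(SIGILL, &sa, NULL);

    long ps = sysconf(_SC_PAGESIZE);
    unsigned char *code = mmap(NULL, ps, PROT_READ | PROT_WRITE | PROT_EXEC,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (code == MAP_FAILED)
        return 1;
    code[0] = FAKE_RET;            /* the "encrypted" ret               */
    ((void (*)(void))code)();      /* traps, gets patched, then returns */
    puts("fake opcode was fixed up and executed as ret");
    return 0;
}
```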

Anty
  • If the "fake" is shorter, the handler could know to increment the exception-return's saved RIP/EIP by the length of the instruction you're emulating, rather than the length of the illegal instruction you want to skip. – Peter Cordes Jun 15 '18 at 14:18
  • Or you skip the instruction in the exception handler and emulate the real instruction there. This leaves the memory untouched, but is way more complicated to implement correctly. – sivizius Jun 16 '18 at 11:53