6

I know that there are libraries that can "parse" binary machine code / opcode to tell the length of an x86-64 CPU instruction.

But I'm wondering: since the CPU has internal circuitry to determine this, is there a way to use the processor itself to tell the instruction size from binary code? (Maybe even a hack?)

MikeF
  • 1,021
  • 9
  • 29
  • Well kind of. Do you want it to be an automated process? – harold Jul 26 '18 at 19:39
  • @harold:Yes, it'd be nice to know this from a program/code, i.e. programmatically. – MikeF Jul 26 '18 at 19:40
  • 2
    The main trick is putting the instruction just before an inaccessible page, trying different offsets until the instruction fetch triggers a page fault. But it's tricky to make it robust. – harold Jul 26 '18 at 19:44
  • @harold: Interesting. Thanks. By "instruction fetch" you mean "executing it", right? – MikeF Jul 26 '18 at 20:03
  • For instance, installing a trampoline (for hotpatching a binary function): knowing only the instruction length is not enough. First, there can be a jmp/call at the very beginning of the function - and what if that jmp is only 2 or 5 bytes, say? That changes the instruction stream. There can also be RIP-relative instructions at the very beginning; those can't simply be moved without fixing up the relative (RIP) offset in the instruction (and they can only be moved within a +/- 2GB distance). So simply knowing the instruction length is not enough for your task – RbMm Jul 27 '18 at 11:33

1 Answer

8

The Trap Flag (TF) in EFLAGS/RFLAGS makes the CPU single-step, i.e. take an exception after running one instruction.

So if you write a debugger, you can use the CPU's single-stepping capability to find instruction boundaries in a block of code. But only by running it, and if it faults (e.g. a load from an unmapped address) you'll get that exception instead of the TF single-step exception.

(Most OSes have facilities for attaching to and single-stepping another process, e.g. Linux ptrace, so you could maybe create an unprivileged sandbox process where you could step through some unknown bytes of machine code...)
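
Roughly, a ptrace-based length finder could look like this minimal sketch (Linux x86-64 only, untested, error handling omitted; the RIP delta is only the instruction length for straight-line code, not for jumps/calls):

```c
// Single-step a child with ptrace and diff RIP to recover instruction lengths.
#include <signal.h>
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t child = fork();
    if (child == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        raise(SIGSTOP);                  // let the parent attach before running
        /* ... jump into the unknown bytes here ... */
        _exit(0);
    }

    int status;
    waitpid(child, &status, 0);          // child is stopped at the SIGSTOP
    struct user_regs_struct regs;
    ptrace(PTRACE_GETREGS, child, NULL, &regs);
    unsigned long long prev_rip = regs.rip;

    for (;;) {
        ptrace(PTRACE_SINGLESTEP, child, NULL, NULL);   // TF-based single step
        waitpid(child, &status, 0);
        if (WIFEXITED(status) ||
            (WIFSTOPPED(status) && WSTOPSIG(status) != SIGTRAP))
            break;                       // exited, or faulted instead of stepping
        ptrace(PTRACE_GETREGS, child, NULL, &regs);
        printf("insn at %#llx: %llu bytes\n", prev_rip, regs.rip - prev_rip);
        prev_rip = regs.rip;
    }
    return 0;
}
```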

Or as @RbMm points out, you can use OS-assisted debug facilities to single-step yourself.
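
On Windows, that self-stepping can be done with a vectored exception handler, roughly like this untested outline (x64 only; the Win32 calls are real, but `StepHandler` is just an illustrative name):

```c
// Arm the Trap Flag from inside a VEH and log each instruction boundary.
#include <windows.h>
#include <stdio.h>

static LONG CALLBACK StepHandler(EXCEPTION_POINTERS *info) {
    switch (info->ExceptionRecord->ExceptionCode) {
    case EXCEPTION_BREAKPOINT:
        // DebugBreak()'s int3: step past it and arm the Trap Flag so every
        // following instruction raises a single-step exception.
        info->ContextRecord->Rip    += 1;
        info->ContextRecord->EFlags |= 0x100;          // TF
        return EXCEPTION_CONTINUE_EXECUTION;
    case EXCEPTION_SINGLE_STEP:
        // RIP is now the next instruction boundary in the traced code.
        printf("boundary at %p\n", (void *)info->ContextRecord->Rip);
        info->ContextRecord->EFlags |= 0x100;          // TF clears itself: re-arm
        return EXCEPTION_CONTINUE_EXECUTION;
    default:
        return EXCEPTION_CONTINUE_SEARCH;              // let SEH handle real faults
    }
}

int main(void) {
    AddVectoredExceptionHandler(1, StepHandler);
    DebugBreak();    // enter the handler once to turn tracing on
    /* ... run the unknown bytes here; one trap per instruction ... */
    return 0;
}
```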


@Harold and @MargaretBloom also point out that you can put bytes at the end of a page (followed by an unmapped page) and run them. See if you get a #UD, a page fault, or a #GP exception.

  • #UD: the decoders saw a complete but invalid instruction.
  • page fault on the unmapped page: the decoders hit the unmapped page before deciding that it was an illegal instruction.
  • #GP: the instruction was privileged or faulted for other reasons.

To rule out decoding+running as a complete instruction and then faulting on the unmapped page, start with only 1 byte before the unmapped page, and keep adding more bytes until you stop getting page faults.
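
A rough Linux sketch of that probe (using a PROT_NONE guard page as the "inaccessible" page; the helper name is made up, and since this really executes the bytes it belongs in a disposable sandbox process). Checking the faulting RIP is one way to tell an incomplete fetch apart from an instruction that decoded, ran, and then fell through into the guard page:

```c
#define _GNU_SOURCE
#include <setjmp.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <ucontext.h>
#include <unistd.h>

static sigjmp_buf back;
static void *guard_page;            // first inaccessible byte
static void *probe_start;           // where the candidate bytes begin
static volatile int fetch_ran_off_end;

static void handler(int sig, siginfo_t *si, void *ctx) {
    ucontext_t *uc = ctx;
    uintptr_t rip = (uintptr_t)uc->uc_mcontext.gregs[REG_RIP];
    // A fault on the guard page *while RIP is still at the probe* means the
    // decoder wanted more bytes than we supplied.  SIGILL means it saw a
    // complete-but-invalid instruction; a fault with RIP past the probe means
    // the instruction decoded, executed, and fell through to the guard page.
    fetch_ran_off_end = (sig == SIGSEGV &&
                         si->si_addr == guard_page &&
                         rip == (uintptr_t)probe_start);
    siglongjmp(back, 1);
}

// Returns nonzero if `len` bytes were not enough for the decoder.
static int needs_more_bytes(const unsigned char *bytes, size_t len) {
    long page = sysconf(_SC_PAGESIZE);
    unsigned char *buf = mmap(NULL, 2 * page,
                              PROT_READ | PROT_WRITE | PROT_EXEC,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    guard_page = buf + page;
    mprotect(guard_page, page, PROT_NONE);   // inaccessible: any access faults

    probe_start = buf + page - len;          // candidate ends at the boundary
    memcpy(probe_start, bytes, len);

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);
    sigaction(SIGILL,  &sa, NULL);

    fetch_ran_off_end = 0;
    if (sigsetjmp(back, 1) == 0)
        ((void (*)(void))probe_start)();     // jump into the candidate bytes
    munmap(buf, 2 * page);
    return fetch_ran_off_end;
}
```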

Breaking the x86 ISA by Christopher Domas goes into more detail about this technique, including using it to find undocumented illegal instructions, e.g. 9a13065b8000d7 is a 7-byte illegal instruction; that's when it stops page-faulting. (objdump -d just says 0x9a (bad) and decodes the rest of the bytes, but apparently real Intel hardware isn't satisfied that it's bad until it's fetched 6 more bytes).


HW performance counters like instructions_retired.any also expose instruction counts, but without knowing anything about the end of an instruction, you don't know where to put an rdpmc instruction. Padding with 0x90 NOPs and seeing how many instructions total were executed probably wouldn't really work because you'd have to know where to cut and start padding.


I'm wondering, why wouldn't Intel and AMD introduce an instruction for that

For debugging, normally you want to fully disassemble an instruction, not just find insn boundaries. So you need a full software library.

It wouldn't make sense to put a microcoded disassembler behind some new opcode.

Besides, the hardware decoders are only wired up to work as part of the front-end, in the code-fetch path; there's no path for feeding them arbitrary data, and they're already busy decoding instructions most cycles. An instruction that decodes x86 machine-code bytes would almost certainly be implemented by replicating that hardware in an ALU execution unit, not by querying the decoded-uop cache or L1i (in designs where instruction boundaries are marked in L1i), and not by sending data through the actual front-end pre-decoders and capturing the result instead of queueing it for the rest of the front-end.

The only real high-performance use-case I can think of is emulation, or supporting new instructions the way Intel's Software Development Emulator (SDE) does. But if you want to run new instructions on old CPUs, the whole point is that the old CPUs don't know about those new instructions.

The amount of CPU time spent disassembling machine code is pretty tiny compared to the amount of time CPUs spend doing floating-point math or image processing. There's a reason we have stuff like SIMD FMA and AVX2 vpsadbw in the instruction set to speed up those special-purpose things that CPUs spend a lot of time doing, but not for stuff we can easily do in software.

Remember, the point of an instruction-set is to make it possible to create high-performance code, not to get all meta and specialize in decoding itself.

At the upper end of special-purpose complexity, the SSE4.2 string instructions were introduced in Nehalem. They can do some cool stuff, but are hard to use. https://www.strchr.com/strcmp_and_strlen_using_sse_4.2 covers strcmp and strlen with them, and also strstr, which is a real use-case where pcmpistri can be faster than SSE2 or AVX2. (For strlen / strcmp, plain old pcmpeqb / pminub works very well if used efficiently; see glibc's hand-written asm.) Anyway, these new instructions are still multi-uop even in Skylake, and aren't widely used. I think compilers have a hard time auto-vectorizing with them, and most string processing is done in languages where it's not so easy to tightly integrate a few intrinsics with low overhead.
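
For reference, the pcmpeqb + pmovmskb idea looks roughly like this with intrinsics (a toy sketch, not glibc's actual code; it assumes the pointer is 16-byte aligned so the vector loads can't cross into an unmapped page, which is the part a real strlen has to handle):

```c
#include <emmintrin.h>   // SSE2 intrinsics
#include <stddef.h>

size_t strlen_sse2_sketch(const char *s) {
    const __m128i zero = _mm_setzero_si128();
    for (size_t i = 0; ; i += 16) {
        __m128i v = _mm_load_si128((const __m128i *)(s + i));   // aligned load
        // pcmpeqb: bytes equal to 0 become 0xFF; pmovmskb packs them into bits
        unsigned mask = _mm_movemask_epi8(_mm_cmpeq_epi8(v, zero));
        if (mask)
            return i + __builtin_ctz(mask);  // index of first zero byte (GCC/Clang builtin)
    }
}
```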


installing a trampoline (for hotpatching a binary function.)

Even this requires decoding the instructions, not just finding their length.

If the first few instruction bytes of a function use a RIP-relative addressing mode (or a jcc rel8/rel32, or even a jmp or call), moving them elsewhere will break the code. (Thanks to @RbMm for pointing out this corner case.)
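
As a concrete illustration, here is a hypothetical helper showing the kind of displacement fixup a trampoline installer needs when one of the displaced instructions is a rel32 jmp/call or uses a RIP-relative operand (the function name and the disp_offset convention are invented for this sketch):

```c
#include <stdint.h>
#include <string.h>

// old_insn / new_insn: original and relocated copies of the instruction
// insn_len:            length of the instruction
// disp_offset:         byte offset of its 32-bit displacement field
static void fixup_rel32(const uint8_t *old_insn, uint8_t *new_insn,
                        size_t insn_len, size_t disp_offset) {
    int32_t disp;
    memcpy(&disp, old_insn + disp_offset, 4);

    // RIP-relative: target = end of instruction + disp32
    intptr_t target   = (intptr_t)(old_insn + insn_len) + disp;
    intptr_t new_disp = target - (intptr_t)(new_insn + insn_len);

    // Must still fit in 32 bits, i.e. the trampoline has to live within
    // +/- 2GiB of the original target.
    int32_t nd = (int32_t)new_disp;
    memcpy(new_insn + disp_offset, &nd, 4);
}
```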

Community
  • 1
  • 1
Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • Yeah, thanks. The issue is executing those instructions, which I was hoping to avoid. But still it's an idea. – MikeF Jul 26 '18 at 20:03
  • I'm wondering, why wouldn't Intel and AMD introduce an instruction for that? Seems like a logical thing to do for debuggers, JIT'ers and similar type software. – MikeF Jul 26 '18 at 20:05
  • @MikeF: anything that *produces* code, like a JIT compiler, has to know the full encoding of every byte anyway, not just how long it is! For debugging, normally you want to fully disassemble an instruction, not just find insn boundaries. See my updated answer. – Peter Cordes Jul 26 '18 at 20:14
  • Well, Peter, there are some use cases that come to mind that don't require full disassembly. For instance, installing a trampoline (for hotpatching a binary function). In that case one doesn't need to know the exact disassembly of the few instructions at the beginning of the function being patched. – MikeF Jul 26 '18 at 20:28
  • @MikeF: ok, good example. But not performance critical: you do that *once* and then the hook / wrapper runs lots of times. If it's not just a wrapper, then you have to know what instructions are present anyway to modify the function's behaviour instead of just adding new behaviour before running all of the function's original instructions. So many bugfix hotpatching cases don't benefit. – Peter Cordes Jul 26 '18 at 20:48
  • 1
    *So if you write a debugger, you can use the CPU's single-stepping capability* - a debugger isn't needed, at least on Windows: a process can handle the exception itself. I've done exactly this more than once - install a VEH (`AddVectoredExceptionHandler`) and set the trace flag for the thread. In the `VectoredHandler`, set `TRACE_FLAG` again and return `EXCEPTION_CONTINUE_EXECUTION` to continue tracing, or don't set the flag and return `EXCEPTION_CONTINUE_SEARCH` to stop self-tracing (the exception is then handled by SEH) – RbMm Jul 27 '18 at 11:27
  • 1
    @MikeF - `For instance, installing a trampoline (for hotpatching a binary function.)` - knowing only the instruction length is not enough. First, there can be a jmp/call at the very beginning of the function - and what if that jmp is only 2 or 5 bytes, say? That changes the instruction stream. There can also be RIP-relative instructions at the very beginning; those can't simply be moved without fixing up the relative (RIP) offset in the instruction (and they can only be moved within a +/- 2GB distance). So simply knowing the instruction length is not enough for your task – RbMm Jul 27 '18 at 11:32
  • 2
    Relevant / Interesting [reading on how to determine instructions boundaries even for privileged/faulting instructions](https://www.blackhat.com/docs/us-17/thursday/us-17-Domas-Breaking-The-x86-Instruction-Set-wp.pdf). – Margaret Bloom Jul 27 '18 at 15:04
  • @RbMm: oh good point, a RIP-relative load of a constant is not rare as the first instruction, and `test/jcc` can often be found in the first 5 bytes. – Peter Cordes Jul 27 '18 at 15:45
  • @MargaretBloom: thanks, that's the same thing Harold mentioned but he didn't give a link. More interesting than I'd realized! I didn't know there were undocumented illegal instructions with length > 1. – Peter Cordes Jul 27 '18 at 15:45
  • Even more frequently there can be a `jmp xxxx` (2 or 5 bytes) at the very first byte of a function – RbMm Jul 27 '18 at 17:45
  • @RbMm Really? To somewhere else inside the same function, or tailcalling somewhere else? I almost never see that in regular functions in gcc / clang output, only in PLT stubs (where it's a 6-byte memory-indirect jmp with an address from the GOT). Is it often a jump into the middle of a loop? Otherwise that just sounds like a silly missed-optimization bug. (Of course, laying out the function with the entry-point in the middle of the loop is possible, so that could count as a missed optimization, too.) – Peter Cordes Jul 27 '18 at 18:36
  • Or does that mostly happen with functions that are already hooked / patched? (So of course you won't see that in compiler output). – Peter Cordes Jul 27 '18 at 19:01
  • This is in original code (not patched), and it's a jump to another function. Sometimes it's just a jmp as the first instruction, sometimes after zeroing an argument first. This is in MS binaries created by *CL.EXE* - for a quick example, several functions: https://prnt.sc/kbvrs3 – RbMm Jul 27 '18 at 19:23
  • Or another example with `jmp qword ptr [xxxxxxxx]`, which is also RIP-relative in x64 - https://prnt.sc/kbvu44 - there really are very many such functions in kernel32.dll and kernelbase.dll, where a dll exports a function whose implementation actually lives in another dll. Interestingly, in some cases this is implemented via a forward export (so no code), but sometimes (and mostly in the latest Windows builds) via such stub functions with a single jmp. Many kernel32 APIs forward to kernelbase or ntdll this way, and kernelbase also forwards to ntdll. On Windows such a jmp is not a rare case – RbMm Jul 27 '18 at 19:30
  • A jump into the middle of a loop - very rare, but it exists too: https://prnt.sc/kbvyhi – RbMm Jul 27 '18 at 19:37
  • One more example - https://prnt.sc/kbw27f - two functions with effectively identical bodies: one with the real code, and another that the linker didn't merge with the first but instead turned into a `jmp xxx` to the first. Not sure why such code gets generated, but it exists – RbMm Jul 27 '18 at 19:47
  • Don't know how else to [post it](https://i.imgur.com/DWLgZ0h.png) here. – MikeF Jul 28 '18 at 01:01
  • @MikeF: If you want to post a big reply-to-comments like that, an edit to the question would be your best bet. Why `jmp rel8` and not `ret`? `jmp rel8` has a *relative* displacement from where the jump was. `ret` or `iret` simply pop into RIP or CS:RIP, and still work if copied somewhere else. – Peter Cordes Jul 28 '18 at 01:44
  • re: your suggested use-case of assisting debuggers: It's pretty marginal; if you can't fully decode all the instructions in some new code, normally you just update your debugger. It *only* helps if the CPU knows more insns than the debugger, not the reverse. (For a debugger rather than a disassembler, yes, the CPU has to know them for you to be debugging this code in the first place. But if you single-step, you will cross the unknown instruction just fine using TF). – Peter Cordes Jul 28 '18 at 01:49
  • @MikeF: *If* it was cheap and natural to provide in hardware, then it would be reasonable. But it's really not. The CPU would need another length-decoder in an execution unit. My answer already has a whole section about why it's not microarchitecturally reasonable or cheap to add an execution unit and dedicate an opcode just for this. And would the insn need a variable-width memory operand (i.e. `m8`) and be heavily micro-coded? Or an XMM/m128 so a debugger could feed it bytes up to the end of a page padded with zeros? – Peter Cordes Jul 28 '18 at 01:52
  • @PeterCordes: I don't see what all the hoopla is about. The processor is already doing it for each and every instruction. This is the most basic operation for it. There's no need to waste any additional execution units. As for how to implement it, heck, it can be `decode rax` where `rax` is the effective address to decode the instruction from. (Basically replacing the internal `RIP`.) There's even no need to waste the R/M field. So `RAX` is updated afterwards to point to the next instruction. And use any reserved opcode for it. I'm sure, Intel is listening to this now :) How about `0F 01 C7`? – MikeF Jul 28 '18 at 04:00
  • @MikeF: Like I explained in my answer, there are serious obstacles to reusing the existing pre-decode instruction-length finding transistors. They're wired into the front-end with their outputs going to the 4 or 5 decoders (that turns instruction bytes into uops in parallel). Introducing a new way to feed them input (without *actually* jumping there and setting RIP) is also an obstacle for hardware design. It would likely take fewer total transistors to replicate instruction-length decoding hardware in an execution unit, and that would avoid competing with the code-fetch for cycles. – Peter Cordes Jul 28 '18 at 13:36
  • @MikeF: Your idea would make sense on a simple non-pipelined CPU where execution of an instruction could activate any existing part of the CPU core and capture the results easily. But that's not even close to how modern CPUs work. http://www.lighterra.com/papers/modernmicroprocessors/, and https://www.realworldtech.com/sandy-bridge/ – Peter Cordes Jul 28 '18 at 13:38