Are more byte sequences valid than not when interpreted by an x86 CPU?

Question

This might seem an odd question, but I think it's relevant to certain x86 binary analysis hurdles. I'm pondering this idea: If I have a binary which reaches out to a remote server for a jump destination, but the server won't tell me the destination until December and I want to analyze the program's behavior in November, could I start jumping into random addresses within the program and reliably guess some correct instruction offsets? Would I potentially be able to "break into" the program's control flow structure and "bypass" the issue of that secret jump destination?

A security researcher, Chris Domas, demonstrated a program he calls Sand Sifter which attempts to fuzz x86 instructions: https://github.com/xoreaxeaxeax/sandsifter

I was thinking if invalid instructions are common, in other words if jumping into the middle of an instruction would often result in soon running into invalid instructions, could we find correct offsets?

Well that's probably far too nebulous to be a valid question, so I decided to simplify it: Roughly, do we have any estimate on what portion of all byte sequences are valid instructions? I suppose another way of asking might be: If I generated 1 million random entropy binary sequences, what percentage of them would have a valid execution path? Even that question is complicated by the length of the binary, since the longer the binary, the more likely to run into an invalid instruction eventually.

Most wrong guesses would lead to "valid" machine code, but nonsensical operations encoded in these instructions. To differentiate these from intended operations you need some level of intelligence. — ecm, Dec 19 '21 at 16:52
Most byte sequences are valid, especially in 32-bit mode (x86-64 removed several opcodes, so there are some 1-byte illegal instructions). The coding-space left for extensions is quite limited! Figuring out whether a start-point for execution / disassembling makes sense or not is more a matter of whether the sequence of instructions look sensible, not whether they're literally invalid. e.g. try `ndisasm -b64 /bin/ls` or whatever. It doesn't know about metadata so will try to disassemble all the bytes as machine code. Often you re-sync with compiler-generated instructions fairly soon. — Peter Cordes, Dec 19 '21 at 16:52
@ecm One thought I had to find bad offsets: If we executed all the way through the program to its termination (assuming we didn't hit an infinite loop), wouldn't the final bytes have a higher probability of being invalid if we're not following intended control flow? Instructions would be cut off half way on the final byte fetched / decoded, no? — J.Todd, Dec 19 '21 at 16:54
Not necessarily, like @Peter Cordes wrote you may at some point happen to get into sync with the intended instructions. Even if not, you need additional metadata to tell eg the end of an executable section, or (again) intelligence to determine that you're actually decoding past the end of a function (such as a table or string data). — ecm, Dec 19 '21 at 16:58
Related: [Is it possible to decode x86-64 instructions in reverse?](https://stackoverflow.com/q/52415761) - not unambiguously, no. Also related re: how crowded the coding space is: [How does an instruction decoder tell the difference between a prefix and a primary opcode?](https://stackoverflow.com/q/68898858). More directly for this Q: [How to tell if a binary sequence is x86 machine code?](https://stackoverflow.com/q/12027405) suggests ideas for heuristics. — Peter Cordes, Dec 19 '21 at 16:59
@PeterCordes "Often you re-sync with compiler-generated instructions fairly soon." That's an extremely interesting phenomenon, I didn't expect that to be the case. Very cool stuff. — J.Todd, Dec 19 '21 at 16:59
@ecm Oh, well how does the CPU normally know when to stop executing before hitting data? I assumed it was always due to an instruction or syscall which terminates execution of that thread. — J.Todd, Dec 19 '21 at 17:01
@J.Todd: You can play around with it yourself, for example you can tell GDB to start decoding from a specific byte offset, `disas /r (_start+2), (_start+100)` or something. In compiler-generated code for a normal program, immediates often have some `00` bytes or are otherwise simple, not full of bytes that would code for a long addressing mode as a ModRM. There are quite a lot of 1-byte x86 instructions and prefixes, too. — Peter Cordes, Dec 19 '21 at 17:05
@J.Todd: The CPU follows the path of execution, taking branches, and the program logic will only branch inside its code (`.text`) section, or to pages it JITs, or to libraries. Disassembly continues past unconditional branches, normally assuming that the next instruction will start right after that. (which is the case for non-obfuscated compiler output). — Peter Cordes, Dec 19 '21 at 17:06

Are more byte sequences valid than not when interpreted by an x86 CPU?

0 Answers0

Linked