x86 32bit Assembly Parser | logical problem

Question

I'm currently working on an Obfuscator for assembled x86 assembly (working with the raw bytes).

To do that I first need to build a simple parser, to "understand" the bytes. I'm using a database that I create for myself mostly with the website: https://defuse.ca/online-x86-assembler.htm

Now my question: Some bytes can be interpreted in two ways, for example (intel syntax):

1. f3 00 00                repz add BYTE PTR [eax],al
2. f3                      repz

My idea was to loop through the bytes and treat each instruction on its own, but when I reach the byte `0xf3` there are two ways to interpret it.

I know there are working x86 disassemblers out there. How do I know which case this is?

Both ways are invalid instructions, so I'm not sure why it matters. A `rep` prefix has to be followed by one of the specific instructions for which it's defined, and `add` isn't one of them. — Nate Eldredge, Sep 06 '21 at 18:21
Related: [How does an instruction decoder tell the difference between a prefix and a primary opcode?](https://stackoverflow.com/q/68898858) — Peter Cordes, Sep 06 '21 at 19:09
Also related: "mandatory prefixes" as part of encoding instructions like SSE2 `movdqa`: [Combining prefixes in SSE](https://stackoverflow.com/q/2404364) — Peter Cordes, Sep 06 '21 at 19:13

score 4 · Answer 1 · answered Sep 06 '21 at 18:27

Prefixes, including the `repz` prefix, are not meaningful without a subsequent instruction. The subsequent instruction may incorporate the prefix (`repz nop` is `pause`), change its meaning (`repz` is `xrelease` if used before certain interlocked instructions), or the prefix may simply be invalid.

The decoding is always unambiguous, otherwise the CPU could not execute instructions. It can be ambiguous only if you don't know the exact byte offset at which to begin decoding (x86 instructions are variable-length).
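
A minimal sketch of the prefix handling in Python, assuming 32-bit mode and only the legacy prefix bytes. The names `LEGACY_PREFIXES` and `split_prefixes` are hypothetical, and the helper only separates prefixes from the opcode; it does not decode the opcode itself:

```python
# Legacy prefix bytes in 32-bit mode.
LEGACY_PREFIXES = frozenset({
    0xF0, 0xF2, 0xF3,                    # lock, repnz, repz
    0x26, 0x2E, 0x36, 0x3E, 0x64, 0x65,  # es, cs, ss, ds, fs, gs overrides
    0x66, 0x67,                          # operand-size, address-size overrides
})


def split_prefixes(code: bytes, offset: int = 0):
    """Return (prefix_bytes, opcode_offset) for the instruction at `offset`.

    Hypothetical helper for illustration: every leading prefix byte is
    consumed, and the first non-prefix byte is treated as the opcode.
    """
    pos = offset
    while pos < len(code) and code[pos] in LEGACY_PREFIXES:
        pos += 1
    if pos == len(code):
        # A prefix with nothing after it is not an instruction on its own.
        raise ValueError("prefix bytes with no opcode following them")
    return code[offset:pos], pos


prefixes, opcode_at = split_prefixes(bytes.fromhex("f30000"))
print(prefixes.hex(), hex(bytes.fromhex("f30000")[opcode_at]))
```

With the bytes `f3 00 00` from the question, `f3` is collected as a prefix and the opcode is the `00` that follows, so case 2 (a lone `f3`) never arises as long as more bytes follow.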

Alex Guteniev

*decoding is always unambiguous* - or at least, any given CPU will pick one way of decoding. Intel's manual says it's "illegal" to have multiple REX prefixes on one instruction, but their Skylake CPUs for example will take the last one ([like with other repeated prefixes](https://stackoverflow.com/questions/43433030/how-did-pentium-iii-cpus-handle-multiple-instruction-prefixes-from-the-same-grou/44366568#44366568)), not #UD fault. There is AFAIK no Intel documentation that says this is what will happen. But yes, they're still REX prefixes, so unambiguous in that sense. – Peter Cordes Sep 06 '21 at 19:08
Finally found the Q&A where I'd tested repeated REX prefixes: [Segmentation fault when using DB (define byte) inside a function](https://stackoverflow.com/a/55642776) – Peter Cordes Sep 06 '21 at 19:28
@PeterCordes Just clarifying, when parsing the subsequent instructions, all you need to do is look for the prefix bytes? To get all the bytes for an instruction, you simply go from the prefix to the next prefix - 1? – Happy Jerry Aug 17 '22 at 15:16
@HappyJerry: Yeah, any number of prefixes can be part of one instruction. The first non-prefix byte is the opcode. (There's a length limit of 15 bytes per instruction, so #UD if you don't get to the end of an instruction before then, even if you've seen opcode + modrm which tell you how many more bytes of disp32 and/or imm32 there are.) – Peter Cordes Aug 17 '22 at 17:36
@PeterCordes So the delimiter for an instruction may look like: `current_position == prefix AND current_position - 1 != prefix`. Once these conditions are met, I could assume that I've reached the end of an instruction? – Happy Jerry Aug 17 '22 at 17:56
@HappyJerry: No, `mov eax, 0xf3f3f3f3` ends with 4 bytes that would be REP prefixes if decoding started there. x86 machine-code is a byte stream that's *not* self-synchronizing; you can't find the "correct" instruction boundaries if you don't know where to start. But given a starting point, you can decode forwards, starting the next decode at the end of the current instruction. Also, many instructions don't have prefixes. – Peter Cordes Aug 17 '22 at 18:00
@PeterCordes So, if we were to begin at `_start`, how are we able to determine the next instructions? – Happy Jerry Aug 18 '22 at 02:37
@HappyJerry: given a start point, you can always determine the end of the instruction, just like the CPU can. The tricky part is that some indirect branches may go to locations you don't know about, or branch backward into bytes that follow an unconditional `jmp`, so obfuscated code might trick a disassembler. Compiler output doesn't do that; it can simply be disassembled from top to bottom, decoding one instruction at the end of the previous. – Peter Cordes Aug 18 '22 at 02:58
This is discussed in lots of existing Q&As, e.g. [How does the CPU decode variable length instructions correctly?](https://stackoverflow.com/q/25129165) / [How does an instruction decoder tell the difference between a prefix and a primary opcode?](https://stackoverflow.com/q/68898858) / [Why does x/i on gdb give different results then disassemble?](https://stackoverflow.com/q/48739930) / [Is there a fundamental aspect of the x86-64 instruction set that would prevent one from fetching "backwards"?](https://stackoverflow.com/q/68550059) – Peter Cordes Aug 18 '22 at 02:58
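
The forward decoding described in these comments can be sketched in Python. This is a toy under stated assumptions, not a real disassembler: it assumes 32-bit mode and understands only `nop` (`90`), `ret` (`c3`) and `mov r32, imm32` (`b8`..`bf`). The names `insn_length` and `linear_sweep` are hypothetical.

```python
MAX_INSN_LEN = 15  # x86 limit on the length of a single instruction

PREFIXES = frozenset({0xF0, 0xF2, 0xF3, 0x26, 0x2E, 0x36, 0x3E,
                      0x64, 0x65, 0x66, 0x67})


def insn_length(code: bytes, start: int) -> int:
    """Length in bytes of the instruction beginning at `start` (toy subset)."""
    pos = start
    operand_size = 4
    # Any number of prefixes may precede the opcode.
    while pos < len(code) and code[pos] in PREFIXES:
        if code[pos] == 0x66:
            operand_size = 2  # operand-size override shrinks the immediate
        pos += 1
    if pos >= len(code):
        raise ValueError("ran out of bytes before reaching an opcode")
    opcode = code[pos]
    pos += 1
    if 0xB8 <= opcode <= 0xBF:
        pos += operand_size  # mov reg, imm: the immediate follows the opcode
    elif opcode not in (0x90, 0xC3):
        raise NotImplementedError(f"opcode {opcode:#04x} not in the toy table")
    if pos > len(code):
        raise ValueError("instruction is truncated")
    if pos - start > MAX_INSN_LEN:
        raise ValueError("instruction longer than 15 bytes")
    return pos - start


def linear_sweep(code: bytes, start: int = 0):
    """Decode forwards; each decode begins where the previous one ended."""
    out = []
    pos = start
    while pos < len(code):
        n = insn_length(code, pos)
        out.append((pos, code[pos:pos + n]))
        pos += n
    return out


code = bytes.fromhex("b8f3f3f3f3c3")  # mov eax, 0xf3f3f3f3 ; ret
for begin in (0, 1):
    print(begin, [(off, raw.hex()) for off, raw in linear_sweep(code, begin)])
```

Starting at offset 0, the sweep yields the 5-byte `mov` followed by `ret`. Starting at offset 1, the same four `f3` bytes are consumed as prefixes on the `ret`, which is why the start point has to be known and cannot be recovered by scanning for prefix bytes.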
