Is it possible for a length disassembler to correctly identify the size of an instruction without supporting extensions such as VEX/EVEX/MVEX/XOP-prefixed instructions?

I am asking because I have written a basic length disassembler, but it only supports the following (x86 and x86_64):

  • General/System Instructions
  • x87 FPU
  • MMX
  • SSE
  • SSE2
  • SSE3
  • SSE4.1
  • SSE4.2
  • VMX
  • SMX

It would take quite a bit of work to incorporate every single extension that Intel and AMD offer, plus checking whether the CPU doing the disassembly actually supports each of those instructions, whereas AFAIK the majority of CPUs support the extensions above.

So are there any cases where the bytes of an unsupported instruction would be interpreted as another instruction that I do support but that has a different size, which would throw off the decoding of all the following instructions?

I could indeed browse through the lengthy Intel and AMD manuals and do some thinking, but if anybody here can give me a straight answer quickly based on their knowledge, I would prefer that. Thank you.

Carol Victor
  • How happy are you with your length disassembler returning the wrong value, then length disassembling gibberish? I think that is the real question. – mevets Feb 06 '21 at 01:55
  • @mevets Not happy, mate, but without knowing for sure (as I am not aware of how the extensions actually work or of their ISA), I figured there was a chance that some of the unsupported instructions would either not disassemble to anything or would disassemble to something of the same size. – Carol Victor Feb 06 '21 at 02:00
  • The unsupported instructions (at least VEX and EVEX) would not disassemble to anything as they use invalid encodings. Assuming you detect those properly you can stop with an error. – Jester Feb 06 '21 at 02:11
  • @CarolVictor Yeah, I thought I was being funny. I just looked at the tables... maybe your project would prefer arm or riscv? – mevets Feb 06 '21 at 02:26
  • Yes, there can be cases. If you don't know the right byte to start decoding from, x86 machine code is a byte-stream that's not self-synchronizing (vs. something like UTF-8 which is). For an instruction you don't recognize, you won't know what length to assume. Real-world disassemblers typically pick 1 byte (there are some invalid 1-byte opcodes in 64-bit mode), but obviously that's wrong for the `c4` or `c5` byte that starts a VEX prefix. – Peter Cordes Feb 06 '21 at 02:27
  • You don't need CPU support for software decoding / disassembly of an instruction. e.g. I can assemble / disassemble AVX-512 instructions just fine on Skylake, or on an old Core2 for that matter, or even on an ARM CPU. Unless that checking is somehow required for your use-case to figure out how the current CPU would decode those bytes? – Peter Cordes Feb 06 '21 at 09:31
  • @PeterCordes I do know that is how most disassemblers operate, but I personally cannot understand why you would decode an instruction that your machine cannot run. It will give you inaccurate results as to what your CPU is actually going to do with the data you feed it, so I see that check as a necessity. – Carol Victor Feb 06 '21 at 09:37
  • If you're building an AVX-512 binary that you plan to copy to a different machine and run it there, IDK why you'd want your local tools to choke on it. A disassembler might want to mark SSE4.1, AVX1 / AVX2 / AVX-512 or whatever in comments on each instruction that wasn't baseline i386 or baseline x86-64, but it would make zero sense to me to just decode it as "bad" with unknown length if you're using an old (or non-x86!) machine to look at binaries from or intended for a different machine. e.g. glibc contains AVX2 code in `memcmp` that it only runs on AVX2 CPUs, but it's still there. – Peter Cordes Feb 06 '21 at 09:52
  • You're right, I did not take into account anything other than my own situation. I don't need it for analysis of binaries, instead I will use it for code modification and dynamic analysis that must be done relative to the machine it runs on. – Carol Victor Feb 06 '21 at 10:14
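
As a rough illustration of the detection the comments above describe, here is a minimal Python sketch that classifies the first two bytes as a VEX, EVEX or XOP prefix so that a length disassembler can stop with an error instead of mis-decoding. The function name `special_prefix` and the kind labels are hypothetical, and the sketch assumes protected or long mode only (it does not model real or virtual-8086 mode, and MVEX shares the `62` escape with EVEX so it is reported as EVEX here).

```python
def special_prefix(code: bytes, mode64: bool):
    """Return (kind, prefix_length) if `code` starts with a VEX/EVEX/XOP
    prefix, else None. Names and labels are illustrative only."""
    if len(code) < 2:
        return None
    b0, b1 = code[0], code[1]

    # Outside 64-bit mode, C4/C5/62 are LES/LDS/BOUND, which only accept a
    # memory operand (ModRM.mod != 3). The prefix forms reuse the otherwise
    # invalid mod == 3 pattern. In 64-bit mode those legacy opcodes are
    # invalid, so the bytes are always prefixes.
    top_bits_set = (b1 & 0xC0) == 0xC0
    if b0 == 0xC5 and (mode64 or top_bits_set):
        return ("VEX2", 2)   # C5 + 1 payload byte
    if b0 == 0xC4 and (mode64 or top_bits_set):
        return ("VEX3", 3)   # C4 + 2 payload bytes
    if b0 == 0x62 and (mode64 or top_bits_set):
        return ("EVEX", 4)   # 62 + 3 payload bytes

    # XOP shares 8F with POP r/m (8F /0). XOP map numbers (low 5 bits of
    # the second byte) are always >= 8, which makes ModRM.reg nonzero and
    # therefore never a valid POP encoding.
    if b0 == 0x8F and (b1 & 0x1F) >= 8:
        return ("XOP", 3)
    return None
```

For example, `special_prefix(bytes([0xC5, 0xF8, 0x77]), mode64=True)` reports a 2-byte VEX prefix for `vzeroupper`, while `bytes([0xC5, 0x08])` in 32-bit mode is left to the legacy decoder as `lds`. The returned length covers only the prefix; the opcode, ModRM and any immediate still have to be sized separately.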

1 Answer

A disassembler for x86-64 that doesn't recognize *VEX encodings is scrap; I think the answer to your question is negative.

Luckily, dissecting the new prefixes is not that difficult (see AVX512 Chapter 4), especially if you have already managed to disassemble the SSE* instruction sets.

The Netwide Disassembler can help you a lot (though it suffers from some bugs). You can use it to verify whether your product works correctly. Per aspera ad astra!

vitsoft
  • It's not detecting and decoding the prefixes themselves that's hard, it's getting the length exactly right for all the multitude of opcodes that can be encoded with them. Some take an immediate and some don't. Is there any consistent pattern to immediate-or-not that Intel built into VEX/EVEX-prefixed opcodes to make length-decoding easier? At least for opcodes that aren't versions of legacy SSE instructions that you'd already support anyway; I assume those are irregular and have to be handled on a case-by-case basis, unless patterns date back to P5-MMX / Pentium III days. – Peter Cordes Feb 06 '21 at 09:04
  • I have read up on the VEX, EVEX and MVEX prefixes and I have to say, it is a big mess. It entangles with everything, up to the point where they give new purpose to some of the ModR/M fields (unless I misunderstood), create a whole new SIB structure, and change the meaning of some of the prefix's fields based on the number of operands. – Carol Victor Feb 06 '21 at 09:35
  • @CarolVictor: VEX / EVEX have an extra bit to extend some ModRM/SIB fields the same way REX prefixes can, that's nothing new. (Having a whole separate 3rd operand is new, but it's only a register). But that doesn't change instruction *length* decoding; you still only need to check the opcode and standard modrm (to see if it implies a sib and/or disp8/32). REX for more registers was carefully designed *not* to change length-decoding of ModRM: [Neither rbp nor r13 can be a base with no disp](https://stackoverflow.com/q/52522544). VEX/EVEX just provides the same extra bits for base / idx. – Peter Cordes Feb 07 '21 at 01:48
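
To make the last comment concrete, here is a minimal Python sketch of sizing the ModRM byte plus any SIB byte and displacement it implies. The function name `modrm_tail_length` is hypothetical, and the sketch assumes 32-bit or 64-bit addressing only (16-bit addressing uses different rules and is not handled). It reads only the low three bits of the `rm` and SIB base fields, so the extension bits supplied by REX, VEX or EVEX never enter the calculation.

```python
def modrm_tail_length(code: bytes) -> int:
    """Number of bytes taken by ModRM + optional SIB + displacement,
    where code[0] is the ModRM byte (32/64-bit addressing assumed)."""
    if not code:
        raise ValueError("need at least the ModRM byte")
    modrm = code[0]
    mod = modrm >> 6
    rm = modrm & 7
    length = 1                      # the ModRM byte itself

    if mod == 3:
        return length               # register operand: no SIB, no disp

    if rm == 4:                     # rm == 100b means a SIB byte follows
        if len(code) < 2:
            raise ValueError("SIB byte missing")
        length += 1
        if mod == 0 and (code[1] & 7) == 5:
            length += 4             # SIB with no base register: disp32
    elif mod == 0 and rm == 5:
        length += 4                 # disp32 (RIP-relative in 64-bit mode)

    if mod == 1:
        length += 1                 # disp8
    elif mod == 2:
        length += 4                 # disp32
    return length
```

For instance, `modrm_tail_length(bytes([0x45, 0x00]))` gives 2 for a `[rbp+disp8]` operand, and the same two bytes give the same result when a prefix bit turns the base into `r13`, which is the property the linked question is about. EVEX scales its disp8 by an operand-dependent factor, but it is still one byte, so the length is unchanged.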