14

I'm interested in writing an x86 assembler. I'm wondering what is a good way to map x86 assembly mnemonic instructions (using an Intel-like syntax) into the corresponding binary machine code instructions.

mudgen
  • also read http://stackoverflow.com/questions/2546715/how-to-analysis-how-many-bytes-each-instruction-takes-in-assembly/2761248#2761248 – claws May 11 '10 at 19:48

3 Answers

11

Do you want to understand the physical mapping of mnemonics to machine code? If so, volumes 2A and 2B of the Intel 64 and IA-32 Architectures Software Developer's Manual describe the binary format of x86 machine code.

The x86 instruction set page on Wikipedia has a compact listing of all the instructions categorized by when they were introduced, which might help you prioritize what to implement first.

However, if you are asking how to parse an assembly source file to the point where your program can start writing out machine code, then you basically need to understand how to write a compiler. The tools lex and yacc are good places to start, but if you don't know how to build a compiler you'll also need a book. I think the Dragon book is the best one out there, but there are any number of others you could use; SO has plenty of recommendations.
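As a rough illustration of the parsing step, here is a minimal sketch in Python that splits one line of Intel-like syntax into a label, a mnemonic and operands. It uses a hand-written regular expression rather than lex and yacc, and the names `Line` and `parse_line` are made up for this example. It assumes `;` starts a comment and that operands contain no commas of their own.

```python
import re
from typing import List, NamedTuple, Optional


class Line(NamedTuple):
    label: Optional[str]
    mnemonic: Optional[str]
    operands: List[str]


# Every part is optional, so blank lines and comment-only lines also match.
_LINE_RE = re.compile(
    r"""
    ^\s*
    (?:(?P<label>[A-Za-z_.][\w.]*):)?   # optional label followed by a colon
    \s*
    (?P<mnemonic>[A-Za-z]\w*)?          # optional mnemonic
    \s*
    (?P<operands>[^;]*)                 # everything up to a comment
    """,
    re.VERBOSE,
)


def parse_line(text: str) -> Line:
    """Split one source line into (label, mnemonic, operands)."""
    m = _LINE_RE.match(text)
    mnemonic = m.group("mnemonic")
    raw = m.group("operands").strip()
    # Operands are separated by commas; normalise case for later lookups.
    operands = [op.strip().lower() for op in raw.split(",")] if raw else []
    return Line(m.group("label"), mnemonic.lower() if mnemonic else None, operands)


print(parse_line("start: mov eax, [ebx] ; load a dword"))
```

A real assembler would go on to classify each operand (register, immediate, memory reference) before choosing an encoding; this sketch stops at the textual split.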

P̲̳x͓L̳
Andrew O'Reilly
  • You may not need a full-fledged compiler for this. You need a simple two-pass assembler with some sort of lookup table. You may not always generate the best code that way, but you'll get something that works. – Nathan Fellman May 04 '10 at 07:57
  • @Nathan: An assembler *is* a full-fledged compiler. Keep in mind, a compiler is just a translator from some other language to opcodes. If you count the entire process of translation, an assembler and a compiler end up doing exactly the same thing. It just happens that most other languages are complex enough, and assembly language simple enough, that assembly serves as a decent midpoint -- so lots of compilers separate the translation into two phases: translate to assembly, and have an assembler do the grunt work of actually generating opcodes. An assembler typically wouldn't benefit from this. – cHao Sep 11 '12 at 14:31
  • @cHao: Thanks for the clarification. That's an interesting point :-) – Nathan Fellman Sep 11 '12 at 18:49
  • 3
    The problem with all Intel opcode references is that they do not provide a single freakin' example of how a particular assembly statement is mapped to machine code. It's very easy to get lost in that ModR/M and SIB byte mess... – Calmarius Dec 19 '12 at 19:18
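The two-pass idea from the comments can be sketched roughly like this. It is only an illustration under strong assumptions: the source is already split into one label or instruction per line, the instruction set is limited to `nop` (`90`), `ret` (`C3`) and a jump that always uses the short form `EB rel8`, and the names `SIZES` and `assemble` are invented for the example.

```python
# Hypothetical mini instruction set: mnemonic -> encoded size in bytes.
SIZES = {"nop": 1, "ret": 1, "jmp": 2}  # "jmp" is always the short form here


def assemble(lines):
    """Assemble a list of lines such as 'start:', 'nop', 'jmp start'."""
    # Pass 1: walk the program once to learn the address of every label.
    labels, addr = {}, 0
    for line in lines:
        if line.endswith(":"):
            labels[line[:-1]] = addr
        else:
            addr += SIZES[line.split()[0]]

    # Pass 2: emit bytes, now that forward references can be resolved.
    out = bytearray()
    for line in lines:
        if line.endswith(":"):
            continue
        parts = line.split()
        if parts[0] == "nop":
            out.append(0x90)
        elif parts[0] == "ret":
            out.append(0xC3)
        elif parts[0] == "jmp":
            # rel8 is measured from the end of the 2-byte jump instruction.
            rel = labels[parts[1]] - (len(out) + 2)
            if not -128 <= rel <= 127:
                raise ValueError("target out of range for a short jump")
            out += bytes([0xEB, rel & 0xFF])
        else:
            raise ValueError(f"unknown mnemonic: {parts[0]}")
    return bytes(out)


program = ["start:", "nop", "jmp end", "nop", "end:", "ret", "jmp start"]
print(assemble(program).hex(" "))  # 90 eb 01 90 c3 eb f9
```

Always using the short jump keeps instruction sizes fixed, which is what lets pass 1 compute addresses without knowing the targets. An assembler that also supports the longer near form has to decide between the two, which is where things get more involved.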
5

For x86, it's complicated as hell. A little less complicated since 32-bit processors took over, but yeah. Still a pain.

You may want to take a look at nasm ( http://www.nasm.us ). It's an open source x86 assembler. See how they do it. Or, use it instead. :)

cHao
1

It's just a straight-up one-to-one mapping; the Intel documentation describes all of the instructions and their encodings. You'll need to build a giant lookup table or something equivalent to do the matching and code generation.
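A minimal sketch of what such a lookup table might look like, in Python. The key here is the mnemonic together with the kinds of its operands, since the mnemonic alone does not always pick out one encoding. Only a handful of 32-bit forms are listed, and the names `TABLE`, `kind` and `encode` are hypothetical. It also assumes that any operand that is not a register name is an integer literal.

```python
import struct

# Register numbers as used in x86 instruction encodings.
REG32 = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
         "esp": 4, "ebp": 5, "esi": 6, "edi": 7}


def kind(operand):
    """Classify an operand; anything that is not a register is assumed to be an immediate."""
    return "r32" if operand in REG32 else "imm32"


# (mnemonic, operand kinds) -> (base opcode, register number added to opcode?)
TABLE = {
    ("nop", ()): (0x90, False),
    ("ret", ()): (0xC3, False),
    ("push", ("r32",)): (0x50, True),          # 50+rd
    ("pop", ("r32",)): (0x58, True),           # 58+rd
    ("mov", ("r32", "imm32")): (0xB8, True),   # B8+rd, then a 32-bit immediate
}


def encode(mnemonic, *operands):
    key = (mnemonic, tuple(kind(o) for o in operands))
    if key not in TABLE:
        raise ValueError(f"no encoding listed for {key}")
    opcode, plus_reg = TABLE[key]
    if plus_reg:
        opcode += REG32[operands[0]]
    out = bytearray([opcode])
    for o in operands:
        if kind(o) == "imm32":
            # Immediates are stored little-endian.
            out += struct.pack("<I", int(o, 0) & 0xFFFFFFFF)
    return bytes(out)


print(encode("mov", "ecx", "0x12345678").hex(" "))  # b9 78 56 34 12
```

Forms that need a ModR/M byte, prefixes or memory operands would need more columns in the table than this sketch has.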

Carl Norum
  • 9
    Something tells me you never looked at x86 encoding. A single mnemonic can correspond to multiple opcodes, each opcode can have many prefixes, size overrides... and I'm sure I'm missing some stuff. – Bahbar May 03 '10 at 21:10
  • I write x86 assembly code every day. It has to be one-to-one, otherwise how do you know which opcode gets emitted for which instruction you wrote? Just because there are prefixes, special modifiers, memory-access or register versions, etc. doesn't change the fact that for each instruction you write in the assembly file you have to know what machine instruction gets emitted.... – Carl Norum May 03 '10 at 22:07
  • I take that back; it could be many-to-one, if you want to have multiple mnemonics generate the same machine instruction. It can't be one-to-many, though, unless you built some kind of context sensitivity into the assembler. The first case is unnecessary work, and the second case seems like a bad idea in general, so I'll let my answer stand. – Carl Norum May 03 '10 at 22:09
  • Look at this answer for examples of mappings that are one-to-many: http://stackoverflow.com/questions/2546715/how-to-analysis-how-many-bytes-each-instruction-takes-in-assembly/2761248#2761248 – Nathan Fellman May 04 '10 at 07:56
  • That answer talks about the context sensitivity I mentioned above. Point well taken, assembler directives are often used to handle such mappings. That said, the information about what instruction will be emitted is still available to the programmer. – Carl Norum May 04 '10 at 14:51
  • 4
    Well, if you want more examples that don't have prefixes, mov eax, [ebx] and mov [eax], ebx don't use the same opcode (89 and 8b, I believe). The x86 encoding is really not a 1-1 mapping with _mnemonics_. Yes, the assembler has all the data it needs to generate the machine code from the source. Not just from mnemonics is all I was saying. – Bahbar May 04 '10 at 18:45
  • 1
    A big lookup table is probably the way to go these days. Back in the 1980s I'm sure there was a much fancier method used that took advantage of particular patterns to save space. Because 640kB is not all that much room. Of course they only had to deal with 16 bit instructions... – Zan Lynx Dec 24 '14 at 02:18
  • 1
    @Bahbar `mov eax, [ebx]` is `8b 03` and `mov [eax], ebx` is `89 18`. That may look different, but if you actually look at them in binary/octal **it's the same opcode**: `213 003` vs `211 030`. The first byte contains the 6-bit opcode (100010b, whose top five bits give the shared `21` octal prefix) + the direction bit (d = 1: move into the register; d = 0: the reverse) + w = 1 (32-bit operation). The next byte has mod = 0 (no displacement) in the first 2 bits and the 2 registers swapped (03 vs 30) in the remaining 6 bits – phuclv Feb 24 '18 at 08:26
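The two encodings discussed in these comments can be reproduced with a short Python sketch. It assumes 32-bit mode and covers only the simplest memory form, `[reg]` with no displacement. The function names are invented for the example, and `esp` and `ebp` are rejected as base registers because those r/m values select the SIB byte and the disp32 form.

```python
REG32 = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
         "esp": 4, "ebp": 5, "esi": 6, "edi": 7}


def modrm(mod, reg, rm):
    """Pack the ModR/M byte: mod in bits 7-6, reg in bits 5-3, r/m in bits 2-0."""
    return (mod << 6) | (reg << 3) | rm


def mov_reg_mem(reg, base, to_register):
    """Encode 'mov reg, [base]' (to_register=True) or 'mov [base], reg' (False)."""
    if base in ("esp", "ebp"):
        raise ValueError("this sketch does not handle SIB or disp32 forms")
    # Opcode layout: 100010 d w, with w = 1 for a 32-bit operand.
    opcode = 0b10001000 | (int(to_register) << 1) | 1
    return bytes([opcode, modrm(0b00, REG32[reg], REG32[base])])


def octal(encoding):
    return " ".join(format(b, "03o") for b in encoding)


load = mov_reg_mem("eax", "ebx", to_register=True)    # mov eax, [ebx]
store = mov_reg_mem("ebx", "eax", to_register=False)  # mov [eax], ebx
print(load.hex(" "), "=", octal(load))    # 8b 03 = 213 003
print(store.hex(" "), "=", octal(store))  # 89 18 = 211 030
```

Octal lines up with the 2-3-3 bit split of the ModR/M byte, which is why the register swap shows up as `003` versus `030`.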