I'm trying to write a small 8086+ assembler, probably real mode only, and can settle for a large subset of the possible instructions.

The x86 instruction set is complex and requires a complex table-driven solution, which is fine, but I want something smaller and simpler.

One of my ideas is to start with the opcodes and make an alternative set of mnemonics/addressing modes/registers that more closely relates to the actual machine instructions.

Has this been done, and where can I read about it? My gut feeling says that this must have been done already, but I can't find anything online.

Things I already looked into:

  • AT&T syntax: does not solve the problem, you still need a complex table lookup; in the end it is basically the same as Intel syntax.

  • CRASM512.ASM: a cool 512-byte trick assembler. Very impressive, but not usable (and not meant to be). The syntax is still based on Intel, as well.

  • Using only a subset of "homogeneously encoded" instructions. This is what I'm currently trying, with a smaller and less complex table-driven approach than a full-fledged x86 assembler.

    The problem is that I still need to check for invalid instructions, and x86 is complex enough that I can only make the table-driven approach a little bit simpler, not simple. So it is 90% of the complexity for 10% of the result, because it's mostly the tables that change compared to the real deal.

  • 8086 assembler is easy compared to most platforms (and other languages), so what is your goal in doing this? If you just want to learn how, then I suggest building a simplified subset of a 8086 assembler and just keep expanding its capabilities until you have a full 8086 assembler. I was able to write a PIC16F assembler in about 1000 lines of Python (around 2008), and a 80286 assembler in about 2500 lines of C (around 2001), so these aren't large by any stretch. Are you trying to implement this in 8086 assembler? Even so, it should be fairly straightforward if you organize the code well. – Matt Jordan Apr 29 '16 at 20:16
  • Yes, a regular x86 assembler is straightforward (using a table-driven approach) but that has already been done. I'd like to make mine MUCH smaller while still being usable/useful, so I'm looking for any corners to cut. Anyway, looking at the x86 instructions vs the mnemonics, the question about an alternative mnemonics set (etc) just begs to be asked. – Jonathan J. Bloggs Apr 29 '16 at 20:20
  • Ok, so you want to get closer to the metal than ... assembler? You realize it is bare metal, right? It is a representation of the numeric instruction encoding, which is as close as you can get. Maybe an example of what you want to be able to represent would help clarify this? – Matt Jordan Apr 29 '16 at 20:21
  • No, I don't want to get closer to the metal than assembler, I want an assembler syntax/mnemonics set that more easily maps from source code to machine code. Or rather, I want to read the "prior art". (This is not a commercial project, obviously!) – Jonathan J. Bloggs Apr 29 '16 at 20:25
  • I also don't understand, PIC16F instruction encoding seems to be much more simple than x86. – Jonathan J. Bloggs Apr 29 '16 at 20:27
  • https://classes.soe.ucsc.edu/cmpe012/Summer08/notes/07_LC3_Assembly.pdf – Jose Manuel Abarca Rodríguez Apr 29 '16 at 20:29
  • What is your actual question? – Ross Ridge Apr 29 '16 at 20:30
  • The question went away during one of the edits. It was, paraphrasing: is there an assembler out there with alternative syntax, one that reflects the operation encoding more closely than either Intel or AT&T, e. g. with mnemonics that correspond one-to-one to opcodes? The OP is after an assembler that works as a series of table lookups. – Seva Alekseyev Apr 29 '16 at 20:34
  • @RossRidge Sorry, I did edit it out, added it back. Seva-Alexeyev: thanks, but not really, I'm after an **assembler syntax** that lets me write an assembler that DOESN'T use table lookups. Doesn't have to be one-to-one though; that is impossible. But less complex. I have an idea how it should look, but my immediate thought was that somebody clever has probably already done this, and probably better than I could. – Jonathan J. Bloggs Apr 29 '16 at 20:43
  • @JoseManuelAbarcaRodríguez thanks, but that is for a simulated LC3 processor, not x86. – Jonathan J. Bloggs Apr 29 '16 at 20:48
  • You said you wanted a simpler alternative. – Jose Manuel Abarca Rodríguez Apr 29 '16 at 20:49
  • Assembly language without table lookups is machine language. You need to look up at least the instruction mnemonics in a table to find their opcodes. – Ross Ridge Apr 29 '16 at 20:49
  • @RossRidge Yes, I have written assemblers before. For many processors, you don't need such complex table searches as for the x86, because they are simple. Take DSP processors for instance, you often can translate the mnemonic straight into an actual opcode byte (or nibble). On the x86, you need a complex set of table searching that almost forms a mini-DSL. A lot of that seems to stem from the mnemonics and addressing modes being somewhat "abstract" compared to some simpler machines. I don't think it is controversial that x86 instruction set is complex. – Jonathan J. Bloggs Apr 29 '16 at 20:56
  • I obviously do not mean that no lookup tables be in the program. I mean that I want to reduce a lot of the lookup logic by using an alternative instruction set abstraction (mnemonics, addressing modes, etc) that is simpler to assemble than the common Intel abstraction. A comparison: AT&T, and most 68k assemblers, place the burden on the programmer to specify operand width by a suffix on every instruction (movw $1, %ax), whereas Intel/MASM abstracts that away (mov ax, 1), so the assembler has to deal with it. That is a step in my direction, but I want to take it to its logical extreme. – Jonathan J. Bloggs Apr 29 '16 at 21:05

1 Answer

y86 is a vastly over-simplified architecture (for teaching purposes), but it implements one of your ideas: instead of having a zillion different forms of mov that do fundamentally different things, it has different mnemonics for the three different mov-like opcodes it supports:

  • irmovl V, %rB: immediate -> reg
  • rmmovl %rA, D(%rB): reg -> memory (store)
  • mrmovl D(%rB), %rA: memory -> reg (load)

This is an AT&T-syntax flavour of y86, where the destination goes 2nd. AT&T syntax uses % and $ decorations to avoid confusion between reg names and symbols. IDK if that makes a parser smaller or larger.
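
To show why this is easy to assemble, here is a minimal Python sketch of the three moves. It assumes the 32-bit Y86 encoding as described in the CS:APP textbook (opcode bytes 30, 40 and 50, one register-pair byte, then a 4-byte little-endian constant); the function names and the `REGS` table are made up for illustration, not taken from any real tool.

```python
import struct

# Register numbers in the (assumed) CS:APP 32-bit Y86 encoding.
REGS = {"%eax": 0, "%ecx": 1, "%edx": 2, "%ebx": 3,
        "%esp": 4, "%ebp": 5, "%esi": 6, "%edi": 7}
NO_REG = 0xF  # nibble used when a register slot is unused


def _const32(value):
    """4-byte little-endian constant, wrapping negatives to two's complement."""
    return struct.pack("<I", value & 0xFFFFFFFF)


def irmovl(value, rb):
    """irmovl V, %rB: immediate -> reg."""
    return bytes([0x30, (NO_REG << 4) | REGS[rb]]) + _const32(value)


def rmmovl(ra, disp, rb):
    """rmmovl %rA, D(%rB): reg -> memory (store)."""
    return bytes([0x40, (REGS[ra] << 4) | REGS[rb]]) + _const32(disp)


def mrmovl(disp, rb, ra):
    """mrmovl D(%rB), %rA: memory -> reg (load)."""
    return bytes([0x50, (REGS[ra] << 4) | REGS[rb]]) + _const32(disp)
```

Each mnemonic fixes the opcode byte and the operand layout on its own, so there is no operand-type analysis at all: the parser only has to look up register names and parse numbers.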


Applying this idea to x86, you could use different mnemonics for different forms of the same instruction.

If you care more about easy-to-parse than human-readability and similarity with existing asm syntax, then you could always have operands listed in the order of encoding in the mod/rm byte. e.g.

addbir  al, 5    ; b = byte, i = immediate, r = register.  opcode 80 /0 with al encoded in the mod/rm byte, imm8
addbia  al, 5    ; a = ax/al:  opcode 04 imm8

; w=word, m=memory
addwrm  cx, 0, bx,    ; add cx, [0 + bx + (no index)]   encoding: 03 mod/rm
addwmr  cx, 0, , si   ; add [0 + (no base) + si], cx   encoding: 01 mod/rm

Note the last two lines: the first operand is always the "r" in the mod/rm byte, rather than the destination. It's sort of a text representation of the instruction encoding, not a human-usable syntax. I think this is the sort of idea you were aiming for?
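
As a rough sketch of how little logic such a syntax needs, here is a Python encoder for just the four made-up mnemonics above, using 16-bit addressing. Everything here (`assemble`, `mem_operand`, the table names) is hypothetical; it covers only these four forms and does no error reporting beyond a `KeyError`/`ValueError`.

```python
REG8 = {"al": 0, "cl": 1, "dl": 2, "bl": 3, "ah": 4, "ch": 5, "dh": 6, "bh": 7}
REG16 = {"ax": 0, "cx": 1, "dx": 2, "bx": 3, "sp": 4, "bp": 5, "si": 6, "di": 7}
# (base, index) -> r/m field for 16-bit addressing modes
RM16 = {("bx", "si"): 0, ("bx", "di"): 1, ("bp", "si"): 2, ("bp", "di"): 3,
        ("", "si"): 4, ("", "di"): 5, ("bp", ""): 6, ("bx", ""): 7}


def modrm(mod, reg, rm):
    return (mod << 6) | (reg << 3) | rm


def mem_operand(reg, disp, base, index):
    """mod/rm byte plus displacement bytes for [disp + base + index]."""
    rm = RM16[(base, index)]
    # mod=00 with r/m=110 means "disp16 only", so plain [bp] needs a disp8 of 0.
    if disp == 0 and (base, index) != ("bp", ""):
        return bytes([modrm(0, reg, rm)])
    if -128 <= disp <= 127:
        return bytes([modrm(1, reg, rm), disp & 0xFF])
    return bytes([modrm(2, reg, rm)]) + (disp & 0xFFFF).to_bytes(2, "little")


def assemble(line):
    text = line.split(";", 1)[0].strip()          # drop the comment
    mnemonic, _, rest = text.partition(" ")
    ops = [o.strip() for o in rest.split(",")]
    if mnemonic == "addbia":                       # 04 ib, AL implied by opcode
        return bytes([0x04, int(ops[1], 0) & 0xFF])
    if mnemonic == "addbir":                       # 80 /0 ib, register in r/m
        return bytes([0x80, modrm(3, 0, REG8[ops[0]]), int(ops[1], 0) & 0xFF])
    if mnemonic in ("addwrm", "addwmr"):           # 03 /r and 01 /r
        opcode = 0x03 if mnemonic == "addwrm" else 0x01
        reg, disp, base, index = ops               # always in encoding order
        return bytes([opcode]) + mem_operand(REG16[reg], int(disp, 0), base, index)
    raise ValueError("unknown mnemonic: " + mnemonic)


print(assemble("addwrm  cx, 0, bx,").hex())    # 030f
print(assemble("addwmr  cx, 0, , si").hex())   # 010c
```

The two memory forms share one code path because the operands are always written in encoding order; only the opcode byte differs.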

Depending on how smart you want the assembler to be, you could have it pick between the imm8 and imm16 forms of immediate instructions. For disp8, disp16, or no displacement memory encodings, it might be easier to require a 0 instead of an empty entry.
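
For the immediate case, the choice could look something like this Python sketch for a 16-bit `add reg, imm`. It assumes the standard `83 /0 ib` (sign-extended imm8) and `81 /0 iw` encodings; `add_word_imm` is a hypothetical helper name.

```python
REG16 = {"ax": 0, "cx": 1, "dx": 2, "bx": 3, "sp": 4, "bp": 5, "si": 6, "di": 7}


def add_word_imm(reg, value):
    """Encode add r16, imm, preferring the sign-extended imm8 form."""
    modrm = 0xC0 | REG16[reg]                        # mod=11, /0 = ADD, reg in r/m
    signed = ((value & 0xFFFF) ^ 0x8000) - 0x8000    # view value as signed 16-bit
    if -128 <= signed <= 127:
        return bytes([0x83, modrm, signed & 0xFF])   # 83 /0 ib
    return bytes([0x81, modrm]) + (value & 0xFFFF).to_bytes(2, "little")  # 81 /0 iw
```

If you'd rather keep the assembler dumb, drop the range check and give the two forms separate mnemonics, in the same spirit as the `addbir` / `addbia` split.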


Normally everyone wants a smart assembler that picks the best encoding for you (e.g. use the EAX-specific opcode that doesn't use a mod/rm byte). Especially for x86-64, avoiding REX prefixes when not necessary, or optimizing mov rax, 0x1234 into mov eax, 0x1234, is nice.
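
For comparison, this is roughly the kind of decision a "smart" assembler makes for that `mov rax, imm` case, as a Python sketch. It only handles rax..rdi (no REX.B registers), and `mov_r64_imm` is a hypothetical name; the three encodings are the usual `B8+r id`, `REX.W C7 /0 id` and `REX.W B8+r io`.

```python
REG64 = {"rax": 0, "rcx": 1, "rdx": 2, "rbx": 3,
         "rsp": 4, "rbp": 5, "rsi": 6, "rdi": 7}


def mov_r64_imm(reg, value):
    """Shortest encoding of mov r64, imm for the low eight registers."""
    r = REG64[reg]
    v = value & 0xFFFFFFFFFFFFFFFF           # wrap to an unsigned 64-bit pattern
    if v < 2**32:
        # mov r32, imm32 zero-extends into the full register: 5 bytes, no REX.
        return bytes([0xB8 + r]) + v.to_bytes(4, "little")
    if v >= 2**64 - 2**31:
        # Fits as a sign-extended imm32: REX.W C7 /0 id, 7 bytes.
        return bytes([0x48, 0xC7, 0xC0 | r]) + (v & 0xFFFFFFFF).to_bytes(4, "little")
    # Otherwise the full 64-bit immediate: REX.W B8+r io, 10 bytes.
    return bytes([0x48, 0xB8 + r]) + v.to_bytes(8, "little")
```

This is exactly the logic that an encoding-per-mnemonic syntax pushes back onto the programmer, which is the trade-off between a tiny assembler and a convenient one.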

There would certainly be value in using different mnemonics for loads vs. mov-immediate, because that's a common source of confusion for asm beginners. (esp. since MASM and NASM syntax differ on what mov reg, symbol means).

Peter Cordes
  • This is exactly what I've been thinking, guess I'll have to spend some time finding patterns in the instruction encoding and figure out something clever. y86 seems like a great resource, never heard of it before, thanks! – Jonathan J. Bloggs Apr 29 '16 at 22:11
  • @JonathanJ.Bloggs: y86 is too over-simplified for anything but toy examples of baby-steps intro-to-asm classes. It doesn't even have multiply, divide, or even shift instructions (other than `add same,same` to left shift), so many things are impossible to implement efficiently. Some versions of it have `cmov`, [so you can at least emulate `setcc`](http://stackoverflow.com/questions/36585746/the-most-efficient-way-of-counting-positive-negative-and-zero-number-using-loop/36587614#36587614). It only has add,sub,and,xor, and no unsigned branch conditions (only signed). – Peter Cordes Apr 29 '16 at 22:25
  • Since two weeks have passed, I don't know if I should accept your answer; your answer, while helpful, mostly states things I had already thought of (and which are sort of implied in the question). So that's why I haven't yet marked your answer as "accepted". But thanks anyway! I only briefly glanced at the y86 docs, now I see you are right, it is much too basic. – Jonathan J. Bloggs May 08 '16 at 23:57
  • @JonathanJ.Bloggs: that's fine. If you don't end up using my idea for using mnemonics that are different for each encoding, then it didn't really answer your question, so you shouldn't mark it as accepted. The one thing that existing asm dialects for x86 don't do is provide a way (other than `db` directives) to disassemble into something that will assemble back to exactly the same bytes, not just something that will (usually) run the same. This is where I was going with my design, since that's a potentially useful thing to create. – Peter Cordes May 09 '16 at 00:04
  • Very good point about exact reassemblability, you'd think some brainiac would have thought of that by now and done the work. Like, uh, somebody at Intel :) – Jonathan J. Bloggs May 10 '16 at 00:13
  • @JonathanJ.Bloggs: I think ARM asm has syntax for specifying which encoding, when there's a choice. I haven't heard of anything like that for x86. Obviously x86 has a lot more choice in most cases. (e.g. disp8/disp32, rel8 vs. rel32 jumps, imm8 vs. imm32, `op r/m32, r32` vs. `op r32, r/m32` for most reg-reg instructions, ...) In most cases, just the instruction size matters, like for the PLT for shared libs where the lazy dynamic linking rewrites the `jmp` to the final destination after finding the address. – Peter Cordes May 10 '16 at 00:24
  • yeah, just meant that by now you'd think somebody at Intel would have created a (more logical) syntax like that for the x86, esp. because of reassemblability. – Jonathan J. Bloggs May 10 '16 at 00:58
  • @JonathanJ.Bloggs: Right, I agree. I'm a bit surprised it doesn't seem to exist for x86. – Peter Cordes May 10 '16 at 01:00