It's very common to hear or read comments like: "assembler is practically machine code, but using symbols instead of direct binary codes".

My question is: how much truth does this kind of statement hold in general?

Sep Roland
Daniel Bandeira
  • It's a human readable representation of machine code with some facilities to make it easier to write. There is a very close correspondence. – fuz Oct 28 '21 at 23:02
  • If you strictly mean the instructions only, and furthermore ignore pseudoinstructions, linker fixups and other minor encoding variations, it is pretty close, yes :) The ARM manual, for example, is full of stuff like _"In certain circumstances, the assembler can substitute MVN for MOV, or MOV for MVN. Be aware of this when reading disassembly listings."_ – Jester Oct 28 '21 at 23:05
  • Read the processor manuals, the instruction set is couched in terms you’ll use in assembly language. While it’s not unreasonable to ask a question like this, a very quick examination of instruction sets will give you the answer. Have you done that? – DisappointedByUnaccountableMod Oct 28 '21 at 23:07
  • For most ISAs, there isn't much wiggle room in how to encode an asm source-level instruction. But for some, like x86-64, there are some choices. Assemblers pick the shortest, but even within that there's sometimes a choice. Usually this doesn't affect performance, e.g. [x86 XOR opcode differences](https://stackoverflow.com/q/50336269). Some assemblers have syntax to distinguish those variants, like [What is the ".s" suffix in x86 instructions?](https://stackoverflow.com/q/16746922) – Peter Cordes Oct 28 '21 at 23:13
  • Sometimes there is a perf difference: the optimization in [Which Intel microarchitecture introduced the ADC reg,0 single-uop special case?](https://stackoverflow.com/q/51664369) doesn't work for the no-modrm `adc al,0` special encoding, only for the 3-byte `adc r/m8, imm8` encoding. Or see [How to force NASM to encode \[1 + rax\*2\] as disp32 + index\*2 instead of disp8 + base + index?](https://stackoverflow.com/q/48848230) – Peter Cordes Oct 28 '21 at 23:16

2 Answers

If you start with a "plain text" representation of machine code (e.g. with opcodes/numbers replaced with mnemonics, addresses/numbers replaced with labels), then most assemblers also:

a) Allow more complicated expressions in operands (e.g. allow something like "mov eax,(1234*5+6)/7", where the assembler calculates the right value and emits a "mov eax,882" instruction).

b) Have preprocessors allowing you to write macros, etc. Often this includes conditional code, and sometimes it's powerful enough to allow you to create a new language and/or high level language constructs (e.g. imagine "while" and "endwhile" macros).

c) Can auto-select the shortest encoding. For example, if the instruction could be encoded with either a 32-bit immediate operand or an 8-bit immediate operand that is sign-extended to 32 bits, then the assembler can look at the operand and determine whether the shorter sign-extended encoding will work.
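
As a rough illustration of (a) and (c), here is a minimal Python sketch of what an assembler does for two x86 instructions. The function names are made up for this example, and it only handles `mov eax, imm` and `add eax, imm`; a real assembler covers far more cases.

```python
import struct

def fits_in_signed_byte(value):
    """True if value can be stored as a sign-extended 8-bit immediate."""
    return -128 <= value <= 127

def encode_mov_eax_imm(value):
    """mov eax, imm32: opcode B8 followed by a little-endian 32-bit immediate."""
    return bytes([0xB8]) + struct.pack("<I", value & 0xFFFFFFFF)

def encode_add_eax_imm(value):
    """add eax, imm: pick the shorter encoding when the immediate allows it."""
    if fits_in_signed_byte(value):
        # 83 /0 ib: add r/m32, imm8 (sign-extended); ModRM byte C0 selects eax
        return bytes([0x83, 0xC0]) + struct.pack("<b", value)
    # 05 id: add eax, imm32 (the form specific to eax)
    return bytes([0x05]) + struct.pack("<I", value & 0xFFFFFFFF)

# (a) The expression is folded at assembly time; only the result is encoded.
folded = (1234 * 5 + 6) // 7

print(encode_mov_eax_imm(folded).hex(" "))
print(encode_add_eax_imm(5).hex(" "))     # 3-byte imm8 form
print(encode_add_eax_imm(1000).hex(" "))  # 5-byte imm32 form
```

Both `add` forms do the same thing at run time; the source line does not say which one you get, which is one place where assembly is not a strict 1:1 spelling of the bytes.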

All of these things make a huge difference for source code maintenance. For example, if you add a few instructions somewhere, you don't have to manually re-calculate all the addresses/offsets for call/jump/branch targets and data accesses; and you can do a "#define COST_OF_CHEESE 123" in one place to make the value easy to change later (without having to find everywhere it was used).
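
To show why labels matter, here is a hedged sketch of a two-pass toy assembler in Python. It is hypothetical and understands only labels, `nop` (byte 90) and a short `jmp label` (EB plus an 8-bit displacement relative to the end of the jump); the `assemble` function is invented for this example.

```python
def assemble(lines):
    """Toy two-pass assembler for labels, nop, and short jmp only."""
    # Pass 1: walk the program and record the address of every label.
    labels = {}
    address = 0
    for line in lines:
        if line.endswith(":"):
            labels[line[:-1]] = address
        elif line == "nop":
            address += 1          # nop is one byte
        elif line.startswith("jmp "):
            address += 2          # opcode + rel8
        else:
            raise ValueError(f"unsupported line: {line!r}")

    # Pass 2: emit bytes, turning label names into relative displacements.
    out = bytearray()
    for line in lines:
        if line.endswith(":"):
            continue
        if line == "nop":
            out.append(0x90)
        else:
            target = labels[line.split()[1]]
            rel = target - (len(out) + 2)   # relative to the next instruction
            if not -128 <= rel <= 127:
                raise ValueError("target out of range for a short jump")
            out += bytes([0xEB, rel & 0xFF])
    return bytes(out)

program = ["start:", "nop", "jmp end", "nop", "end:", "jmp start"]
print(assemble(program).hex(" "))

# Insert one more nop at the top: the backward displacement changes,
# but the source lines containing the jumps do not.
print(assemble(["start:", "nop"] + program[1:]).hex(" "))
```

Writing the raw bytes by hand would mean recomputing the displacement of `jmp start` yourself after every such change.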

Brendan

Here's an example, using y86.

add %rdi, %rsi

Becomes something like

00 01 02

Basically, 00 is the byte representation of the opcode add. When the computer sees an add, it knows to interpret the next two bytes as registers (this is slightly more complex in x86). 01 and 02 are the byte 'names' or encodings for the registers %rdi and %rsi respectively.

Some parts of this example may not be entirely reflective of reality, but this is basically the correspondence between machine code and assembly. Instructions are opcodes + 1-5 bytes that are interpreted differently depending on the opcode.
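
The same idea as a minimal Python sketch. The byte values in the tables are the illustrative ones from the example above, not the real Y86-64 or x86 encodings, and the function names are made up.

```python
# Illustrative numbers only: taken from the simplified example, not a real ISA.
OPCODES = {"add": 0x00}
REGISTERS = {"%rdi": 0x01, "%rsi": 0x02}

def assemble_line(line):
    """Translate 'mnemonic reg, reg' into opcode byte + one byte per register."""
    mnemonic, operands = line.split(None, 1)
    regs = [r.strip() for r in operands.split(",")]
    return bytes([OPCODES[mnemonic]] + [REGISTERS[r] for r in regs])

def disassemble(code):
    """Reverse the lookup: three bytes back to the text form."""
    op_names = {v: k for k, v in OPCODES.items()}
    reg_names = {v: k for k, v in REGISTERS.items()}
    return f"{op_names[code[0]]} {reg_names[code[1]]}, {reg_names[code[2]]}"

machine_code = assemble_line("add %rdi, %rsi")
print(machine_code.hex(" "))
print(disassemble(machine_code))
```

The translation is a table lookup in both directions, which is the sense in which assembly and machine code correspond.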

Sep Roland
squidwardsface
  • Different ISAs work differently; most modern ISAs use fixed-width instructions, or a mix of 2-byte and 4-byte, with the opcode being a bitfield within an instruction. x86 does (usually) use a whole byte or two as opcode, but the operand encoding is via a single ModRM byte which can signal additional bytes (SIB and/or constant displacement) for memory operands. x86 only has 8 registers, so two 3-bit fields plus a 2-bit addressing mode fit in one byte. (x86-64 uses a REX prefix to provide an extra bit to each reg field for 16 regs). It's silly for y86 to use two separate bytes to encode regs. – Peter Cordes Oct 28 '21 at 23:20
  • @PeterCordes You did not read the part of my answer where I very clearly stated that my example is meant to illuminate the general idea, not be 100% correct. You are also bikeshedding y86, which is nonsensical, as y86 was quite literally created for the purposes of pedagogy. – squidwardsface Oct 28 '21 at 23:30
  • The last sentence of your answer, *Instructions are opcodes + 1-5 bytes that are interpreted differently depending on the opcode.*, looked like it was intended to describe the general case of other ISAs, not just your y86 example in particular. That's why I commented. Re: y86 being simplistic: yes, but two halves of one byte would be pretty much as simple. Anyway, point being, real world ISAs don't waste that much coding space, so an example from a teaching ISA isn't representative of how real machine code packs fields together within bytes or words. – Peter Cordes Oct 28 '21 at 23:34
  • (And BTW, I'm not the downvoter on this answer. I don't think it deserved one. I don't think it's general enough that I'd want to upvote it either, but with some improvements to make it clearer that *all* of this is y86-specific, and the general case usually involves bitfields, often in fixed-width instructions, it would be decent.) – Peter Cordes Oct 28 '21 at 23:42
  • @PeterCordes Ah, I understand now. Before, I assumed that you had taken issue with the lax nature of my answer. Now I see that you were just providing real-world details as further elaboration. I sincerely apologize for being so abrasive and thank you for your patience. – squidwardsface Oct 29 '21 at 00:06
  • Yup, cheers, glad we got that sorted out. :) – Peter Cordes Oct 29 '21 at 00:32