Why does the Solaris assembler generate different machine code than the GNU assembler here?

Question

I wrote this little assembly file for amd64. What the code does is not important for this question.

        .globl fib

fib:    mov %edi,%ecx
        xor %eax,%eax
        jrcxz 1f
        lea 1(%rax),%ebx

0:      add %rbx,%rax
        xchg %rax,%rbx
        loop 0b

1:      ret

Then I proceeded to assemble and then disassemble this on both Solaris and Linux.

Solaris

$ as -o y.o -xarch=amd64 -V y.s                            
as: Sun Compiler Common 12.1 SunOS_i386 Patch 141858-04 2009/12/08
$ dis y.o                                                  
disassembly for y.o


section .text
    0x0:                    8b cf              movl   %edi,%ecx
    0x2:                    33 c0              xorl   %eax,%eax
    0x4:                    e3 0a              jcxz   +0xa      <0x10>
    0x6:                    8d 58 01           leal   0x1(%rax),%ebx
    0x9:                    48 03 c3           addq   %rbx,%rax
    0xc:                    48 93              xchgq  %rbx,%rax
    0xe:                    e2 f9              loop   -0x7      <0x9>
    0x10:                   c3                 ret

Linux

$ as --64 -o y.o -V y.s
GNU assembler version 2.22.90 (x86_64-linux-gnu) using BFD version (GNU Binutils for Ubuntu) 2.22.90.20120924
$ objdump -d y.o

y.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <fib>:
   0:   89 f9                   mov    %edi,%ecx
   2:   31 c0                   xor    %eax,%eax
   4:   e3 0a                   jrcxz  10 <fib+0x10>
   6:   8d 58 01                lea    0x1(%rax),%ebx
   9:   48 01 d8                add    %rbx,%rax
   c:   48 93                   xchg   %rax,%rbx
   e:   e2 f9                   loop   9 <fib+0x9>
  10:   c3                      retq

How come the generated machine code is different? Sun as generates 8b cf for mov %edi,%ecx while gas generates 89 f9 for the very same instruction. Is this because x86 offers several ways to encode the same instruction, or is there a real difference between these two encodings?

[Encoding ADC EAX, ECX - 2 different ways to encode?](https://stackoverflow.com/q/22217436/995714), [What is the “.s” suffix in x86 instructions?](https://stackoverflow.com/q/16746922/995714) — phuclv, Dec 06 '17 at 12:37
Also related: [x86 XOR opcode differences](https://stackoverflow.com/q/50336269) re: two choices of opcode for reg,reg instructions, patterns in opcodes (direction and size bits), and more links to related Q&As. — Peter Cordes, Aug 22 '21 at 18:08

score 6 · Accepted Answer · answered Jul 31 '13 at 14:29

Some x86 instructions have multiple encodings that do the same thing. In particular, any instruction that acts on two registers can have the registers swapped and the direction bit in the instruction reversed.

Which one a given assembler/compiler picks simply depends on what the tool authors chose.
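As a rough illustration of the point above, the following Python sketch builds both encodings of a register-to-register instruction by swapping the two register fields of the ModRM byte and flipping the direction bit (bit 1 of the opcode). The function name `encode_reg_reg` and the small lookup tables are hypothetical helpers made up for this example; it only covers `mov`, `add` and `xor` on the eight legacy registers, and assumes a plain `0x48` REX.W prefix for 64-bit operands.

```python
# Register name (without the e/r size prefix) -> 3-bit register number.
# Only the eight legacy registers are covered in this sketch.
REGS = {"ax": 0, "cx": 1, "dx": 2, "bx": 3, "sp": 4, "bp": 5, "si": 6, "di": 7}

# Opcode of the "op r/m, reg" form (direction bit clear).
# Setting bit 1 (| 0x02) gives the "op reg, r/m" form.
BASE_OPCODES = {"add": 0x01, "xor": 0x31, "mov": 0x89}


def encode_reg_reg(mnemonic, src, dst):
    """Return both encodings of `mnemonic %src,%dst` (AT&T order) as hex strings.

    Hypothetical helper: register names are given like "edi" or "rax";
    a leading "r" is taken to mean a 64-bit operand (REX.W prefix 0x48).
    """
    prefix = [0x48] if src.startswith("r") else []
    s, d = REGS[src[1:]], REGS[dst[1:]]
    opcode = BASE_OPCODES[mnemonic]

    # Direction bit clear: reg field holds the source, r/m field the destination.
    rm_is_dest = prefix + [opcode, 0xC0 | (s << 3) | d]
    # Direction bit set: reg field holds the destination, r/m field the source.
    reg_is_dest = prefix + [opcode | 0x02, 0xC0 | (d << 3) | s]

    def to_hex(data):
        return " ".join(f"{b:02x}" for b in data)

    return to_hex(rm_is_dest), to_hex(reg_is_dest)


if __name__ == "__main__":
    for insn in [("mov", "edi", "ecx"), ("xor", "eax", "eax"), ("add", "rbx", "rax")]:
        gas_style, sun_style = encode_reg_reg(*insn)
        print(f"{insn[0]} %{insn[1]},%{insn[2]}: {gas_style}  |  {sun_style}")
```

For the three instructions in the question, the first column matches the bytes in the gas listing and the second column matches the bytes in the Sun as listing.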

Drew McGowen

It's really cool how you cannot assume that the same assembly generates the same machine code across different assemblers that are supposed to accept the same syntax. – fuz Jul 31 '13 at 14:33
You really can't even assume anything about what it generates, as long as what it does is the same. Hell, it could decide to implement `a + b` as `a - (0 - b)`. – Drew McGowen Jul 31 '13 at 14:35
Is it safe to assume the assembler will always pick the shortest encoding for the instruction I request? – fuz Jul 31 '13 at 14:38
Technically, I'd think not, but in most cases, yes. The catch here is things like relocations (like calls to other functions). These can usually be represented with 1- or 2-byte offsets, but relocations require it to be 4 bytes. – Drew McGowen Jul 31 '13 at 14:40
I remember that the A86 assembler used these alternate instruction encodings to generate a unique "fingerprint" for detecting unauthorized use. :-) – Brian Knoblauch Aug 01 '13 at 17:18

score 1 · Answer 2 · answered Aug 05 '13 at 20:42

You've not specified the operand size for the mov, xor and add operations. This creates some ambiguity. The GNU assembler manual, i386 Mnemonics, mentions this:

If no suffix is specified by an instruction then as tries to fill in the missing suffix based on the destination register operand (the last one by convention). [ ... ] . Note that this is incompatible with the AT&T Unix assembler which assumes that a missing mnemonic suffix implies long operand size.

This implies the GNU assembler chooses differently - it'll pick the opcode with the R/M byte specifying the target operand (because the destination size is known/implied) while the AT&T one chooses the opcode where the R/M byte specifies the source operand (because the operand size is implied).

I tried that experiment, though: giving explicit operand sizes in your assembly source doesn't change the GNU assembler output. There is, however, another part of the same documentation:

Different encoding options can be specified via optional mnemonic suffix. `.s' suffix swaps 2 register operands in encoding when moving from one register to another.

which one can use; the following source code, assembled with GNU as, produces the opcodes you got from Solaris as:

.globl fib

fib:    movl.s %edi,%ecx
        xorl.s %eax,%eax
        jrcxz 1f
        leal 1(%rax),%ebx

0:      addq.s %rbx,%rax
        xchgq %rax,%rbx
        loop 0b

1:      ret

FrankH.

The portion of the manual you refer to states that there is an ambiguity if the size is not clear from the registers, which is not the case in a register-to-register move. The manual points at cases such as mov $12,(%eax), where it isn't clear what size is meant, as that could be a byte, a word, a longword or a quadword mov. – fuz Aug 05 '13 at 20:52
@FUZxxl the case you mentioned, `mov $..., (%...)` does _not_ have a "destination register operand" (as per the doc). It's _always_ ambiguous without the size (and/or a default assumption on operand size). – FrankH. Aug 06 '13 at 16:45
Yes. I did mention this as an example of an ambiguous instruction. mov %eax,%ebx however is not ambiguous. – fuz Aug 06 '13 at 19:52

The whole first half of this answer is bogus, and seems to be based on a misunderstanding of x86 machine code. The ModRM byte specifies *both* register operands: one in the /r field, and one in the /m field (with the mode field specifying a register mode, rather than a memory operand, so there's no SIB byte or disp8 or disp32). The part you quoted doesn't imply anything about how the assembler chooses which operand is the /r and which is the /m, only about the operand size. The `.s` part of the answer is definitely interesting, though, I didn't know about that! – Peter Cordes Oct 20 '16 at 08:46
@Peter Cordes: You're correct that here, it's opcode ambiguity, not implicit sizing: `0x89` is a `MOV` with mem or register as _target_ (i.e. a 'store' if to-mem) while `0x8b` is a `MOV` with mem or register as _source_ (i.e. a 'load' if from-mem). For reg-reg transfers, there are two instructions that can do the same thing. Yet, while the R/M byte in both cases specifies reg-reg (leading two bits set), the operands are reversed: what's the source-reg for `0x8b` is the dest-reg for `0x89` and vice versa. The first one, `0xCF`, is `11-001-111` while the second one, `0xF9`, is `11-111-001`. – FrankH. Dec 22 '16 at 16:15
Yes, exactly my point. I'd suggest that you edit out any mention of operand-size, since having a choice of two opcodes for the same reg-reg instruction has nothing to do with operand-size, or `q/l/w/b` suffix, or any difference from the AT&T Unix assembler. (I didn't know that it always defaulted to `l` instead of looking at the register operands, so that's an interesting fact, but doesn't belong in this answer.) – Peter Cordes Dec 22 '16 at 20:15
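
The ModRM split discussed in these comments can be checked with a short Python sketch. It is only an illustration under stated assumptions: `decode_mov` is a hypothetical helper, it handles just the two 32-bit register-to-register `MOV` opcodes `0x89` and `0x8b`, and it ignores prefixes and memory operands entirely.

```python
# 32-bit register names indexed by their 3-bit register number.
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def split_modrm(modrm):
    """Split a ModRM byte into its (mod, reg, rm) bit fields."""
    return modrm >> 6, (modrm >> 3) & 0b111, modrm & 0b111


def decode_mov(opcode, modrm):
    """Decode a 32-bit register-to-register MOV into AT&T syntax.

    Hypothetical helper covering only opcodes 0x89 and 0x8b with mod == 11.
    """
    mod, reg, rm = split_modrm(modrm)
    if mod != 0b11:
        raise ValueError("only register-direct ModRM bytes are handled")
    if opcode == 0x89:      # MOV r/m32, r32: reg field is the source
        src, dst = reg, rm
    elif opcode == 0x8B:    # MOV r32, r/m32: reg field is the destination
        src, dst = rm, reg
    else:
        raise ValueError("only opcodes 0x89 and 0x8b are handled")
    return f"mov %{REG32[src]},%{REG32[dst]}"


if __name__ == "__main__":
    print(decode_mov(0x8B, 0xCF))  # bytes emitted by Sun as
    print(decode_mov(0x89, 0xF9))  # bytes emitted by gas
```

Both byte pairs decode to the same instruction, because the opcode's direction bit and the swapped reg and r/m fields cancel each other out.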
