Understanding Instruction Encoding?

Question

I used a website to encode this:

movw $8, 4(%r8d,%esi,4)

and got:

encoding (hex): 67 66 41 C7 44 B0 04 08 00

Thanks to you I nearly understand everything except 2 small points:

Here we are moving a 2-byte immediate to a 4-byte address. The encoding uses opcode C7, which according to the table I have means one of the following:

mov imm16 to r/m16
mov imm32 to r/m32
mov imm32 (sign extended) to r/m64

Why is there no match?

Why is the immediate 2 bytes? According to what?

Nate Eldredge · Accepted Answer · 2021-07-31T19:29:55.580

There is a match. It's the first one "mov imm16 to r/m16", because of the w in the mnemonic movw. r/m16 means that 16 bits (two bytes) of memory are being read/written. It so happens that you are using a 32-bit effective address to identify which two bytes of memory are to be written, but that's not part of the r/m16 notation.
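To make the split between address size and operand size concrete, here is a minimal Python sketch that models memory as a `bytearray`. The helper names `effective_address` and `store` and the register values are made up for illustration; only the addressing form `disp(base,index,scale)` and the two-byte store come from the instruction above.

```python
def effective_address(base: int, index: int, scale: int, disp: int) -> int:
    """Compute base + index*scale + disp, truncated to a 32-bit address."""
    return (base + index * scale + disp) & 0xFFFFFFFF


def store(memory: bytearray, address: int, value: int, operand_bytes: int) -> None:
    """Write `value` little-endian into `operand_bytes` bytes starting at `address`."""
    memory[address:address + operand_bytes] = value.to_bytes(operand_bytes, "little")


# Hypothetical register contents: r8d = 0x10, esi = 0x3.
memory = bytearray(b"\xff" * 64)
addr = effective_address(base=0x10, index=0x3, scale=4, disp=4)

# movw $8, 4(%r8d,%esi,4): the address is 32 bits wide, the store is 2 bytes wide.
store(memory, addr, 8, 2)
print(hex(addr), memory[addr:addr + 4].hex(" "))  # prints: 0x20 08 00 ff ff
```

Only the two bytes at `addr` and `addr + 1` change; the width of the address calculation has no effect on how many bytes are stored.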

The immediate is two bytes because two bytes are to be written; there would be no point in having more. There are some encodings, though, like the third case, where the immediate is shorter than the operand size and is zero- or sign-extended.
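As a rough check, the bytes from the question can be picked apart by hand. The following Python sketch is not a general disassembler: the function `decode` is a hypothetical helper that handles only this one shape (optional `67`/`66` prefixes, optional REX, opcode `C7 /0`, ModRM with SIB and an 8-bit displacement) and assumes 64-bit mode defaults.

```python
def decode(code: bytes) -> dict:
    """Decode [67] [66] [REX] C7 /0 with a SIB byte and disp8 (64-bit mode only)."""
    i = 0
    address_bits, operand_bits = 64, 32  # 64-bit mode defaults
    while code[i] in (0x66, 0x67):
        if code[i] == 0x67:
            address_bits = 32            # address-size override
        else:
            operand_bits = 16            # operand-size override
        i += 1
    rex = 0
    if 0x40 <= code[i] <= 0x4F:
        rex = code[i]
        i += 1
    if rex & 0x08:                       # REX.W takes precedence over 66
        operand_bits = 64
    opcode = code[i]
    modrm = code[i + 1]
    sib = code[i + 2]
    i += 3
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if opcode != 0xC7 or (mod, reg, rm) != (1, 0, 4):
        raise ValueError("this sketch only handles C7 /0 with SIB + disp8")
    scale = 1 << (sib >> 6)
    index = ((sib >> 3) & 7) | ((rex & 0x02) << 2)  # REX.X extends the index
    base = (sib & 7) | ((rex & 0x01) << 3)          # REX.B extends the base
    disp = int.from_bytes(code[i:i + 1], "little", signed=True)
    i += 1
    imm_bytes = 2 if operand_bits == 16 else 4      # imm16, or imm32 (sign-extended if REX.W)
    imm = int.from_bytes(code[i:i + imm_bytes], "little", signed=True)
    i += imm_bytes
    return {"address_bits": address_bits, "operand_bits": operand_bits,
            "base": base, "index": index, "scale": scale, "disp": disp,
            "imm_bytes": imm_bytes, "imm": imm, "length": i}


fields = decode(bytes.fromhex("67 66 41 C7 44 B0 04 08 00"))
print(fields)
```

For the question's encoding this reports a 32-bit address size (from `67`), a 16-bit operand size (from `66`), base register number 8 (`r8`), index register number 6 (`esi`), scale 4, displacement 4, and a 2-byte immediate of 8. The immediate width follows the operand size, not the address size.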

edited Jul 31 '21 at 19:29

answered Jul 31 '21 at 19:27

But I don't get it: we are summing 32-bit addresses, so we get a 32-bit address... w means write to a 16-bit address, so which 16 bits do we take, the lower ones or the higher ones? – Jul 31 '21 at 19:29
I think terminology like "32 bit address" is confusing you. The address is 32 bits, but we are using it to identify 16 bits worth of memory. For instance, suppose `r8d + (esi * 4) + 4` comes out to equal `0x12345678`. Then your `movw` instruction will write `08` to the byte at address `0x12345678`, and write `00` to the byte at address `0x12345679`. Writing two bytes = 16 bits. If you used `movb`, only the byte at `0x12345678` would be written. If you used `movl`, the four bytes at `0x12345678..0x1234567b` would be written (with the values `08 00 00 00` respectively). – Nate Eldredge Jul 31 '21 at 19:33
@coolmo: Literally writing with a 16-bit address would mean writing the bytes starting at `0x5678` (it would always be the low bits). There is no encoding for this in 64-bit long mode, though there is in 32-bit protected mode (sort of, you become more limited in addressing modes). It is pretty much useless either way. – Nate Eldredge Jul 31 '21 at 19:36
Now it's clear. One last thing: there is an opcode for mov imm32 (sign extended) to r/m64 and another one for mov imm64 to r/m64. How do I know which to use (how do I know whether the instruction does sign extension or not)? – Jul 31 '21 at 19:41
@coolmo: At the level of assembly you don't really care about the sign extension: you specify the actual value you want written. If the assembler can represent it as a sign-extended value, it will assemble it that way; otherwise it will pick a different encoding or complain. As for `mov imm64 to r/m64`, I think you are mistaken: no such encoding exists. There is an instruction to `mov imm64 to r64` (register only), `REX.W + B8`. The GNU assembler will automatically pick this encoding if you specify an immediate that does not fit in 32 bits sign-extended, or you can force it with the `movabsq` mnemonic. – Nate Eldredge Jul 31 '21 at 19:47
@coolmo: For example, `movq $0xfffffffffedcba98, %rax` will give you the "mov imm32 sign extended to r/m64" encoding, `REX.W + C7`. So will `movq $0xfffffffffedcba98, (%rax, %rsi, 8)`. However, `movq $0xfedcba9876543210, %rax` will give you `REX.W + B8`, and `movq $0xfedcba9876543210, (%rax, %rsi, 8)` will fail to assemble. – Nate Eldredge Jul 31 '21 at 19:51
@coolmo: The official x86 terminology for these concepts is "address size" vs. "operand size". Those two attributes of an instruction are totally separate and orthogonal, and are controlled by different prefixes. – Peter Cordes Aug 01 '21 at 01:55
@coolmo: Unfortunately x86 doesn't have any mov-immediate that sign-extends a byte immediate (which would allow a 3-byte mov to reg for small numbers). The only sign-extending mov-immediate is x86-64's REX.W version of `mov $imm32, r/m32`. The REX.W version of `mov $imm32, reg` (no ModRM) is special and takes a 64-bit immediate. Re: assemblers choosing automatically: if you use a symbolic address like `mov $func, %rdi`, GAS will default to `movq`. Only if the value is a *compile*-time (not link-time) constant can it choose movabs if needed. [Difference between movq and movabsq in x86-64](https://stackoverflow.com/q/40315803) – Peter Cordes Aug 01 '21 at 01:58
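
The rule discussed in these comments (use `REX.W + C7` when the 64-bit value is the sign extension of its low 32 bits, fall back to `REX.W + B8` only for a register destination, otherwise fail) can be sketched in Python. This mirrors the behaviour described in the comments, not GAS's actual implementation; the function names and the returned strings are hypothetical labels.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def fits_sign_extended_imm32(value: int) -> bool:
    """True if the 64-bit value equals the sign extension of its low 32 bits."""
    value &= MASK64
    low = value & 0xFFFFFFFF
    extended = low | (0xFFFFFFFF00000000 if low & 0x80000000 else 0)
    return extended == value


def choose_mov_encoding(value: int, dest_is_register: bool) -> str:
    """Pick an encoding for a 64-bit mov of an immediate, per the comments above."""
    if fits_sign_extended_imm32(value):
        return "REX.W + C7 /0 imm32"     # works for both register and memory
    if dest_is_register:
        return "REX.W + B8+rd imm64"     # movabsq, register destination only
    raise ValueError("no mov imm64 to memory encoding exists")


print(choose_mov_encoding(0xFFFFFFFFFEDCBA98, dest_is_register=True))
print(choose_mov_encoding(0xFEDCBA9876543210, dest_is_register=True))
```

With the values from the comments, `0xfffffffffedcba98` selects the `C7` form for either destination, `0xfedcba9876543210` selects the `B8` form for a register, and the same value with a memory destination raises an error, matching the "fail to assemble" case.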
