When we use the mov instruction in assembly, the source and destination operands must be the same size. If I write:

mov rax, 1

Is the operand 1 converted to match the size of the rax register?

For example, if rax were 16 bits wide, would we get:

0000000000000001

?

Peter Cordes
Koinos
  • RAX is 64-bit. In 64-bit mode and generally speaking, immediates are either 32 bits (sign or zero extended) or 64 bits. – Margaret Bloom Oct 31 '18 at 10:30
  • @MargaretBloom: immediates are always sign-extended to the operand-size for opcodes that use narrow immediates. At least I can't think of any where they're zero-extended. If you want zero-extension, you have to use `mov eax, imm32` which has 32-bit operand size and follows the usual rule of writing a 32-bit register zero-extending to fill the 64-bit register. [Why do x86-64 instructions on 32-bit registers zero the upper part of the full 64-bit register?](https://stackoverflow.com/q/11177137). (I assume that's what you meant, but if we consider narrower operand-size then 16 and 8 bits are – Peter Cordes Oct 31 '18 at 10:41
  • There is a specific form, `movabs rax,<64b immediate>`, which encodes the value "1" as a full 64-bit integer. A common modern assembler such as NASM will instead assemble `mov rax,1` as `mov eax,1` with a 32-bit immediate (machine code `b8 01 00 00 00`), which leaves exactly the same final content in `rax` with a much shorter encoding. Anyway, if an instruction has `rax` as its destination register, you can bet that whatever operation is going on targets all 64 bits of that register. How (and whether) the operand is extended depends on the particular instruction and operand. – Ped7g Oct 31 '18 at 10:43
  • Anyway, @koinos: see also [Difference between movq and movabsq in x86-64](https://stackoverflow.com/q/40315803) and [What's the difference between the x86-64 AT&T instructions movq and movabsq?](https://stackoverflow.com/q/52434073) for more about MOV specifically, because MOV is special; it's the only instruction that can use a 64-bit immediate so there are multiple ways an assembler can choose to encode this asm source into machine code. – Peter Cordes Oct 31 '18 at 10:43
  • @PeterCordes Yes, but most assemblers will encode `mov rax, X` as `mov eax, X` if possible, so it's useful to think of them as zero-extended (e.g. `mov rax, 0xf0000000` is only 5 bytes), though that's not technically 100% correct. – Margaret Bloom Oct 31 '18 at 11:08
  • @MargaretBloom: in overall effect yeah. I realize I'm just being pedantic, but the question did seem to maybe be asking about the technicalities of operand-size. It's ambiguous what it's really trying to ask. But I wouldn't say "most". NASM will, but YASM won't, and neither will GAS `.intel_syntax`. I don't know about FASM or MASM, and I haven't checked clang/LLVM `.intel_syntax` to see if it does the assemble-time optimization to a different operand-size, so the resulting asm doesn't explicitly reference RAX anymore. – Peter Cordes Oct 31 '18 at 11:19
  • @PeterCordes Oh, I didn't know that was mostly a NASM feature. – Margaret Bloom Oct 31 '18 at 11:25
  • @MargaretBloom: I think YASM considers it a missed optimization, but I'm not sure if GAS developers would accept a patch if anyone sent one. I haven't looked to see whether they wished it did that or not. I'd guess maybe not, because being able to get more different encodings (for code-alignment purposes) is a feature, and unless they added a way to override it back to the 7-byte encoding you'd lose that. I'm now curious about MASM and FASM, because if they do it then it wouldn't be fair to say it's mostly a NASM feature. – Peter Cordes Oct 31 '18 at 11:48
  • @PeterCordes FASM seems to do it, let me see if I can get my hands on a copy of MASM. – Margaret Bloom Oct 31 '18 at 11:59
  • No, sorry @Peter, FASM doesn't do it but MASM does. – Margaret Bloom Oct 31 '18 at 12:12

1 Answer

There are 2 languages. The first one is assembly language, where you might have a string of characters like "mov rax,1". The second one is machine language where you'll have a set of bytes.

These languages are related, but different. For example, the mov instruction in assembly language is actually multiple different opcodes in machine language (one for moving bytes to/from general purpose registers, one for moving words/dwords/qwords to general purpose registers, one for moving dwords/qwords to control registers, one for moving dwords/qwords to debug registers, etc). The assembler uses the instruction and its operands to select an appropriate opcode (e.g. if you do mov dr6,eax then the assembler will choose the opcode for moving dwords/qwords to debug registers because none of the other opcodes are suitable).

In the same way, the operands may be different. For example, in assembly language the constant 1 has the type "integer" and doesn't have any size (its size is implied from how/where it's used); but in machine code an immediate operand must be encoded somehow, and the size of the encoding will depend on which opcode (and which prefixes) are used for the mov.
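To make the "size depends on the opcode" point concrete, here is a minimal Python sketch that builds the three usual machine-code encodings of `mov rax, <constant>` and picks the shortest one that reproduces the constant. The helper name `encode_mov_rax` is hypothetical, and the "pick the shortest" strategy is only an assumption for illustration: it is not how any particular assembler is implemented, and some assemblers always emit the 7-byte form.

```python
import struct

MASK64 = (1 << 64) - 1

def encode_mov_rax(value):
    """Hypothetical helper: shortest encoding of `mov rax, value`."""
    value &= MASK64  # treat the constant as a 64-bit pattern

    if value <= 0xFFFFFFFF:
        # mov eax, imm32 (opcode B8): writing EAX zero-extends into RAX.
        return bytes([0xB8]) + struct.pack("<I", value)

    if value >= 0xFFFFFFFF80000000:
        # mov r/m64, imm32 (REX.W C7 /0): imm32 is sign-extended to 64 bits,
        # so it only fits values whose upper 33 bits are all ones.
        return bytes([0x48, 0xC7, 0xC0]) + struct.pack("<I", value & 0xFFFFFFFF)

    # mov r64, imm64 (REX.W B8): the full 64-bit immediate is stored.
    return bytes([0x48, 0xB8]) + struct.pack("<Q", value)
```

With this sketch the same source-level constant ends up as a 4-byte or an 8-byte immediate depending on the opcode chosen: `encode_mov_rax(1)` is 5 bytes long, `encode_mov_rax(-1)` is 7 bytes, and `encode_mov_rax(0x100000000)` is 10 bytes.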

For example, if mov rax,1 is converted into the bytes 0x48, 0xC7, 0xC0, 0x01, 0x00, 0x00, 0x00, then you could say that the operand is "64 bits encoded in 4 bytes (using sign extension)"; or you could say that the operand is 32 bits encoded in 4 bytes (and that the instruction only moves 32 bits into RAX and then sign extends into the upper 32 bits of RAX instead of moving anything into them). Even though these things sound different (and even though most people would say the latter is "more correct"), the behaviour is exactly the same, and the only differences are superficial differences in how machine code (a different language that isn't assembly language) is described. In assembly language, the 1 is still a 64-bit operand (implied from context), regardless of what happens in machine language.
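The effect of that 7-byte encoding can be modelled in a few lines of Python. This is only a sketch of the architectural behaviour, not a real decoder: the function name `rax_after` is hypothetical, and it recognises just the 7-byte `mov r/m64, imm32` form shown above plus the 5-byte `mov eax, imm32` form, for comparison.

```python
import struct

def rax_after(code):
    """Hypothetical model: value left in RAX by one `mov` encoding."""
    code = bytes(code)

    if len(code) == 7 and code[:3] == b"\x48\xC7\xC0":
        # REX.W + C7 /0: the 32-bit immediate is sign-extended to 64 bits.
        (imm,) = struct.unpack("<i", code[3:])
        return imm & 0xFFFFFFFFFFFFFFFF

    if len(code) == 5 and code[:1] == b"\xB8":
        # B8: 32-bit write to EAX, upper half of RAX is zeroed.
        (imm,) = struct.unpack("<I", code[1:])
        return imm

    raise ValueError("encoding not covered by this sketch")
```

For the bytes in the example, `rax_after([0x48, 0xC7, 0xC0, 0x01, 0x00, 0x00, 0x00])` gives 1 either way you describe it. The two forms only diverge when bit 31 of the immediate is set: the immediate bytes `00 00 00 f0` yield 0xFFFFFFFFF0000000 in the 7-byte form but 0xF0000000 in the 5-byte form.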

Brendan
  • There are two opcodes for moving bytes to/from GP registers (one for each direction). Or a third if you include `mov reg, imm`: moving immediate bytes. And actually twice that many, because there's a different opcode for 8-bit operand size vs. 16/32/64 (same opcode with different prefixes). But anyway "one opcode for mov to/from GP regs" is definitely not literally correct. A link to http://felixcloutier.com/x86/MOV.html is a good idea here. – Peter Cordes Oct 31 '18 at 11:57
  • Your last paragraph reads like it's describing the `b8 01 00 00 00` (`mov eax,imm32`) encoding. The 7-byte encoding you show uses a sign-extended 32-bit immediate and definitely does write the full RAX with 64-bit operand-size, not leaving them to implicit zero-extension. Would love to +1 this answer once you fix this bug, I think you've accurately identified the OP's confusion between asm and machine code. – Peter Cordes Oct 31 '18 at 12:00
  • @PeterCordes: You can describe it as 2 different 8-bit opcodes, or you can describe it as one 7-bit opcode that has an extra "direction" parameter shoved into (lowest) bit, or you can describe it as 32 different 16-bit opcodes (one for each register, for each direction), or... It's all the same regardless of how someone describes it. – Brendan Oct 31 '18 at 12:12
  • That's fair; Intel defines it the way I described, but your way of looking at it makes sense, too. I assume you're working on an update for the other points, especially the last paragraph. – Peter Cordes Oct 31 '18 at 12:21
  • @PeterCordes: For the encoding I used in the example, I just put "mov rax,1" into an online assembler (at https://defuse.ca/online-x86-assembler.htm ) and copied what it said. There's different ways an assembler could choose to encode the same instruction and it doesn't really matter much which is used in the example. Note that this can actually have practical uses - e.g. I know of at least one assembler (a86 and a386) that uses it for anti-piracy (so the author of the assembler can tell if their assembler was/wasn't used based on which encodings were chosen). – Brendan Oct 31 '18 at 12:21
  • The encoding you got is `mov r/m64, sign_extended_imm32`. Your description is all wrong for it, but fits perfectly for the optimization NASM and MASM do, of optimizing it to `mov eax, imm32` – Peter Cordes Oct 31 '18 at 12:24
  • @PeterCordes: Ah - I see what you're saying now (sorry - fixed now!) – Brendan Oct 31 '18 at 12:26