Why can't we move a 64-bit immediate value to memory?

Question

First, I am a little confused about the difference between movq and movabsq. My textbook says:

The regular movq instruction can only have immediate source operands that can be represented as 32-bit two’s-complement numbers. This value is then sign extended to produce the 64-bit value for the destination. The movabsq instruction can have an arbitrary 64-bit immediate value as its source operand and can only have a register as a destination.

I have two questions about this.

Question 1

The movq instruction can only have immediate source operands that can be represented as 32-bit two’s-complement numbers.

so it means that we can't do

movq    $0x123456789abcdef, %rbp

and we have to do:

movabsq $0x123456789abcdef, %rbp

But why is movq designed not to work with 64-bit immediate values? That really goes against the purpose of q (quad word), and we need a separate movabsq just for this purpose. Isn't that a hassle?

Question 2

Since the destination of movabsq has to be a register, not memory, we can't move a 64-bit immediate value to memory like this:

movabsq $0x123456789abcdef, (%rax)

but there is a workaround:

movabsq $0x123456789abcdef, %rbx
movq    %rbx, (%rax)   # the source operand is a register, not an immediate constant, and the destination of movq can be memory

So why is the rule designed to make things harder?

Note that `movq $0xFFFFFFFFFFFFFFFF, (%rax)` *is* encodeable because the top 32 bits match bit #32. All-F = all-ones which is the same as `-1` in 2's complement. Something like `0x12345678abcd` that has more than 32 significant bits would work as an example. (And be easier to grok than just leaving off one of the Fs.) — Peter Cordes, Jul 07 '20 at 10:36
Also note that GAS assembles `movq $0x123456789abcdef, %rbp` to the same machine code as `movabsq`. It notices that the number won't fit in a 32-bit immediate and automatically chooses 64-bit, because that's possible for a register destination. (It doesn't do that automatically for assemble-time constants that haven't been defined yet, or for addresses because addresses sometimes can be 32-bit. So writing `movabs` explicitly is still sometimes necessary.) All of that is unrelated to the actual question of why you can't have a memory destination, though. — Peter Cordes, Jul 07 '20 at 10:53
The short answer to why we can't is because it isn't provided for in the instruction set. A long answer would seek to justify why, but that really goes to design choices made long ago. — Erik Eidt, Jul 07 '20 at 13:05

Peter Cordes · Answer 1 · 2023-09-01T04:20:02.390

Yes, mov to a register and then to memory for immediates that won't fit in a sign-extended 32-bit value (unlike -1, aka 0xFFFFFFFFFFFFFFFF, which does fit). The "why" part is an interesting question, though:

Remember that asm only lets you do what's possible in machine code. Thus it's really a question about ISA design. Such decisions often involve what's easy for the hardware to decode, as well as encoding efficiency considerations. (Using up opcodes on rarely-used instructions would be bad.)

It's not designed to make things harder. It's designed to not need any new opcodes for mov, since AMD was extending x86 to 64-bit and aiming not to need a whole separate decoder unit for different modes. It's also designed to limit 64-bit immediates to one special instruction format. mov is the only instruction that can ever use a 64-bit immediate at all (or a 64-bit absolute address, for load/store of AL/AX/EAX/RAX).

Check out Intel's manual for the forms of mov (note that it uses Intel syntax, destination first, and so will my answer.) I also summarized the forms (and their instruction lengths) in Difference between movq and movabsq in x86-64, as did @MargaretBloom in answer to What's the difference between the x86-64 AT&T instructions movq and movabsq?.

Allowing an imm64 along with a ModR/M addressing mode would also make it possible to run into the 15-byte upper limit on instruction length pretty easily, e.g. REX + opcode + imm64 is 10 bytes, and ModRM+SIB+disp32 is 6. So mov [rdi + rax*8 + 1234], imm64 would not be encodeable even if there was an opcode for mov r/m64, imm64.
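As a rough check of that byte count, here is a minimal sketch in Python. The helper name `insn_length` and its keyword arguments are made up for illustration; the per-field sizes are the ones quoted above, and the `mov r/m64, imm64` form is hypothetical (no such opcode exists).

```python
MAX_INSN_LEN = 15  # architectural upper limit on x86 instruction length, in bytes

def insn_length(rex=0, opcode=1, modrm=0, sib=0, disp=0, imm=0):
    """Sum the byte counts of the fields that make up one instruction."""
    return rex + opcode + modrm + sib + disp + imm

# movabs $imm64, %reg: REX.W + opcode (register number in the opcode) + imm64
movabs_len = insn_length(rex=1, opcode=1, imm=8)

# Existing mov r/m64, imm32 with a [base + index*scale + disp32] addressing mode
mov_mem_imm32_len = insn_length(rex=1, opcode=1, modrm=1, sib=1, disp=4, imm=4)

# Hypothetical mov r/m64, imm64 with the same addressing mode
mov_mem_imm64_len = insn_length(rex=1, opcode=1, modrm=1, sib=1, disp=4, imm=8)

print(movabs_len, mov_mem_imm32_len, mov_mem_imm64_len)
print("hypothetical form fits:", mov_mem_imm64_len <= MAX_INSN_LEN)
```

The hypothetical form comes out one byte over the limit, and that is before counting any segment-override or address-size prefix.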

And that's assuming they repurposed one of the 1-byte opcodes that were freed up by making some instructions invalid in 64-bit mode (e.g. aaa), which might be inconvenient for the decoders (and instruction-length pre-decoders) because in other modes those opcodes don't take a ModRM byte or an immediate.

movq is for the forms of mov with a normal ModRM byte to allow an arbitrary addressing mode as the destination. (Or as the source for movq r64, r/m64). AMD chose to keep the immediate for these as 32-bit, same as with 32-bit operand size¹.

These forms of mov are the same instruction format as other instructions like add. For ease of decoding, this means a REX prefix doesn't change the instruction-length for these opcodes. Instruction-length decoding is already hard enough when the addressing mode is variable-length.

So movq is 64-bit operand-size but otherwise the same instruction format mov r/m64, imm32 (becoming the sign-extended-immediate form, same as every other instruction which only has one immediate form), and mov r/m64, r64 or mov r64, r/m64.
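To make the "sign-extended imm32" condition concrete, here is a small Python sketch of the test an assembler has to apply before it can use the mov r/m64, imm32 form. The function name `fits_simm32` is my own, not something taken from GAS or NASM.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

def fits_simm32(value: int) -> bool:
    """True if the 64-bit pattern equals its low 32 bits sign-extended to 64 bits."""
    value &= MASK64                      # treat the input as a 64-bit bit pattern
    low32 = value & 0xFFFFFFFF
    if low32 & 0x80000000:               # bit 31 set: the upper 32 bits become all ones
        extended = low32 | 0xFFFFFFFF00000000
    else:                                # bit 31 clear: the upper 32 bits become all zeros
        extended = low32
    return extended == value

for v in (0x7FFFFFFF, 0xFFFFFFFFFFFFFFFF, 0xFFFFFFFF, 0x123456789ABCDEF):
    print(hex(v), fits_simm32(v))
```

Under this check, 0xFFFFFFFFFFFFFFFF (i.e. -1) is usable with a memory destination, while 0xFFFFFFFF and 0x123456789abcdef are not.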

movabs is the 64-bit form of the existing no-ModRM short form mov reg, imm32. This one is already a special case (because of the no-modrm encoding, with register number from the low 3 bits of the opcode byte). Small positive constants can just use 32-bit operand-size for implicit zero-extension to 64-bit with no loss of efficiency (like 5-byte mov eax, 123 / AT&T mov $123, %eax in 32 or 64-bit mode). And having a 64-bit absolute mov is useful so it makes sense AMD did that.

Since there's no ModRM byte, it can only encode a register destination. It would take a whole different opcode to add a form that could take a memory operand.

From one POV, be grateful you get a mov with 64-bit immediates at all; RISC ISAs like AArch64 (with fixed-width 32-bit instructions) need more like 4 instructions just to get a 64-bit value into a register. (Unless it's a repeating bit-pattern; AArch64 is actually pretty cool, unlike earlier RISCs like MIPS64 or PowerPC64.)

If AMD64 was going to introduce a new opcode for mov, mov r/m, sign_extended_imm8 would be vastly more useful to save code-size. It's not at all rare for compilers to emit multiple mov qword ptr [rsp+8], 0 instructions to zero a local array or struct, each one containing a 4-byte 0 immediate. Putting a small non-zero number in a register is also fairly common, and such an encoding would make mov eax, 123 a 3-byte instruction (down from 5) and mov rax, -123 a 4-byte instruction (down from 7). It would also make it possible to zero a register in 3 bytes without clobbering FLAGS.

Allowing mov imm64 to memory would be useful rarely enough that AMD decided it wasn't worth making the decoders more complex. In this case I agree with them, but AMD was very conservative with adding new opcodes. So many missed opportunities to clean up x86 warts, like widening setcc would have been nice. (Intel finally got around to this with APX providing REX2 and EVEX prefixes for a zero-upper form of setcc.) But I think AMD wasn't sure AMD64 would catch on, and didn't want to be stuck needing a lot of extra transistors and/or power to support a feature if people didn't use it.

Footnote 1:
Using 32-bit immediates in general is pretty obviously a good decision for code-size. It's very rare to want to add an immediate that's outside the +-2GiB range. A wider immediate could be useful for bitwise stuff like AND, but for setting/clearing/flipping a single bit the bts / btr / btc instructions are good (taking a bit-position as an 8-bit immediate, instead of needing a mask). You don't want sub rsp, 1024 to be an 11-byte instruction; 7 is already bad enough.

Giant instructions? Not very efficient

At the time AMD64 was designed (early 2000s), CPUs with uop caches weren't a thing. (Intel P4 with a trace cache did exist, but in hindsight it was regarded as a mistake.) Instruction fetch/decode happens in chunks of up-to-16 bytes, so having one instruction that's nearly 16 bytes isn't much better for the front-end than movabs $imm64, %reg.

Of course if the back-end isn't keeping up with the front-end, that bubble of only 1 instruction decoded this cycle can be hidden by buffering between stages.

Keeping track of that much data for one instruction would also be a problem. The CPU has to put that data somewhere, and if there's a 64-bit immediate and a 32-bit displacement in the addressing mode, that's a lot of bits. Normally an instruction needs at most 64-bits of space for an imm32 + a disp32.

BTW, there are special no-modrm opcodes for most operations with RAX and an immediate. (x86-64 evolved out of 8086, where AX/AL was more special, see this for more history and explanation). It would have been a plausible design for those add/sub/cmp/and/or/xor/... rax, sign_extended_imm32 forms with no ModRM to instead use a full imm64. The most common case for RAX, immediate uses an 8-bit sign-extended immediate (-128..127), not this form anyway, and it only saves 1 byte for instructions that need a 4-byte immediate. If you do need an 8-byte constant, though, putting it in a register or memory for reuse would be better than doing a 10-byte and-imm64 in a loop.

score 5 · Answer 2 · edited Jul 19 '22 at 17:30

For the first question:

From the official documentation of gnu assembler:

In 64-bit code, movabs can be used to encode the mov instruction with the 64-bit displacement or immediate operand.

mov reg64, imm (in intel syntax, destination first) is the only instruction that accepts a 64-bit immediate value as a parameter. That's why you can't write a 64-bit immediate value directly to memory, only to a register. That form of mov uses an opcode that includes a register number, rather than specifying a reg/mem destination via a ModRM byte.

For the second question:

For other destinations, for example a memory location, a 32-bit immediate can be sign-extended to a 64-bit immediate (which means the top 33 bits are the same there). In this case, you use the movq instruction.

This is also possible if the target is a register, saving 3 bytes:

48 B8 FF FF FF 7F 00 00 00 00   movabs $0x7FFFFFFF, %rax
48 C7 C0 FF FF FF 7F            movq   $0x7FFFFFFF, %rax

For the 64-bit immediate 0xFFFFFFFF, the top 33 bits are not all the same (the upper 32 bits are zero but bit 31 is set), so the sign-extended movq form cannot be used here. That's why I chose 0x7FFFFFFF in this example. But there is another option:

When writing to a 32-bit register (the lower part of a 64-bit register), the upper 32 bits of the register are zeroed. For a 64-bit immediate whose upper 32 bits are zero, movl can therefore also be used, which saves another byte:

# with mov $imm32, reg/mem32.  Assemblers won't use this for a register destination
C7 C0 FF FF FF FF               movl   $0xFFFFFFFF, %eax

A further byte is saved by the assembler using the special case mov-to-register encoding. (movabs-immediate is the REX.W form of this opcode.)

# the mov $imm32, reg  short-form encoding with no ModRM
B8 FF FF FF FF                  movl   $0xFFFFFFFF, %eax

GAS and other assemblers will automatically use the shortest encoding for the instruction you actually wrote, e.g. they'll encode mov $-1, %eax in 5 bytes.

But GAS does not automatically optimize %rax to %eax. For example, mov $0x00000000FFFFFFFF, %rax will use 10-byte movabsq, not movl.

If you write plain mov, GAS can also choose between movabs and movq depending on the size of the immediate, e.g. for mov $1, %rax. But it won't optimize that to a 5-byte mov-immediate with 32-bit operand-size.

But if you use as -Os (or gcc -Wa,-Os), GAS will use the 5-byte movl $-1, %eax encoding for mov $0xFFFFFFFF, %rax. It has the same architectural effect (one instruction that makes RAX=0x00000000FFFFFFFF), but it's spelled differently in the asm source, using a different operand-size and thus a different register name.

NASM does this optimization (to a different operand-size) by default.
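The choice between the three encodings shown above can be sketched in a few lines of Python. This only illustrates the selection rule for a RAX destination, using the opcode bytes from the listings in this answer. The function name `encode_mov_imm_to_rax` is hypothetical, and this is not how GAS or NASM is actually implemented.

```python
import struct

def encode_mov_imm_to_rax(value: int) -> bytes:
    """Pick the shortest encoding that leaves the given 64-bit value in RAX."""
    value &= 0xFFFFFFFFFFFFFFFF
    if value <= 0xFFFFFFFF:
        # B8 id: mov $imm32, %eax. Writing EAX zero-extends into RAX.
        return b"\xB8" + struct.pack("<I", value)
    if value >= 0xFFFFFFFF80000000:
        # REX.W C7 /0 id: movq $imm32, %rax with the immediate sign-extended.
        return b"\x48\xC7\xC0" + struct.pack("<I", value & 0xFFFFFFFF)
    # REX.W B8 io: movabs $imm64, %rax.
    return b"\x48\xB8" + struct.pack("<Q", value)

for v in (0xFFFFFFFF, -1, 0x123456789ABCDEF):
    code = encode_mov_imm_to_rax(v)
    print(f"{len(code):2d} bytes: {code.hex(' ')}")
```

A memory destination has no equivalent of the first and last branches, so the sign-extended imm32 form is the only immediate option there.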

I fixed some bugs in your answer, have a look at the edit log message. Notably, GAS does *not* optimize `movq` to `movl`, only between `movq` and `movabsq` depending on the immediate. You might want to say something else. Your answer is now not wrong, but I'm not sure it's useful. — Peter Cordes, Jul 07 '20 at 09:50
Yep, and even mov RAX,0x8765432187654321 is going to get broken into 2 uop entries by the decoders. The microarchitectures are optimized for the common case, 32b and less. — Olsonist, Jul 08 '20 at 01:24
@Olsonist: Only in the uop cache. It's a single uop in the decoders and the issue stage (and ROB/RS). But yes, Agner Fog reports that imm64 (if the value isn't a zero-extended 32-bit value) takes 2 entries in a uop cache line, and maybe even takes an extra cycle to read from the uop cache. — Peter Cordes, Oct 06 '21 at 05:21
