172

In the x86-64 Tour of Intel Manuals, I read

Perhaps the most surprising fact is that an instruction such as MOV EAX, EBX automatically zeroes upper 32 bits of RAX register.

The Intel documentation (3.4.1.1 General-Purpose Registers in 64-Bit Mode in the manual Basic Architecture) quoted at the same source tells us:

  • 64-bit operands generate a 64-bit result in the destination general-purpose register.
  • 32-bit operands generate a 32-bit result, zero-extended to a 64-bit result in the destination general-purpose register.
  • 8-bit and 16-bit operands generate an 8-bit or 16-bit result. The upper 56 bits or 48 bits (respectively) of the destination general-purpose register are not modified by the operation. If the result of an 8-bit or 16-bit operation is intended for 64-bit address calculation, explicitly sign-extend the register to the full 64-bits.
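
The three rules quoted above can be modelled in a few lines of Python. This is only a sketch of the architectural result, not something taken from the Intel manual: `write_gpr` and its parameters are hypothetical names, and the 8-bit case models the low byte (AL-style) only, not AH.

```python
MASK64 = (1 << 64) - 1

def write_gpr(old: int, value: int, width: int) -> int:
    """Return the 64-bit register contents after a write of `width` bits.

    `old` is the previous 64-bit value, `value` is the result being written.
    """
    if width not in (8, 16, 32, 64):
        raise ValueError("operand size must be 8, 16, 32 or 64 bits")
    low_mask = (1 << width) - 1
    if width >= 32:
        # 64-bit results fill the register; 32-bit results are zero-extended.
        # Either way the old contents are not needed.
        return value & low_mask
    # 8-bit and 16-bit results are merged below the untouched upper bits.
    return (old & MASK64 & ~low_mask) | (value & low_mask)

rax = 0x1122334455667788
after_mov_eax = write_gpr(rax, 0xAABBCCDD, 32)  # upper 32 bits become zero
after_mov_ax = write_gpr(rax, 0xCCDD, 16)       # upper 48 bits preserved
after_mov_al = write_gpr(rax, 0xDD, 8)          # upper 56 bits preserved
```

Note that only the 8-bit and 16-bit branches read `old`, which is the asymmetry the question is asking about.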

In x86-32 and x86-64 assembly, 16-bit instructions such as

mov ax, bx

don't show this kind of "strange" behaviour: the upper word of eax is left unchanged rather than zeroed.

Thus: why was this behaviour introduced? At first glance it seems illogical (but the reason might be that I am used to the quirks of x86-32 assembly).

Nubok
  • 22
    If you Google for "Partial register stall", you'll find quite a bit of information about the problem they were (almost certainly) trying to avoid. – Jerry Coffin Jun 24 '12 at 14:38
  • 4
    http://stackoverflow.com/questions/25455447/x86-64-registers-rax-eax-ax-al-overwriting-full-register-contents – Hans Passant Aug 27 '15 at 07:16
  • 6
    Not just "most". AFAIK, *all* instructions with an `r32` destination operand zero the high 32, rather than merging. For example, some assemblers will replace `pmovmskb r64, xmm` with `pmovmskb r32, xmm`, saving a REX, because the 64bit destination version behaves identically. Even though the [Operation section of the manual](http://www.felixcloutier.com/x86/PMOVMSKB.html) lists all 6 combinations of 32/64bit dest and 64/128/256b source separately, the implicit zero-extension of the r32 form duplicates the explicit zero-extension of the r64 form. I'm curious about the HW implementation... – Peter Cordes May 26 '16 at 23:38
  • 3
    @HansPassant, the circular reference begins. – kchoi Jul 15 '16 at 23:26
  • 1
    Related: [`xor eax,eax` or `xor r8d,r8d` is the best way to zero RAX or R8](https://stackoverflow.com/questions/33666617/what-is-the-best-way-to-set-a-register-to-zero-in-x86-assembly-xor-mov-or-and) (saving a REX prefix for RAX, and 64-bit XOR isn't even handled specially on Silvermont). Related: [How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent](https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to) – Peter Cordes Nov 21 '17 at 23:44

4 Answers

127

I'm not AMD or speaking for them, but I would have done it the same way, because zeroing the high half doesn't create a dependency on the previous value that the CPU would have to wait on. The register renaming mechanism would essentially be defeated if it wasn't done that way.

This way you can write fast code using 32-bit values in 64-bit mode without having to explicitly break dependencies all the time. Without this behaviour, every single 32-bit instruction in 64-bit mode would have to wait on something that happened before, even though that high part would almost never be used. (Making int 64-bit would waste cache footprint and memory bandwidth; x86-64 most efficiently supports 32-bit and 64-bit operand sizes.)

The behaviour for 8-bit and 16-bit operand sizes is the strange one. The dependency madness is one of the reasons that 16-bit instructions are avoided now. x86-64 inherited this from 8086 for 8-bit and 386 for 16-bit, and decided to have 8-bit and 16-bit registers work the same way in 64-bit mode as they do in 32-bit mode.


See also Why doesn't GCC use partial registers? for practical details of how writes to 8 and 16-bit partial registers (and subsequent reads of the full register) are handled by real CPUs.

Peter Cordes
harold
  • 9
    I don't think it's strange, I think they didn't want to break too much and kept the old behavior there. – Alexey Frunze Jun 24 '12 at 11:56
  • 8
    @Alex when they introduced 32bit mode, there was no old behaviour for the high part. There was no high part before.. Of course after that it couldn't be changed anymore. – harold Jun 24 '12 at 11:59
  • 2
    I was speaking about 16-bit operands, why the top bits don't get zeroed in that case. They don't in non-64-bit modes. And that's kept in 64-bit mode too. – Alexey Frunze Jun 24 '12 at 12:04
  • 4
    I interpreted your "The behaviour for 16bit instructions is the strange one" as "it's strange that zero-extension doesn't happen with 16-bit operands in 64-bit mode". Hence my comments about keeping it the same way in 64-bit mode for better compatibility. – Alexey Frunze Jun 24 '12 at 12:09
  • 9
    @Alex oh I see. Ok. I don't think it's strange from that perspective. Just from a "looking back, maybe it wasn't such a good idea"-perspective. Guess I should have been clearer :) – harold Jun 24 '12 at 12:12
  • 3
    The logic for 16-bit commands can be "If we have to keep compatibility and so dependency on bits 16-31 of the previous register value, clearing bits 32-63 won't save us. So, omit this clearing totally." This isn't the most x86-64 weirdness, anyway. – Netch Oct 31 '16 at 10:32
  • Why would the CPU have to wait ? Because it would have to read the two useless upper bytes in order to write the whole register ? – Bilow Nov 07 '16 at 21:45
  • @Bilow yes, because of the register renaming it can't just do a partial write, the old high bytes aren't in the new physical register yet so they have to be copied over, which means the operation has to wait until that value is produced. There is some trickery in existence that avoids this, for example by renaming the low part as if it existed on its own, and then when the dword is read and there is some separated low part insert an µop to merge the parts. Most Intel µarchs do that, except Netburst. AMD doesn't do that. – harold Nov 07 '16 at 21:58
  • Interesting link for GCC. The fact is my 64bit code generated by GCC includes al, bl, cl, dl... but in general those are for quite particular cases such as a `SETcc` instruction. So they are being used in some rare cases. – Alexis Wilke Nov 21 '20 at 04:16
  • ```Because zeroing the high half doesn't create a dependency on the previous value, that the CPU would have to wait on``` - explicitly declaring the upper bits as *undefined* would be even easier, and probably even faster, than zeroing though? – hanshenrik Dec 07 '22 at 13:48
  • @hanshenrik it's easy to define, but it does not seem very useful to me. What would that allow that is *actually* faster than zeroing? OTOH it would mean software has to include more zero-extension instructions, so it's not free to undefine the upper half like that. – harold Dec 07 '22 at 21:43
13

It simply saves space in the instructions, and the instruction set. You can move small immediate values to a 64-bit register by using existing (32-bit) instructions.

It also saves you from having to encode an 8-byte immediate for MOV RAX, 42 when MOV EAX, 42 can be reused.

This optimization is not as important for 8 and 16 bit ops (because they are smaller), and changing the rules there would also break old code.
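
To make the size argument concrete, here is a small Python sketch that builds the machine-code bytes for the encodings involved. The helper names are hypothetical; the opcodes are the standard ones (`B8+rd` for `mov r32, imm32`, `REX.W C7 /0` for `mov r/m64, imm32`, `REX.W B8+rd` for `mov r64, imm64`, `31 /r` for `xor r/m32, r32`), hard-coded here for RAX/EAX only.

```python
import struct

def mov_eax_imm32(imm: int) -> bytes:
    # B8+rd id: mov r32, imm32 (rd = 0 selects EAX); no REX prefix, no ModRM byte.
    return bytes([0xB8]) + struct.pack("<I", imm)

def mov_rax_simm32(imm: int) -> bytes:
    # REX.W + C7 /0 id: mov r/m64, imm32, with the immediate sign-extended to 64 bits.
    # ModRM 0xC0 = register-direct addressing, /0, rm = RAX.
    return bytes([0x48, 0xC7, 0xC0]) + struct.pack("<i", imm)

def mov_rax_imm64(imm: int) -> bytes:
    # REX.W + B8+rd io: mov r64, imm64 with a full 8-byte immediate.
    return bytes([0x48, 0xB8]) + struct.pack("<Q", imm)

XOR_EAX_EAX = bytes([0x31, 0xC0])        # xor eax, eax
XOR_RAX_RAX = bytes([0x48, 0x31, 0xC0])  # xor rax, rax: same opcode plus REX.W

sizes = {
    "mov eax, 42": len(mov_eax_imm32(42)),
    "mov rax, 42 (sign-extended imm32)": len(mov_rax_simm32(42)),
    "mov rax, 42 (imm64)": len(mov_rax_imm64(42)),
    "xor eax, eax": len(XOR_EAX_EAX),
    "xor rax, rax": len(XOR_RAX_RAX),
}
```

Because the 32-bit forms zero the upper half, the shortest encoding in each group leaves the same value in RAX as the longer ones for a small non-negative constant.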

Bo Persson
  • 7
    If that's correct, wouldn't it have made more sense for it to sign-extend rather than 0 extend? – Damien_The_Unbeliever Jun 24 '12 at 11:54
  • @Damien_The_Unbeliever Possibly. But zero-extension is extremely cheap. – Alexey Frunze Jun 24 '12 at 11:57
  • 2
    @Alex: And sign-extension isn't? Both can be done very cheaply in hardware. – jalf Jun 24 '12 at 11:59
  • 1
    In x32 assembly this is what the MOVZX instruction is for. So I don't believe this is the final answer. – Nubok Jun 24 '12 at 12:01
  • @Damien - Probably not. AMD collected lots of statistics from existing programs when designing the x64 instruction set. One goal was to keep it as compact as possible, to save on program size (considering cache and memory bandwidth). – Bo Persson Jun 24 '12 at 12:02
  • @jalf zero-extension is cheaper than sign-extension, not by much, but still. – Alexey Frunze Jun 24 '12 at 12:05
  • 2
    @Alex: no it's not. It would be a bit slower if done in software, sure, but in hardware, it'd, at worst, cost a few more transistors, which, on a chip the size and complexity of a modern CPU, that's really not an issue. – jalf Jun 24 '12 at 14:03
  • 19
    Sign extension is slower, even in hardware. Zero extension can be done in parallel with whatever computation produces the lower half, but sign extension can't be done until (at least the sign of) the lower half has been computed. – Jerry Coffin Jun 24 '12 at 14:26
  • 14
    Another related trick is to use `XOR EAX, EAX` because `XOR RAX, RAX` would need an REX prefix. – Neil Oct 02 '13 at 09:12
  • 4
    @Nubok: Sure, they could have added an encoding of movzx / movsx that takes an immediate argument. Most of the time it's *more* convenient to have the upper bits zeroed, so you can use a value as an array index (because all regs have to be the same size in an effective address: `[rsi + edx]` isn't allowed). Of course avoiding false dependencies / partial-register stalls (the other answer) is another major reason. – Peter Cordes Dec 18 '15 at 02:51
  • I've seen instructions like `mov (rax), rax` where the both the operands are `rax`. What does it mean to `mov` from `rax` to `rax`? I feel the parenthesis has special meaning here. – Nawaz Jan 19 '17 at 13:12
  • 2
    @Nawaz - `(rax)`, or `[rax]` depending on the assembler, is similar to pointer dereferencing, so it would load a value from the address in `rax` and replace the pointer with the value loaded. – Bo Persson Jan 19 '17 at 13:27
  • 2
    @Damien_The_Unbeliever: `mov r/m64, sign_extended_imm32` (7 bytes) is available with a REX.W prefix for the `mov r/m32, imm32` opcode (https://www.felixcloutier.com/x86/mov). The no-ModRM mov-to-reg encodings are either 5-byte `mov r32,imm32` or 10-byte `mov r64, imm64` without/with REX.W. Also, the last section of my answer on [MOVZX missing 32 bit register to 64 bit register](//stackoverflow.com/q/51387571) discusses the merits of x86-64 style implicit zero-extension vs. MIPS64 requiring correctly sign-extended inputs/outputs for 32-bit operand-size. – Peter Cordes Apr 28 '19 at 18:34
  • 5
    *and changing the rules there would also break old code.* Old code can't run in 64-bit mode anyway (e.g. 1-byte inc/dec are REX prefixes); this is irrelevant. The reason for *not* cleaning up the warts of x86 is fewer differences between long mode and compat/legacy modes, so fewer instructions have to decode differently depending on mode. AMD didn't know AMD64 was going to catch on, and was unfortunately very conservative so it would take fewer transistors to support. Long-term, it would have been fine if compilers and humans had to remember which things work differently in 64-bit mode. – Peter Cordes Apr 28 '19 at 18:37
  • 1
    @Neil C++ compilers in 64-bit mode do use 32-bit instructions which do zero extension, when possible, https://gcc.godbolt.org/z/plZCSm – Maxim Egorushkin Aug 17 '19 at 16:45
5

Without zero extension to 64 bits, an instruction reading from rax would have two dependencies for its rax operand: the instruction that writes to eax and the earlier instruction that writes to rax. This would result in a partial register stall, which starts to get tricky when there are three possible widths. It therefore helps that writes to both rax and eax replace the full register, meaning the 64-bit instruction set doesn't introduce any new layers of partial renaming.

mov rdx, 1
mov rax, 6
imul rax, rdx
mov rbx, rax
mov eax, 7      ; can finish executing before imul rax, rdx
mov rdx, rax    ; without zero extension, this would have to wait for both imul rax, rdx and mov eax, 7 before dispatch to the execution units, even though the higher-order bits are identical anyway

The only benefit of not zero extending is that the higher-order bits of rax are preserved: for instance, if rax originally contains 0xffffffffffffffff, the result would be 0xffffffff00000007. There is very little reason for the ISA to make this guarantee at such an expense, and zero extension is more likely to be what the code actually needs, so it saves the extra line of code mov rax, 0. Because the result is guaranteed to be zero-extended to 64 bits, compilers can work with this axiom in mind, and in mov rdx, rax, rax only has to wait for its single dependency, meaning the instruction can begin execution sooner and retire, freeing up execution units. Furthermore, it allows more efficient zeroing idioms such as xor eax, eax, which zeroes rax without requiring a REX byte.
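
The arithmetic in that example can be checked with a short Python sketch. `mov_r32_merge` is a hypothetical rule that x86-64 does not implement; it is included only to show what preserving the upper half would produce and that it needs the old register value as an input.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def mov_r32_zero_extend(old_rax: int, imm32: int) -> int:
    # Actual x86-64 rule: the result does not depend on old_rax at all.
    return imm32 & MASK32

def mov_r32_merge(old_rax: int, imm32: int) -> int:
    # Hypothetical rule: keep the old upper 32 bits, replace the lower 32.
    return (old_rax & MASK64 & ~MASK32) | (imm32 & MASK32)

old = 0xFFFFFFFFFFFFFFFF
merged = mov_r32_merge(old, 7)        # what "mov eax, 7" would give if it merged
zeroed = mov_r32_zero_extend(old, 7)  # what "mov eax, 7" gives in 64-bit mode
```

The `old_rax` parameter is unused in the zero-extending version, which is exactly why a later reader of rax has one producer to wait for instead of two.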

Lewis Kelsey
  • 1
    Partial-flags on Skylake at least does work by having separate inputs for CF vs. any of SPAZO. (So `cmovbe` is 2 uops but `cmovb` is 1). But no CPU that does any partial-register renaming does it the way you suggest. Instead they insert a merging uop if a partial reg is renamed separately from the full reg (i.e. is "dirty"). See [Why doesn't GCC use partial registers?](https://stackoverflow.com/q/41573502) and [How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent](https://stackoverflow.com/q/45660139) – Peter Cordes Mar 31 '20 at 19:24
  • P6-family CPUs either stalled for ~3 cycles to insert a merging uop (Core2 / Nehalem), or earlier P6-family (P-M, PIII, PII, PPro) just stall for (at least?) ~6 cycles. Perhaps that is like you suggested in 2, waiting for the full reg value to be available via writeback to the permanent/architectural register file. – Peter Cordes Mar 31 '20 at 19:29
  • @PeterCordes oh, I knew about merging uops at least for partial flag stalls. Makes sense, but I forgot how it works for a minute; it clicked once but I forgot to make notes – Lewis Kelsey Mar 31 '20 at 20:03
  • 1
    @PeterCordes microarchitecture.pdf: `This gives a delay of 5 - 6 clocks. The reason is that a temporary register has been assigned to AL to make it independent of AH. The execution unit has to wait until the write to AL has retired before it is possible to combine the value from AL with the value of the rest of EAX` I can't find an example of the 'merging uop' that would be used to solve this though, same for a partial flag stall – Lewis Kelsey Mar 31 '20 at 21:09
  • Right, early P6 just stalls until writeback. Core2 and Nehalem insert a merging uop after/before? only stalling the front-end for a shorter time. Sandybridge inserts merging uops without stalling. (But AH-merging has to issue in a cycle by itself, while AL merging can be part of a full group.) Haswell/SKL doesn't rename AL separately from RAX at all, so `mov al, [mem]` is a micro-fused load+ALU-merge, only renaming AH, and an AH-merging uop still issues alone. The partial-flag merging mechanisms in these CPUs vary, e.g. Core2/Nehalem still just stall for partial-flags, unlike partial-reg. – Peter Cordes Apr 01 '20 at 02:30
  • @PeterCordes have you speculated how the merging uop works? Is it a special uop that takes over the allocation stage or does it get inserted in the ROB and acts something like `mov eax,eax` but has a special opcode and is allocated differently by the RAT i.e. to wait on 2 ROB entries, and then the instruction that reads from eax only has dependency on that single ROB line maybe (but this would require the RAT keeping a more detailed allocation history, even though EAX and AL arent allocated differently) – Lewis Kelsey Apr 01 '20 at 14:01
  • I assume it has 2 inputs: the full-reg part and the AH part, and one output: a new value for the full reg. So it just reads both those parts whenever they're ready. The hardware that does partial-register renaming in the first place must know where the last write of the "full reg" part is, so that can just be one of the inputs. I'm not sure if SnB could ever get into a situation like AL and AH both renamed separately from EAX, so `not eax` would require a 3-input merge. I assume there's something special about merge uops that tells the RAT what to do, not just a `mov eax,eax` – Peter Cordes Apr 01 '20 at 14:09
  • It's not just zeroing idioms that can avoid REX prefixes: `mov eax, 7` is a 5-byte instruction while `mov rax, 7` is a 7-byte instruction. (extra REX and ModRM, although if AMD hadn't been going to do implicit zero-extension they might have done something else with the REX version of the mov reg, imm32 no-modrm opcodes and made them sign-extend a 32-bit immediate, and put 64-bit immediates on some other opcode. Possibly one with a modrm.) – Peter Cordes Mar 23 '21 at 04:18
4

From a hardware perspective, the ability to update half a register has always been somewhat expensive, but on the original 8088, it was useful to allow hand-written assembly code to treat the 8088 as having either two non-stack-related 16-bit registers and eight 8-bit registers, six non-stack-related 16-bit registers and zero 8-bit registers, or other intermediate combinations of 16-bit and 8-bit registers. Such usefulness was worth the extra cost.

When the 80386 added 32-bit registers, no facilities were provided to access just the top half of a register, but an instruction like ROR ESI,16 would be fast enough that there could still be value in being able to hold two 16-bit values in ESI and switch between them.

With the migration to x64 architecture, the increased register set and other architectural enhancements reduced the need for programmers to squeeze the maximum amount of information into each register. Further, register renaming increased the cost of doing partial register updates. If code were to do something like:

    mov rax,[whatever]
    mov [something],rax
    mov rax,[somethingElse]
    mov [yetAnother],rax

register renaming and related logic would make it possible to have the CPU record the fact that the value loaded from [whatever] will need to be written to [something], and then, so long as the last two addresses are different, allow the load of [somethingElse] and store to [yetAnother] to be processed without having to wait for the data to actually be read from [whatever]. If the third instruction were mov eax,[somethingElse], however, and it were specified as leaving the upper bits unaffected, the fourth instruction couldn't store RAX until the first load was completed, and even allowing the load of EAX to occur would be difficult, since the processor would have to keep track of the fact that while the lower half was available, the upper half wasn't.
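
The dependency chain described above can be traced with a small Python sketch. It is a toy model, not a description of any real renamer: `register_deps`, the tuple layout, and the `merge_32bit_writes` switch are all hypothetical, and only register dependencies are tracked (memory operands are ignored).

```python
def register_deps(program, merge_32bit_writes):
    """For each instruction, return the set of earlier instruction indices
    whose results it must wait for (directly or transitively) via registers."""
    last_writer = {}  # register name -> index of the instruction that last wrote it
    deps = []
    for idx, (dest, width, sources) in enumerate(program):
        inputs = list(sources)
        # A write that preserves upper bits must also read the old register value.
        preserves_upper = width < 32 or (width == 32 and merge_32bit_writes)
        if dest is not None and preserves_upper:
            inputs.append(dest)
        needed = set()
        for reg in inputs:
            if reg in last_writer:
                producer = last_writer[reg]
                needed |= {producer} | deps[producer]
        deps.append(needed)
        if dest is not None:
            last_writer[dest] = idx
    return deps

# The sequence above with the third instruction narrowed to a 32-bit load.
# Each entry is (destination register, operand width, source registers).
program = [
    ("rax", 64, []),      # 0: mov rax, [whatever]
    (None, 64, ["rax"]),  # 1: mov [something], rax
    ("rax", 32, []),      # 2: mov eax, [somethingElse]
    (None, 64, ["rax"]),  # 3: mov [yetAnother], rax
]

zero_extend = register_deps(program, merge_32bit_writes=False)
merging = register_deps(program, merge_32bit_writes=True)
```

With zero extension, instruction 3 waits only on instruction 2. Under the hypothetical merging rule, it also waits on instruction 0, the first load, which is the false dependency the paragraph describes.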

supercat
  • 1
    Zeroing upper bits implicitly also makes 5-byte `mov eax, 1` (opcode + imm32) work as a way to set the full 64-bit register, instead of needing 7-byte `mov rax, sign_extended_imm32` (REX + opcode + modrm + imm32) or 10-byte `mov rax, imm64` (rex + opcode + imm64). And many other cases where zero-extending for free is useful, e.g. when using an unsigned 32-bit integer as an array index (part of an addressing mode), or a signed integer that's known to be non-negative. – Peter Cordes Apr 27 '21 at 01:39
  • So even apart from the false-dependency performance problems, it's more frequent that you want to clear high garbage than to merge with something. x86-64 *could* have had a movzx r64, r/m32 that you'd have to use every time you want that, but that would be worse. Especially if they want it to still be efficient to work with 32-bit integers like normal C type models (32-bit `int`, 64-bit pointers). Related: [MOVZX missing 32 bit register to 64 bit register](https://stackoverflow.com/q/51387571) - some ISAs like MIPS64 have made different choices, like keeping narrow values sign-extended. – Peter Cordes Apr 27 '21 at 01:43
  • 1
    @PeterCordes: A lot of other answers mentioned register renaming, but I thought that people who weren't already familiar with the concept could benefit from a more complete example. From an hardware-complexity or instruction-set-usability standpoint, I don't think it would have been difficult to have a prefix that would facilitate e.g. "add rax,signed byte[whatever]" or "add rsi,unsigned word[whatever]", and the effect of instruction size on performance has, for most purposes, diminished to almost nothing. The real issue is that tracking the additional dependencies is expensive. BTW... – supercat Apr 27 '21 at 14:58
  • 2
    ...I've sometimes wondered whether it would make sense to have a "universal" ABI which uses modified symbol names for entry points based upon expected calling convention. If one had an entry point for use only when all registers used for passing smaller-than-64-bit arguments were known to be extended appropriately for their type, then a compiler could use that entry point in cases where it knew that all arguments registers were already set up suitably, and an entry point that would zero- or sign-extend values as needed for use when the caller couldn't guarantee that. – supercat Apr 27 '21 at 15:03
  • Yup, this is a nice clear and specific example of the false-dependency problem that implicit extension avoids. Already upvoted. And yeah, they probably could have repurposed some of the other removed opcodes (like AAA / AAM / etc.) as a 64-bit-mode source-size / signedness override if they wanted to extend everything to 64-bit. (But that would have made instructions like `imul` slower on K8 (64-bit multiply wasn't as fast as 32), unless that also set the operand-size and truncated / extended the result from 32-bit to fill a reg.) But comments here aren't the place to discuss further :/ – Peter Cordes Apr 28 '21 at 01:10