In x86_64, does a 32-bit cmov clear the top bits if the condition is false?

Question

In 64-bit mode on x86, most 32-bit arithmetic operations clear the top 32 bits of the destination register. What if the arithmetic operation is a "cmov" instruction, and the condition is false? (This case does not seem to be covered in the reference manuals I've looked at).

Peter Cordes · Accepted Answer · 2021-03-03T02:21:35.533

It always zero-extends into the destination, like all instructions that write a 32-bit register.

Think of CMOV as always writing its destination: it's an ALU select operation (3 inputs: 2 integer operands and flags, 1 output).

It's not like ARM 32-bit mode predicated instructions that truly act like a NOP when the condition is false.
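One way to check this yourself is a small C helper with GNU inline asm. This is only a sketch, assuming an x86-64 target and GCC or Clang; the inline asm uses the default AT&T operand order (destination last), and the name `cmov32_false` is made up for illustration.

```c
#include <stdint.h>

/* Executes a 32-bit cmovne whose condition is guaranteed false,
   then returns the full 64-bit destination register. */
static uint64_t cmov32_false(uint64_t dst, uint32_t src)
{
    __asm__ ("cmp %k0, %k0\n\t"   /* compare dst with itself: ZF=1, so "ne" is false */
             "cmovne %1, %k0"     /* 32-bit cmov: src is not selected, but the
                                     destination is still written (zero-extended) */
             : "+r"(dst)          /* %k0 = 32-bit name of the register holding dst */
             : "r"(src)
             : "cc");
    return dst;
}
```

Calling `cmov32_false(0xFFFFFFFF12345678, 0xAAAAAAAA)` should return `0x12345678`: the low half is kept because the condition was false, and the upper 32 bits are zeroed anyway.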

(For the same reason, cmovcc reg, [mem] always loads the memory operand, even if the condition is false, and doesn't do fault-suppression on a bad address. Again, it's not the move itself that's conditional, it's moving the result of a conditional-select operation. AArch64 picked a better name for their equivalent of the same instruction, csel.)
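Since the load itself can't be made conditional, the usual workaround is to select an always-valid address first and then load unconditionally. A minimal C sketch of that idea follows; `load_or` is a hypothetical helper, and whether a compiler turns the pointer select into a cmov or a branch is up to the compiler.

```c
#include <stddef.h>

/* Returns *p when p is non-NULL, otherwise fallback.
   The possibly-bad address is never dereferenced. */
static int load_or(const int *p, int fallback)
{
    const int *safe = p ? p : &fallback;  /* select between two addresses */
    return *safe;                         /* load from an address that is always valid */
}
```

The select operates on register values (the two pointers), so it is a good fit for cmov, and the single load that follows can never fault on the rejected address.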

There is one case where a 32-bit destination may not be zero-extended: bsr and bsf r32, r/m32 leave the destination unmodified when the source is zero. (This is only documented by AMD: "If the second operand contains 0, the instruction sets ZF to 1 and does not change the contents of the destination register." Intel CPUs implement it as well.) In practice, on Intel CPUs at least, this includes leaving the upper bits unmodified after an instruction like bsf eax, ecx. I haven't tested AMD.

(This is why BSF and BSR have "false" dependencies on the destination: implementing this behaviour branchlessly requires a true dependency. For LZCNT/TZCNT/POPCNT on Intel, which run on the same execution unit but always overwrite the destination, it really is only a false output dependency.)
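A sketch for experimenting with the BSF behaviour, again assuming x86-64 with GCC or Clang inline asm (AT&T operand order). Both helper names are hypothetical. `bsf32_into` exposes the whole 64-bit destination so the source=0 case can be inspected on a given CPU; `bsf_or` shows a variant that does not rely on the destination staying unmodified.

```c
#include <stdint.h>

/* Runs a 32-bit bsf into a register preloaded with dst and returns all 64 bits.
   With src == 0 the result is documented by AMD as unchanged but is formally
   undefined on Intel, so treat that case as an experiment, not a guarantee. */
static uint64_t bsf32_into(uint64_t dst, uint32_t src)
{
    __asm__ ("bsf %1, %k0"        /* AT&T order: bsf src, dst (32-bit operand size) */
             : "+r"(dst)          /* "+r": the old value is treated as an input too */
             : "r"(src)
             : "cc");
    return dst;
}

/* Index of the lowest set bit of src, or fallback when src == 0.
   Relies only on ZF, which both vendors document. */
static uint32_t bsf_or(uint32_t src, uint32_t fallback)
{
    uint32_t idx;
    __asm__ ("bsf %1, %0\n\t"     /* sets ZF=1 when src == 0 */
             "cmovz %2, %0"       /* src == 0: replace idx with fallback */
             : "=&r"(idx)         /* early-clobber: written before fallback is read */
             : "r"(src), "r"(fallback)
             : "cc");
    return idx;
}
```

For a non-zero source the result is fully defined: `bsf32_into(0xFFFFFFFF00000063, 8)` returns `3`, zero-extended like any other 32-bit write. `bsf_or` costs one extra cmov in exchange for staying within documented behaviour.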

Wikipedia claims there's some kind of difference between Intel and AMD for the upper bits after bsf r32, r/m32. They seem to be saying that Intel (or maybe AMD; the phrasing is somewhat ambiguous) leaves the upper bits undefined for the source=0 case, instead of unmodified.

It seems always unmodified in my testing on Sandybridge-family and Core 2, but I don't have access to a P4 Nocona / Prescott, which was the first-gen IA-32e microarchitecture.

The Wikipedia editor who wrote that may just be misinterpreting Intel's documentation which says the whole destination register is "undefined" in this case. (But it's normal for Intel to, in silicon, go beyond what they guarantee on paper, so existing software they care about, e.g. Windows, keeps working). IDK if there's another source for that claim, so I guess [citation-needed] would truly be appropriate here.

edited Mar 03 '21 at 02:21

answered Mar 01 '21 at 04:00

Peter Cordes


@Joshua: You can of course use `cmovz rax, rdx` if you want a 64-bit cmov. You'd have to have already zero-extended something into RDX (e.g. by writing EDX) if it wasn't naturally a 64-bit value. I don't understand the "yuck" reaction; it seems pretty normal. Or are you talking about the faulting memory source, that it's not a predicated *load*? – Peter Cordes Mar 01 '21 at 05:00
Yeah I'm talking about the faulting memory source. Makes the opcode a lot less useful. – Joshua Mar 01 '21 at 05:11
@Joshua: Then yeah, agreed, a predicated load would have been nice. As my linked answer points out, x86 is pretty light on those until AVX-512. – Peter Cordes Mar 01 '21 at 05:14
You know, I'd love an instruction that just returns 0 if it would have faulted. – Joshua Mar 01 '21 at 05:15
@Joshua: You'd probably want a try-load instruction to set FLAGS instead of signalling failure in-band in the register value. (Like I wrote in [Is it possible to “abort” when loading a register from memory rather the triggering a page fault?](https://stackoverflow.com/q/52221575) last time *I* was lamenting x86's lack of a try_load instruction. Itanium had something like it, with speculative loads that could produce a Not-A-Thing value; internal metadata that made a later instruction fault if it tried to read the NaT register value.) – Peter Cordes Mar 01 '21 at 05:39
Other questionable case is the exchange instruction such as `XCHG EAX,ECX`. Does it *write* to both registers or it just swaps their names? I had to try this in x64dbg and yes, **XCHG zeroes upper halves** of r64, too. – vitsoft Mar 01 '21 at 07:43
@vitsoft: yup, that's why `0x90` is not an encoding of `xchg eax,eax` in 64-bit mode; it had to get documented specifically as a NOP (https://www.felixcloutier.com/x86/nop). The footnote / PS in the description of https://www.felixcloutier.com/x86/xchg mentions this. You'll note that an x86-64 assembler will encode `xchg eax,eax` using the ModRM encoding, not the short-form it can use for `xchg eax, ecx` or any other register. – Peter Cordes Mar 01 '21 at 07:50
@vitsoft: A more interesting case might be `cmpxchg ecx, ebx` - it conditionally writes either EAX or ECX, but not both. https://www.felixcloutier.com/x86/cmpxchg. I assume register upper-zeroing follows the documented behaviour of what gets written; it is 5 uops on SnB-family and Zen (for either a register or memory destination.) – Peter Cordes Mar 01 '21 at 07:53
Obviously not robust or guaranteed, but I don't see the upper bits of the destination of `bsfq` being modified with a 0 src. – Noah Mar 02 '21 at 18:05
@Noah: I just tried it myself on a Core 2 in case it changed in SnB, but same unmodified full result. I only tried single-stepping, not with OoO exec forwarding between in-flight instructions. Or maybe I'm misinterpreting Wiki's wording, or maybe it was only P4 Prescott / Nocona where it was anything other than full-reg preserved? – Peter Cordes Mar 02 '21 at 19:05
@PeterCordes I think the wiki is misleadingly incorrect. According to Intel, the entire destination is undefined with a 32- or 64-bit `bsf`. In practice, though, neither seems to modify any of the 64 bits of the destination. Tested with a bit of OoO exec in flight and I still see the same result. – Noah Mar 03 '21 at 01:01
@Noah: Or maybe the phrasing sucks and it's AMD that leaves upper bits undefined? That would be weird because they document the behaviour, and as you say Intel says undefined for the whole value. So yeah it's also very possible someone just misinterpreted written docs, and never tested on real CPUs. But we also can't rule out P4 Nocona being totally different from P6 or SnB families. I'll edit this answer to be more skeptical. – Peter Cordes Mar 03 '21 at 02:05
@Noah: I updated wikipedia as well, with something that's probably too verbose for that list, but which also mentions the documented source=0 behaviour (because apparently [some people, e.g. this blog](http://msinilo.pl/blog2/post/p1260/) confused that wikipedia "difference" with the normal on-zero behaviour). If anyone has further info to contribute, edit there and please let me know. – Peter Cordes Mar 03 '21 at 07:22
A few things regarding the wikipedia edit: 1) I checked on ICL, if your Skylake citation was about my tests. 2) "instructions act differently than AMD64" is misleading because the paragraph essentially is saying: *they are documented differently but act the same*. Maybe "documented differently" or "documented to act differently". 3) Intel's documentation for `bsf` says that the **entire** destination register is undefined, versus AMD which says the destination is unmodified. I don't see why the "difference" between the AMD and Intel implementations is isolated to a discussion about just the upper 32 bits. – Noah Mar 03 '21 at 17:29
@Noah: Oh good point, there is a real AMD/Intel difference in documentation. The more I think about it, the more I suspect that the "upper 32" apparent nonsense stemmed from a misreading of the docs. (Unless Intel's early manuals said something very different back then.) Especially because the wording "undefined" sounds like it comes from docs, not experimental testing. The Core 2 and SKL testing was based on the 2 CPUs I tried it on myself. If I (or you) edit wikip again, we can bump it to Ice Lake; I'd forgotten what CPU you had. – Peter Cordes Mar 04 '21 at 02:51
@PeterCordes do you think this is a safe enough "feature" to take advantage of in glibc? – Noah Mar 21 '21 at 01:16
@Noah: Probably not. I don't *think* Intel's likely to change it in future CPUs, but I really can't rule it out, and something like glibc itself breaking on future HW is too critical a possible problem. The cost to avoid depending on it is usually only a couple extra uops, and if we're talking about the cleanup of `memchr` or whatever it's not even inside the loop. (Although for strings/buffers <= 16 bytes, an extra cmov would be a significant part of total throughput and latency.) If the saving was a lot larger, it would be more worth considering, but still prob. no for glibc. – Peter Cordes Mar 21 '21 at 01:50
@Noah: I'd have no problem with a game or other desktop software shipping with code that depends on that BSF or BSR feature. And certainly something that was supposed to run on your own cloud servers, so if it breaks you can fix it, and you somewhat control the hardware. But something like glibc or GCC-output that needs to work for a GNU/Linux distro to even boot to a usable state, no, gotta stick to what we can convince vendors to actually document. (Except via build options to let tweakers take advantage of HW features like that and de-facto atomic 32-byte AVX stores.) – Peter Cordes Mar 21 '21 at 01:56
