
To clear all bits you often see an exclusive or, as in XOR eax, eax. Is there such a trick for the opposite, too?

All I can think of is to invert the zeroes with an extra instruction.
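For example, the two-instruction version I have in mind (NASM syntax):

    xor eax, eax    ; eax = 0
    not eax         ; invert the zeroes: eax = 0xFFFFFFFF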

Joe Zitzelberger
Pascal de Kloe
  • For example, `or eax, -1` – Jester Jul 14 '17 at 14:28
  • That requires memory to be moved, no? – Pascal de Kloe Jul 14 '17 at 14:35
  • The -1 is encoded into the instruction – Michael Petch Jul 14 '17 at 14:42
  • Thanks guys. That's just perfect! =) – Pascal de Kloe Jul 14 '17 at 14:56
  • `or eax, -1` has literally zero advantages over `mov eax, -1`, and it probably introduces a false dependency on the previous content of `eax`. `xor eax, eax` is convenient because it has a very compact encoding (and it's actually a special case in the register renaming circuitry). – Matteo Italia Jul 14 '17 at 14:58
  • @MatteoItalia `or eax, -1` is two bytes shorter than `mov eax, -1`, but yah you probably don't want to be using the former in modern code. – Ross Ridge Jul 14 '17 at 15:04
  • Oops, that's what happens when I only check encoded length for my 16 bit code golfs (where `mov` and `or` are just as big). Yeah, the `or` in 32 bit mode is smaller, I stand corrected. – Matteo Italia Jul 14 '17 at 15:29
  • Given the x86-64 tag, I'm curious about the shortest byte sequence to set all bits of a 64-bit register... – Brett Hale Jul 14 '17 at 16:29
  • The immediate, being encoded in the instruction, can be very costly in terms of size. I am seeing a xor eax,eax plus a dec eax being 4 bytes total for the two, or eax,-1 as 3 bytes, and mov eax,-1 as 5 bytes, so unless I messed up this assembly, the or eax,-1 is the cheapest, with an encoding of 0x83,0xC8,0xFF. – old_timer Jul 14 '17 at 17:16
  • @BrettHale `or rax, -1` is `48 83 C8 FF`. – Jester Jul 14 '17 at 18:43
  • @Jester - I always expect crazier encodings for any constant in 64-bit ops! – Brett Hale Jul 14 '17 at 18:59
  • `push -1; pop rax` is just 3 bytes though `6A FF 58` – Jester Jul 14 '17 at 19:24
  • @MatteoItalia: I was just going to comment to confirm your guess that `or eax, -1` is not special-cased and has a false-dep. But then it turned into an answer :P – Peter Cordes Jul 15 '17 at 01:05
  • @PeterCordes: for some reason, I'm not surprised – Matteo Italia Jul 15 '17 at 02:52
  • `sbb reg, reg` with known CF=1? – Aki Suihkonen Jul 15 '17 at 05:38
  • If we start allowing pre-conditions, we'll be here all day. Also, that's only dependency-breaking on AMD Bulldozer-family, and is 2 uops on Intel pre-Broadwell. `sbb eax,eax` is a nice alternative to `setcc al`/`neg eax` or similar, especially since it doesn't have any annoying partial-register annoyances, but I don't think it's a good option if you just want a constant. – Peter Cordes Jul 15 '17 at 18:47

2 Answers

For most architectures with fixed-width instructions, the answer will probably be a boring one-instruction mov of a sign-extended or inverted immediate, or a mov lo/high pair. e.g. on ARM, mvn r0, #0 (move-not). See gcc asm output for x86, ARM, ARM64, and MIPS, on the Godbolt compiler explorer. IDK anything about zseries asm or machine code.
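A quick sketch of those single-instruction idioms on a few fixed-width ISAs (GNU assembler syntax; one line per ISA, not one assemble-able file; register choices are arbitrary):

    mvn   r0, #0             @ ARM: r0 = ~0 = 0xFFFFFFFF (move-NOT of immediate)
    mov   x0, #-1            // AArch64: assembles to movn x0, #0
    addiu $t0, $zero, -1     # MIPS: $t0 = sign-extended -1 = all-ones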

In ARM, eor r0,r0,r0 is significantly worse than a mov-immediate. It depends on the old value, with no special-case handling. Memory dependency-ordering rules prevent an ARM uarch from special-casing it even if they wanted to. Same goes for most other RISC ISAs with weakly-ordered memory but that don't require barriers for memory_order_consume (in C++11 terminology).


x86 xor-zeroing is special because of its variable-length instruction set. Historically, 8086 xor ax,ax was fast directly because it was small. Since the idiom became widely used (and zeroing is much more common than all-ones), CPU designers gave it special support, and now xor eax,eax is faster than mov eax,0 on Intel Sandybridge-family and some other CPUs, even without considering direct and indirect code-size effects. See What is the best way to set a register to zero in x86 assembly: xor, mov or and? for as many micro-architectural benefits as I've been able to dig up.
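For a sense of the code-size difference that started it all (NASM syntax; encodings in the comments):

    xor eax, eax     ; 31 C0            = 2 bytes, recognized zeroing idiom
    mov eax, 0       ; B8 00 00 00 00   = 5 bytes, no special-casing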

If x86 had a fixed-width instruction-set, I wonder if mov reg, 0 would have gotten as much special treatment as xor-zeroing has? Perhaps, because dependency-breaking before writing the low8 or low16 is important.


The standard options for best performance (assembled encodings are sketched after this list):

  • mov eax, -1: 5 bytes, using the mov r32, imm32 encoding. (There is no sign-extending mov r32, imm8, unfortunately). Excellent performance on all CPUs. 6 bytes for r8d-r15d (REX prefix).
  • mov rax, -1: 7 bytes, using the mov r/m64, sign-extended-imm32 encoding. (Not the REX.W=1 version of the eax version. That would be 10-byte mov r64, imm64). Excellent performance on all CPUs.
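A sketch of how those assemble (NASM syntax; `strict qword` forces the 10-byte form that NASM would otherwise optimize to the 7-byte one):

    mov eax, -1               ; B8 FF FF FF FF       = 5 bytes
    mov rax, -1               ; 48 C7 C0 FF FF FF FF = 7 bytes (sign-extended imm32)
    mov rax, strict qword -1  ; 48 B8 FF FF FF FF FF FF FF FF = 10 bytes; avoid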

The weird options that save some code-size, usually at the expense of performance; assembled encodings for these are sketched after the list.
(See also Tips for golfing in x86/x64 machine code)

  • xor eax,eax/dec rax (or not rax): 5 bytes (4 for 32-bit eax, or 3 bytes in 32-bit mode where 1-byte dec eax exists; x86-64 repurposed those 1-byte encodings as REX prefixes). Downside: two uops for the front-end. Still only one unfused-domain uop for the scheduler/execution units on recent Intel where xor-zeroing is handled in the front-end. mov-immediate always needs an execution unit. (But integer ALU throughput is rarely a bottleneck for instructions that can use any port; the extra front-end pressure is the problem)

  • xor ecx,ecx / lea eax, [rcx-1] 5 bytes total for 2 constants (6 bytes for rax): leaves a separate zeroed register. If you already want a zeroed register, there is almost no downside to this. lea can run on fewer ports than mov r,i on most CPUs, but since this is the start of a new dependency chain, the CPU can run it in any spare execution-port cycle after it issues.

    The same trick works for any two nearby constants, if you do the first one with mov reg, imm32 (or push imm8/pop) and the second with lea r32, [base + disp8]. disp8 has a range of -128 to +127, otherwise you need a disp32.

    After a loop you may have a known-zero register, but LEA relative to it creates a false dependency, while mov-immediate wouldn't. Branch prediction + speculative exec can break control dependencies, although loop branches often mispredict their last iteration unless the trip count is low.

  • or eax, -1: 3 bytes (4 for rax), using the or r/m32, sign-extended-imm8 encoding. Downside: false dependency on the old value of the register.

  • push -1 / pop rax: 3 bytes. Slow but small. Recommended only for exploits / code-golf. Works for any sign-extended-imm8, unlike most of the others.

    Downsides:

    • uses store and load execution units, not ALU. (Possibly a throughput advantage in rare cases on AMD Bulldozer-family where there are only two integer execution pipes, but decode/issue/retire throughput is higher than that. But don't try it without testing.)
    • store/reload latency means rax won't be ready for ~5 cycles after this executes on Skylake, for example.
    • (Intel): puts the stack-engine into rsp-modified mode, so the next time you read rsp directly it will take a stack-sync uop. (e.g. for add rsp, 28, or for mov eax, [rsp+8]).
    • The store could miss in cache, triggering extra memory traffic. (Possible if you haven't touched the stack inside a long loop).
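A sketch of those size-optimized variants assembled (NASM, 64-bit mode; byte counts from the encodings in the comments):

    xor eax, eax          ; 31 C0           } 4 bytes total, 2 front-end uops
    dec eax               ; FF C8           }   (dec rax: 48 FF C8, 5 total)

    xor ecx, ecx          ; 31 C9           } 5 bytes for two constants:
    lea eax, [rcx-1]      ; 8D 41 FF        }   ecx = 0 and eax = -1

    mov ecx, 0x12345678   ; B9 78 56 34 12  } the nearby-constants trick:
    lea eax, [rcx+8]      ; 8D 41 08        }   second constant costs only 3 bytes

    or  eax, -1           ; 83 C8 FF        3 bytes, false dep on old eax

    push -1               ; 6A FF           } 3 bytes; store/reload and
    pop  rax              ; 58              }   stack-engine costs as above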

Vector regs are different

Setting vector registers to all-ones with pcmpeqd xmm0,xmm0 is special-cased on most CPUs as dependency-breaking (not Silvermont/KNL), but still needs an execution unit to actually write the ones. pcmpeqb/w/d/q all work, but q is slower on some CPUs and has longer machine code.

For AVX2, the ymm equivalent vpcmpeqd ymm0, ymm0, ymm0 is also the best choice. (Or b/w are equivalent, but vpcmpeqq has longer machine code.)
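A minimal sketch of both (NASM syntax):

    pcmpeqd  xmm0, xmm0           ; SSE2: xmm0 = all-ones (dep-breaking on most CPUs)
    vpcmpeqd ymm0, ymm0, ymm0     ; AVX2: ymm0 = all-ones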

For AVX without AVX2 the choice is less clear: there is no one obvious best approach. Compilers use various strategies: gcc prefers to load a 32-byte constant with vmovdqa, while older clang uses 128-bit vpcmpeqd followed by a cross-lane vinsertf128 to fill the high half. Newer clang uses vxorps to zero a register then vcmptrueps to fill it with ones. This is the moral equivalent of the vpcmpeqd approach, but the vxorps is needed to break the dependency on the prior version of the register and vcmptrueps has a latency of 3. It makes a reasonable default choice.
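Sketched in asm, those three strategies look something like this (NASM syntax; `all_ones` is an assumed 32-byte constant in .rodata):

    ; gcc-style: plain 32-byte load of a constant
    vmovdqa ymm0, [rel all_ones]

    ; older-clang-style: 128-bit compare, then widen across lanes
    vpcmpeqd    xmm0, xmm0, xmm0
    vinsertf128 ymm0, ymm0, xmm0, 1

    ; newer-clang-style: dep-break, then compare-true (3-cycle latency)
    vxorps     ymm0, ymm0, ymm0
    vcmptrueps ymm0, ymm0, ymm0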

Doing a vbroadcastss from a 32-bit value is probably strictly better than the load approach, but it is hard to get compilers to generate this.

The best approach probably depends on the surrounding code.

Fastest way to set __m256 value to all ONE bits


AVX512 compares are only available with a mask register (like k0) as the destination, so compilers are currently using vpternlogd zmm0,zmm0,zmm0, 0xff as the 512b all-ones idiom. (0xff makes every element of the 3-input truth-table a 1). This is not special-cased as dependency-breaking on KNL or SKL, but it has 2-per-clock throughput on Skylake-AVX512. This beats using a narrower dependency-breaking AVX all-ones and broadcasting or shuffling it.
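i.e.:

    vpternlogd zmm0, zmm0, zmm0, 0xff   ; truth table 0xFF: result bit = 1
                                        ; for every input combination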

If you need to re-generate all-ones inside a loop, obviously the most efficient way is to use a vmov* to copy an all-ones register. This doesn't even use an execution unit on modern CPUs (but still takes front-end issue bandwidth). But if you're out of vector registers, loading a constant or [v]pcmpeq[b/w/d] are good choices.

For AVX512, it's worth trying VPMOVM2D zmm0, k0 or maybe VPBROADCASTD zmm0, eax. Each has only 1c throughput, but they should break dependencies on the old value of zmm0 (unlike vpternlogd). They require a mask or integer register which you initialized outside the loop with kxnorw k1,k0,k0 or mov eax, -1.
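A sketch of that pattern (NASM syntax; the loop body and the trip count in ecx are hypothetical placeholders):

    kxnorw  k1, k0, k0        ; hoisted out of the loop: k1 = all-ones (16 bits)
.loop:
    vpmovm2d zmm0, k1         ; AVX512DQ: rematerialize all-ones in zmm0,
                              ; breaking the dep on zmm0's old value
    ; ... loop body that consumes/clobbers zmm0 ...
    dec ecx
    jnz .loop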


For AVX512 mask registers, kxnorw k1,k0,k0 works, but it's not dependency-breaking on current CPUs. Intel's optimization manual suggests using it for generating an all-ones before a gather instruction, but recommends avoiding using the same input register as the output. This avoids making an otherwise-independent gather dependent on a previous one in a loop. Since k0 is often unused, it's usually a good choice to read from.
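i.e., something along these lines (the gather operands are hypothetical):

    kxnorw k1, k0, k0                    ; k1 = all-ones; reads k0, not k1, so this
                                         ; gather doesn't depend on the previous one
    vgatherdps zmm1{k1}, [rdi + zmm2*4]  ; gather consumes (and zeroes) k1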

I think vpcmpeqd k1, zmm0,zmm0 would work, but it's probably not special-cased as a k1=1 idiom with no dependency on zmm0. (To set all 64 bits instead of just the low 16, use AVX512BW vpcmpeqb or kxnorq)

On Skylake-AVX512, k instructions that operate on mask registers only run on a single port, even simple ones like kandw. (Also note that Skylake-AVX512 won't run vector uops on port1 when there are any 512b operations in the pipe, so execution unit throughput can be a real bottleneck.)

There is no kmov k0, imm, only moves from integer or memory. Probably there are no k instructions where same,same is detected as special, so the hardware in the issue/rename stage doesn't look for k registers.

Sep Roland
Peter Cordes
  • What, there's such a thing as `k` registers and a full separate instruction set to work on them! Sooner or later I'll have to actually learn something about AVX-512... – Matteo Italia Jul 29 '17 at 08:09
  • I'm enjoying this read again half a year later. The `xor ecx,ecx / lea eax` idea fits many cases. – Pascal de Kloe Feb 10 '18 at 12:01
  • @PascaldeKloe: yeah, it's interesting because it's one of the few that doesn't suck for performance, while being as short as `push imm8` / `pop` if you already have a register with any known value. [Very useful for code-golf, too](https://codegolf.stackexchange.com/users/30206/peter-cordes). – Peter Cordes Feb 10 '18 at 12:23
  • I just changed a bunch of code from `add(x, 1)` to `sub(x, -1)`. The ultimate premature optimization. – Mysticial Jun 21 '19 at 03:23
  • I think the AVX version of this is incomplete, as it's not obvious what the AVX version of `pcmpeqd` is. Compilers seem to generate loads from memory (gcc, not sure why broadcast not used?) or `vcmptrueps` (clang). – BeeOnRope Aug 24 '19 at 06:27
  • @BeeOnRope: I wasn't really intending this to be a reference answer that covered all cases when I wrote it. I did link to an AVX/AVX2 answer that mentions what compilers do for the AVX1 without AVX2 case. And yeah, gcc is terrible in general at using broadcast-loads to shrink constants, I don't think it ever does it. (Maybe it doesn't have a mechanism to avoid duplication if one function can hoist a constant to a register while another uses it as a memory source. So they prioritize keeping constants simple? Or just nobody's written a constant-shrinking optimizer pass.) – Peter Cordes Aug 24 '19 at 06:36
  • Yeah but that question also links back here. Maybe a minimal change could be to remove AVX from "The AVX/AVX2 version of this is also the best choice there.", although that kind of leaves AVX users hanging. – BeeOnRope Aug 24 '19 at 06:43
  • I realize now that `vcmptrueps` is actually just the FP version of `cmpeqd same, same, same`... – BeeOnRope Aug 24 '19 at 07:39
  • @BeeOnRope: It has a false dependency on the input (but not the non-destructive destination) so it's half way between `vpcmpeqd` and `vpternlogd` as far as being potentially problematic for materializing an all-ones vector. – Peter Cordes Aug 24 '19 at 07:57
  • @PeterCordes - yeah but in practice you usually always use `same, same, same` so the dependency ends up being the same in either case? I suppose if you had a really big brain you could use `reg1, reg2, reg2` where `reg2` is a dontcare register you know is likely to be ready, and where `reg1` is more likely to have a dependency, could be slightly better than `reg1, reg1, reg1`. – BeeOnRope Aug 24 '19 at 08:00
  • @BeeOnRope: exactly. Most of the audience for this question is probably compiler or JIT *devs*, or hand-written code, so the fact that current compilers don't know to avoid a false dep doesn't matter. Another point is that `vmovaps` from an existing all-ones register is zero-latency and no back-end uops, so if you need to rematerialize a constant in a loop for some reason, that's cheaper than `vcmptrueps`. – Peter Cordes Aug 24 '19 at 08:06
  • I'm pretty sure even JIT'd code and hand-written code is also very rarely pulling the `reg1, reg2, reg2` trick. It's like a 4th order concern. Agreed in the rematerialization: that applies to almost every technique on the "ones" side of the fence, I think? – BeeOnRope Aug 24 '19 at 14:01
  • @BeeOnRope: yes, on CPUs with mov-elimination, because even `pcmpeqd` requires an ALU uop. But it runs on more ports than `vcmptrueps` on Skylake. The `reg1, reg2, reg2` trick is relevant for setting `k` registers with `kxnor` as well as AVX1 without AVX2. Otherwise not, so I'm not surprised it's rare. `vcmptrueps` has higher latency so a false dependency may be more problematic. Anyway, most JIT code is not super well optimized, and most people don't read this answer before writing hand-written code. But hopefully we can give useful guidance for the few that do :P – Peter Cordes Aug 24 '19 at 19:47
  • Agreed! /comment length – BeeOnRope Aug 24 '19 at 20:01
  • @PeterCordes A somewhat unrelated question but I have a pretty common case where I essentially have a branch on some register being 0 and after that branch need to make a register all ones. I generally use the `mov $-1, r` but could save could using `not r` after the branch i.e: `testl r0, r0; jnz L(do_something); // mask either or `. If the branch is predicted not taken could there potentially be a false dependency during speculative execution on `r0` when trying to make all 1s with `not`? (my fear hence the `mov`). – Noah Apr 27 '21 at 03:04
  • @Noah: Yes, branch prediction + speculative exec breaks data dependencies. Reintroducing one by using `not %reg` or `or $-1, %reg` instead of `mov $-1, %reg` would couple that code back into the dependency chain before the branch. If other things also couple anyway, then it may be fine, although remember that if there are multiple inputs, any single one of them could be ready early or late, and you ideally want to keep separate dep chains so as much work can be done on the ones that are ready before stalling. – Peter Cordes Apr 27 '21 at 03:09
  • re: " but it's probably not special-cased as a k0=1 idiom with no dependency on zmm0." are you saying `vpcmpeqd k0, zmm0,zmm0` is special cased and has no dependency on `zmm0`? – Noah Apr 30 '21 at 19:19
  • @Noah: No, I'm saying if there was a special case, that's what it would be, but that it's probably *not* special-cased at all. Now that you point it out, it's pretty clunky phrasing :/ – Peter Cordes May 01 '21 at 00:08
  • @PeterCordes got it. AFAICT the only special case is for [zeroing a mask register with `vpcmpgt`](https://uops.info/html-lat/TGL/VPCMPGTB_K_ZMM_ZMM-Measurements.html#lat2-%3E1_same_reg) – Noah May 01 '21 at 00:23
  • @PeterCordes Semi-unrelated question. Lets say you have length + bitvector of positions (say from memchr). If you get bitvector back from `vpmovmskb ymm, r0` with chance of match being in bounds / out of bounds do you A) compare `tzcnt r0, r0; cmpl r0, length; jle` or B) `bzhil length, r0, scratch; jz`. Assuming result of `tzcnt` will be needed if match in bounds and result of `bzhi` will never be needed. The former has less total work, but 3c latency on condition vs latter more work with less latency on condition. Curious on your though process to choose between the two. – Noah May 28 '21 at 13:46
  • @Noah: Hard to decide for a generic library function. Lower latency to feed a branch helps if you expect the branch to mispredict a significant amount of the time, irrelevant if it predicts well. (A couple extra cycles added to the cost of a mispredict, which is already about 10 to 15 cycles IIRC.) Also, AMD's tzcnt is 2 cycle latency (but 2 uops, prob. bit-reverse + lzcnt). Some BMI instructions on AMD are 2 uops vs. Intel's 1, but BZHI is single-uop on Zen. Still, with macro-fusion of the cmp/jle, I'd be inclined to use tzcnt ahead of compare, avoiding bzhi. – Peter Cordes May 28 '21 at 18:05
  • @PeterCordes are there any non-zero constants that a register can be set to more efficiently than `movl`? – Noah Sep 08 '21 at 22:39
  • @PeterCordes your answer makes it sound like only `vpcmpeqd` actually breaks deps but at least on my TGL `b/w/d/q` all do the trick. Is is only `vpcmpeqd` on older CPUs? – Noah Jun 26 '22 at 17:18
  • @Noah: I was hoping the previous paragraph talking about non-VEX `pcmpeqb/w/d/q` was sufficient, but I can see how it looked like I was saying that only `vpcmpeqd` was efficient for AVX2. Fixed. The `q` version takes an extra byte of machine code, though, so you don't want it. If you want to set XMM15, use `vpcmpeqd xmm15, xmm0,xmm0` to still allow a 2-byte VEX. – Peter Cordes Jun 26 '22 at 17:55
  • "I think `vpcmpeqd k1, zmm0,zmm0` would work, but it's probably not special-cased as a k0=1 idiom with no dependency on zmm0." The instruction you listed there appears to have a `k1` destination but the text reads `k0=1`. Is this in error? – ecm Jun 26 '22 at 18:28

Peter's already provided a perfect answer. I just wanna mention that it depends on the context, too.

I for one did a sar r64, 63 on a number I knew would be negative in a certain case, and if not, I didn't need an all-bits-set value. A sar has the advantage that it sets some interesting flags (although decoding 63, really?); otherwise I could've done a mov r64, -1, too. I guess it was the flags that let me do it anyway.
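So, assuming rax is known to be negative on the path where the mask is needed:

    sar rax, 63     ; broadcast the sign bit through all 64 bits:
                    ; rax = -1 if it was negative, else 0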

So bottom line: context. As you know, you usually delve into assembly language because you want to exploit the extra knowledge that you, but not the compiler, have. Maybe one of your registers whose value you don't need anymore has a 1 stored in it (a logical true); then just neg it. Maybe somewhere earlier in your program you did a loop; then (provided it is manageable) you can arrange your register usage so that a not rcx is all that's missing.
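For instance, assuming the register contents just described:

    neg rdx         ; rdx held 1 (a logical true): now rdx = -1
    not rcx         ; rcx held 0 (e.g. a counted-down loop counter): now rcx = -1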

Kai Burghardt
  • Do you mean `sar r64, 63`? You need an arithmetic, not logical, right shift to broadcast the sign bit to all bits. – Peter Cordes Jan 27 '19 at 20:45
  • Interesting, and same code size as `or r64, -1` (both REX + one-byte opcodes + ModRM + an imm8), so sure if you want the flag result then that's potentially a win, if you're not bottlenecked on shift-port throughput. And yeah, `not` or `neg` will save a byte vs. `or imm8`, while having the same "false" dependency on the old value. It's too bad x86-64 didn't use some of the freed-up opcodes from removing BCD instructions and `push seg_reg` for a `mov r/m32, sign-extended-imm8` opcode. That would give us 3-byte `mov eax, -1` and 4-byte `mov rax,-1` (vs. 5 and 7) – Peter Cordes Jan 27 '19 at 20:51
  • Yeah, of course `sar`, not `shr`. Duly noted. Thanks for pointing it out. I'm usually not too concerned about space though, but about speed. – Kai Burghardt Jan 27 '19 at 21:43
  • If you're optimizing for speed on a modern out-of-order x86-64, why would you ever use `neg` or `not` instead of `mov r64, -1`? Did you find that using a shorter insn helped avoid a front-end bottleneck? If you also need to set something in FLAGS, then sure, but NOT doesn't affect flags. And you mentioned `loop`, which is slow on everything except AMD Bulldozer-family and Ryzen, so you wouldn't use that if optimizing for speed unless your code would only run on recent AMD. [Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?](//stackoverflow.com/a/52980461) – Peter Cordes Jan 27 '19 at 21:49
  • I also don't like my code being readable. `mov r64, -1` is too obvious. I generally write my code for the future, for future processors. Using more specialized instructions gives the CPU more hints then, it doesn't have to untangle everything (although they're really good at that today). – Kai Burghardt Jan 28 '19 at 01:05
  • CPUs aren't compilers. They don't optimize between instructions except in the special case of macro-fusion between compare-and-branch, and on Sandybridge-family dec/inc/add/sub/and + JCC. What matters is whether or not an instruction has an input dependency on an old value. (And for code as a whole, how many uops for the front-end, and stuff like that.) "more information" is total nonsense, if you're talking about a slow instruction like `loop`. If you actually care about your code running fast, go read Agner Fog's guides, https://agner.org/optimize/. If you don't, keep obfuscating. – Peter Cordes Jan 28 '19 at 01:39