Fastest way to set __m256 value to all ONE bits

Question

How can I set a value of 1 to all bits in an __m256 value? Using either AVX or AVX2 intrinsics?

To get all zeros, you can use _mm256_setzero_si256().

To get all ones, I'm currently using _mm256_set1_epi64x(-1), but I suspect that this is slower than the all-zero case. Is there memory access or Scalar/SSE/AVX switching involved here?

And I can't seem to find a simple bitwise NOT operation in AVX? If that was available, I could simply use the setzero, followed by a vector NOT.

In former times, people used `pcmpeqd xmm0, xmm0` for that, presumably there is an equivalent operation in AVX{2}? — njuffa, May 26 '16 at 20:18
@njuffa `vpcmpeqd` in AVX2. Clang seems to optimize the `_mm256_set1_epi64x(-1);` to that, the same as `_mm256_cmpeq_epi64(_mm256_setzero_si256(), _mm256_setzero_si256());` — Dan Mašek, May 26 '16 at 20:19
Have a look at section 13.8 *Generating constants* in [Agner Fog's An optimization guide for x86 platforms](https://www.agner.org/optimize/optimizing_assembly.pdf) — phuclv, Mar 24 '21 at 05:09

Peter Cordes · Accepted Answer · 2020-06-24T19:06:15.713

See also Set all bits in CPU register to 1 efficiently which covers AVX, AVX2, and AVX512 zmm and k (mask) registers.

You obviously didn't even look at the asm output, which is trivial to do:

#include <immintrin.h>
__m256i all_ones(void) { return _mm256_set1_epi64x(-1); }

compiles to with GCC and clang with any -march that includes AVX2

    vpcmpeqd        ymm0, ymm0, ymm0
    ret

To get a __m256 (not __m256i) you can just cast the result:

  __m256 nans = _mm256_castsi256_ps( _mm256_set1_epi32(-1) );

Without AVX2, a possible option is vcmptrueps dst, ymm0,ymm0 preferably with a cold register for the input to mitigate the false dependency.

Recent clang (5.0 and later) does xor-zero a vector then vcmpps with a TRUE predicate if AVX2 isn't available. Older clang makes a 128bit all-ones with vpcmpeqd xmm and uses vinsertf128. GCC loads from memory, even modern GCC 10.1 with -march=sandybridge.

As described by the vector section of Agner Fog's optimizing assembly guide, generating constants on the fly this way is cheap. It still takes a vector execution unit to generate the all-ones (unlike _mm_setzero), but it's better than any possible two-instruction sequence, and usually better than a load. See also the x86 tag wiki.

Compilers don't like to generate more complex constants on the fly, even ones that could be generated from all-ones with a simple shift. Even if you try, by writing __m128i float_signbit_mask = _mm_srli_epi32(_mm_set1_epi16(-1), 1), compilers typically do constant-propagation and put the vector in memory. This lets them fold it into a memory operand when used later in cases where there's no loop to hoist the constant out of.

And I can't seem to find a simple bitwise NOT operation in AVX?

You do that by XORing with all-ones with vxorps (_mm256_xor_ps). Unfortunately SSE/AVX don't provide a way to do a NOT without a vector constant.

FP vs Integer instructions and bypass delay

Intel CPUs (at least Skylake) have a weird effect where the extra bypass latency between SIMD-integer and SIMD-FP still happens long after the uop producing the register has executed. e.g. vmulps ymm1, ymm2, ymm0 could have an extra cycle of latency for the ymm2 -> ymm1 critical path if ymm0 was produced by vpcmpeqd. And this lasts until the next context switch restores FP state if you don't otherwise overwrite ymm0.

This is not a problem for bitwise instructions like vxorps (even though the mnemonic has ps, it doesn't have bypass delay from FP or vec-int domains on Skylake, IIRC).

So normally it's safe to create a set1(-1) constant with an integer instruction because that's a NaN and you wouldn't normally use it with FP math instructions like mul or add.

You can also produce a NOT as follows: not_a = _mm256_andnot_ps(a, all_ones); — ChipK, Jul 10 '18 at 21:36
@ChipK: I seem to recall you doing the same thing recently, that's why I complained. If that was a different user, then nvm. Try to have your comment finished before you post it. Accidents happen, but don't do it on purpose. If I'm on SO, I'll often look at a comment notification right away when it pops up, so if necessary I can reply while the person is also still there. Anyway, yes ANDN works, too, but then you have to remember which operand is the one that's NOTed, and it doesn't work as a load (only the non-memory operand can be NOTed; it's not commutative). — Peter Cordes, Jul 10 '18 at 21:44
Anyway, thanks for pointing out ANDN. But since it still requires a vector of all-ones, and has zero advantages over XOR, I don't think it's worth suggesting as an alternative to consider. IDK if some people would find it more readable. But for me, XOR with ones is immediately understandable. — Peter Cordes, Jul 10 '18 at 21:48
Sorry, I was trying to add a carriage return between my text and my code and it added the comment - simple mistake (difference between adding a comment and adding an answer). BTW, I don't think it was me that you were pointing out priorly. — ChipK, Jul 10 '18 at 22:00
You link to this [other question](https://stackoverflow.com/questions/45105164/set-all-bits-in-cpu-register-to-1-efficiently/45113467#45113467), which for the AVX/AVX2 case says "The AVX/AVX2 version of this [`pcmpeqd`] is also the best choice there", but what is the AVX (not AVX2) version of `pcmpeqd`? Then from there you link back here. — BeeOnRope, Aug 24 '19 at 06:20
@BeeOnRope: With this question being tagged AVX2, it seems I didn't put too much effort into the AVX1 case. The real answer being: look at compiler output and consider their choices. That should probably get updated. — Peter Cordes, Aug 24 '19 at 06:39
@PeterCordes re:"Compilers don't like to generate more complex constants on the fly, even ones that could be generated from all-ones with a simple shift" Working on it in LLVM. Its a bit more complicated than one might think but is on the horizon. — Noah, Mar 29 '23 at 04:16

Fastest way to set __m256 value to all ONE bits

1 Answers1

Linked

Related