16

Is it my imagination, or is a PNOT instruction missing from SSE and AVX? That is, an instruction which flips every bit in the vector.

If so, is there a better way of emulating it than PXOR with a vector of all 1s? That approach is somewhat annoying, since I first have to set up the all-ones vector.
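For reference, that emulation looks roughly like this (a minimal SSE2 sketch; the wrapper name is just something I made up):

#include <emmintrin.h>

// Minimal sketch of PNOT emulated as PXOR with all-ones (SSE2).
// The wrapper name is arbitrary, not a real intrinsic.
static inline __m128i not_si128(__m128i v)
{
    __m128i ones = _mm_set1_epi32(-1);   // materialize the all-ones constant
    return _mm_xor_si128(v, ones);       // v XOR 111...1  ==  NOT v
}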

Cole Tobin
SODIMM
  • Setting up a vector of all `1`s is not particularly difficult: `[v]pcmpe[typesize] %[x/y]mmN, %[x/y]mmN[, %[x/y]mmN]` or thereabouts. A single instruction to set up the constant does not seem too onerous. If you have a particular aversion to `xor`, `pandn` and `andnps` are also available. – EOF Mar 05 '17 at 21:13
  • It's not terrible - but it's 2x as long as I'd expect for a basic operation like this. Of course, the constant could be hoisted, at the expense of a register. Anyway, just checking my assumption that I wasn't missing this somewhere. @EOF – SODIMM Mar 05 '17 at 21:15
  • Given that `pcmpeXX` has been recognized as dependency-breaking since at least Sandy Bridge (according to Agner Fog's microarchitecture manuals), whether it takes one or two instructions to negate a vector will not matter in almost all cases. – EOF Mar 05 '17 at 21:20
  • I agree in general. It matters in my case. I am throughput and port constrained on the 3 vector ports. Every vector operation costs me 1/3 of a cycle (within reason). @EOF – SODIMM Mar 05 '17 at 21:35
  • There is an `ANDNPD` (and-not) in SSE. – Chuck Walbourn Mar 05 '17 at 22:31
  • There's always the `~x = - x - 1` identity, too. `-1 - x` might be useful. – Brett Hale Mar 06 '17 at 05:42
  • Similarly: where's the `PNEG` instruction? – Joost Mar 17 '17 at 15:44
  • ISA design is based on a lot of research effort. The result is that `NOT` is not a commonly used enough instruction to be worth the die space. `ANDN` and `XOR` are much more useful to most projects. – phuclv May 09 '17 at 01:28
  • I guess then it is reasonable to ask why `not` was included in the original (non-SIMD) x86 ISA with a nice short opcode, and why `andn` didn't appear until some 20 years later? – SODIMM Jul 03 '17 at 21:37
  • Well, presumably ISA design research made significant progress between 8086 and SSE, or BMI2. Or SIMD-vector NOT was less commonly useful than scalar integer `not`. – Peter Cordes Jul 14 '17 at 04:20
  • If you're vector-ALU bound and out of registers (preventing you or the compiler from hoisting the `pcmpeqd same,same` out of the loop), put your all-ones constant in memory. `PXOR xmm0, [allones]` micro-fuses into a load+ALU uop, so it doesn't cost any extra issue bandwidth. Repeated loads of the same constant will hit in L1D cache. – Peter Cordes Jul 14 '17 at 04:22

4 Answers

16

For cases such as this it can be instructive to see what a compiler would generate.

E.g. for the following function:

#include <immintrin.h>

__m256i test(const __m256i v)
{
  return ~v;
}

both gcc and clang seem to generate much the same code:

test(long long __vector(4)):
        vpcmpeqd        ymm1, ymm1, ymm1
        vpxor   ymm0, ymm0, ymm1
        ret
Paul R
8

If you use intrinsics, you can wrap the NOT operation in an inline function like this:

#include <immintrin.h>

static inline __m256i _mm256_not_si256(__m256i a)
{
    // Alternative: _mm256_xor_si256(a, _mm256_set1_epi32(0xffffffff));
    // XOR with an all-ones vector generated by comparing a register with itself.
    // (I didn't check which of the two is faster.)
    return _mm256_xor_si256(a, _mm256_cmpeq_epi32(a, a));
}
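For instance, a hypothetical usage sketch (the buffer/loop setup below is my own illustration, assuming the wrapper above is in scope and AVX2 is enabled):

#include <stddef.h>

// Hypothetical example: invert every bit of an array of __m256i vectors,
// reusing the _mm256_not_si256 wrapper defined above.
void invert_buffer(__m256i *buf, size_t n_vectors)
{
    for (size_t i = 0; i < n_vectors; i++)
        buf[i] = _mm256_not_si256(buf[i]);
}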
Amiri
  • Good compilers will usually optimize `_mm256_set1_epi32(-1)` to a `vpcmpeqd same,same`. I guess with AVX it's probably not going to hurt the compiler to try to "trick" it into emitting that if it wouldn't normally. (With SSE it could cost extra MOVDQA instructions, but AVX 3-operand encoding solves that.) – Peter Cordes Jul 14 '17 at 04:26
7

AVX512F vpternlogd / _mm512_ternarylogic_epi32(__m512i a, __m512i b, __m512i c, int imm8) finally provides a way to implement NOT without any extra constants, using a single instruction which can run on any vector ALU port on Skylake-avx512.

With AVX512VL, it's available for 128- and 256-bit vectors as well, without dirtying the upper part of a ZMM. (All AVX512 CPUs except Xeon Phi have AVX512VL.)

On Intel CPUs, it can run on any of ports 0, 1, or 5, so it has 3/clock throughput for the 128- and 256-bit versions. As usual, that drops to 2/clock for 512-bit vectors, since the vector ALU on port 1 is shut down while any 512-bit uops are in flight (https://www.uops.info/html-instr/VPTERNLOGD_XMM_XMM_XMM_I8.html).


vpternlogd zmm,zmm,zmm, imm8 has 3 input vectors and one output, modifying the destination in place. With the right immediate, you can still implement a copy-and-NOT into a different register, but it will have a "false" dependency on the output register (which vpxord dst, src, all-ones wouldn't).

TL;DR: probably still use xor with all-ones inside a loop, unless you're running out of registers. vpternlog may cost an extra vmovdqa register-copy instruction if its input is needed later.

Outside of loops, vpternlogd zmm,zmm,zmm, 0xff is the compiler's best option for creating a 512-bit vector of all-ones in the first place: AVX512 compare instructions compare into mask registers (k0-k7), so for 512-bit vectors an XOR with all-ones might already involve a vpternlogd just to build the constant, or maybe a broadcast load of the constant from memory. For 128- or 256-bit vectors it's a dep-breaking vpcmpeqd same,same ALU uop.
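As a rough sketch of what that means at the source level (the wrapper name is mine; whether the compiler actually emits vpternlogd for the constant, or folds everything into one vpternlogd, is up to the compiler):

#include <immintrin.h>

// Sketch: 512-bit NOT written as XOR with an all-ones constant (AVX512F).
// A compiler will typically materialize _mm512_set1_epi32(-1) with
// vpternlogd zmm,zmm,zmm,0xff, but none of that is guaranteed.
static inline __m512i not512_xor(__m512i v)
{
    return _mm512_xor_si512(v, _mm512_set1_epi32(-1));
}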


For each bit position i, the output bit is imm8[(DEST[i]<<2) + (SRC1[i]<<1) + SRC2[i]], where the imm8 is treated as an 8-entry bitmap.

Thus, if we want the result to be NOT(SRC2), depending only on SRC2 (which is the zmm/m512/m32bcst operand), we should set the imm8 bits at the even positions (the indices where SRC2=0), giving the repeating pattern 0b01010101.

vpternlogd  zmm1,zmm1, zmm2,  01010101b  ; 0x55  ; false dep on zmm1
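To make the imm8-as-truth-table idea concrete, here is a small illustrative helper (my own, not part of any intrinsics header) that builds the immediate from a desired Boolean function, using the indexing formula above:

// Illustrative only: build a vpternlogd imm8 from a Boolean function
// f(dest, src1, src2), using bit index (dest<<2) + (src1<<1) + src2.
static inline int ternlog_imm8(int (*f)(int dest, int src1, int src2))
{
    int imm = 0;
    for (int idx = 0; idx < 8; idx++) {
        int dest = (idx >> 2) & 1;
        int src1 = (idx >> 1) & 1;
        int src2 = idx & 1;
        if (f(dest, src1, src2))
            imm |= 1 << idx;
    }
    return imm;
}
// For f(dest, src1, src2) = !src2 this yields 0b01010101 = 0x55, matching the above.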

If you're lucky, a compiler will optimize _mm512_xor_epi32(v, _mm512_set1_epi32(-1)) to vpternlogd for you if it's profitable.

// To hand-hold a compiler into saving a vmovdqa32 if needed:
__m512i tmp = ...;              // something computed earlier
__m512i t2  = _mm...(tmp);      // some operation on tmp
// use-case: tmp is dead; t2 and ~t2 are both needed.
__m512i t2_inv = _mm512_ternarylogic_epi32(tmp, t2, t2, 0b01010101);

If you're not sure that's a good idea, just keep it simple and use the same variable for all 3 inputs:

__m512i t2_inv = _mm512_ternarylogic_epi32(t2, t2, t2, 0b01010101);
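With AVX512VL (as mentioned above), the same trick works for 256-bit and 128-bit vectors; a minimal sketch, with the wrapper name being mine:

// AVX512VL + AVX512F: 256-bit NOT via vpternlogd ymm, no vector constant needed.
// The same false-dependency caveat applies if you want a copy-and-NOT in asm.
static inline __m256i not256_ternlog(__m256i v)
{
    return _mm256_ternarylogic_epi32(v, v, v, 0x55);
}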
Peter Cordes
  • I don't agree with the "it's not quite as good as PXOR" part. 2 ops/cycle is the maximum possible throughput for AVX512 instructions on Skylake, and only a few instructions in the PADD and PAND groups reach this speed. 2*512 > 3*256, so on these instructions AVX512 is still 33% faster than AVX2. – Bulat Mar 11 '20 at 17:45
  • @Bulat: That's not what I was saying. It's less good because it can't copy-and-not without a false dependency on the output register, so your compiler might waste some front-end throughput on a `vmovdqa32`. Also, 128 and 256-bit versions of AVX512VL instructions are often 3/clock, including `vpternlogd ymm`. If you're only using 256-bit vectors to avoid turbo penalties in some small part of your whole program, you can still use `vpternlogd ymm`. Also, FP FMA/add/mul are 2/clock on Gold so IDK what you're talking about with "only a few" instructions running at 2/clock for 512-bit vectors. – Peter Cordes Mar 11 '20 at 18:18
3

You can use the PANDN instruction for that.

PANDN implements the operation

DEST = NOT(DEST) AND SRC   ; (SSEx)

or

DEST = NOT(SRC1) AND SRC2  ; (AVXx)

Combining this operation with an all-ones vector effectively results in a PNOT operation.


Some x86(SSEx) assembly code would look like this:

; XMM0 is input register
PCMPEQB   xmm1, xmm1        ; Whole xmm1 reg set to 1's
PANDN     xmm0, xmm1        ; xmm0 = NOT(xmm0) AND xmm1
; XMM0 contains NOT(XMM0)

Some x86(AVXx) assembly code would look like this:

; YMM0 is input register
VPCMPEQB  ymm1, ymm1, ymm1  ; Whole ymm1 reg set to 1's
VPANDN    ymm0, ymm0, ymm1  ; ymm0 = NOT(ymm0) AND ymm1
; YMM0 contains NOT(YMM0)

Both can (of course) easily be translated to intrinsics.
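For example, a sketch of the intrinsic form (the wrapper names are mine):

#include <immintrin.h>

// SSE2: NOT(v) = ANDN(v, all-ones)
static inline __m128i not_si128_andn(__m128i v)
{
    __m128i ones = _mm_cmpeq_epi8(v, v);   // all bits set
    return _mm_andnot_si128(v, ones);      // (~v) & ones == ~v
}

// AVX2: same idea with 256-bit vectors
static inline __m256i not_si256_andn(__m256i v)
{
    __m256i ones = _mm256_cmpeq_epi8(v, v);
    return _mm256_andnot_si256(v, ones);
}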

zx485
  • Since this still needs a vector of all-ones, it doesn't seem any better than the PXOR proposed in the question. – Nate Eldredge Jun 02 '22 at 14:02