
There are at least two ways to express,

a ? b : c

given that true is all bits on (0xff...), and false is all bits off (0).

The first is,

(a & b) | (~a & c)

and the second is,

(a & (b ^ c)) ^ c

The relevant sequence of SSE instructions would be, respectively,

andxx
andnxx
orxx

and

xorxx
andxx
xorxx

Which way would you prefer and why?

Mgetz
xiver77

  • The first has a shorter dependency chain: the `andxx` and `andnxx` can run in parallel. – Raymond Chen Dec 13 '21 at 14:20
  • What number of clocks do they need, considering caches and pipelines? -- OT: I have yet to see a C compiler implementation that uses "-1" or "~0" (all ones) as `true`. – the busybee Dec 13 '21 at 14:21
  • @thebusybee It seems that OP is programming in assembly (not sure why there is a [c] tag). With SSE, the various comparison instructions indicate truth with all bits set so you can use that directly as a mask. – fuz Dec 13 '21 at 14:34
  • @500-InternalServerError `((a^b)&c)^a` selects either `0^a`, if `c == 0`, or `(a^b)^a` if `c==0xf..f`. – EOF Dec 13 '21 at 15:16
  • If `b` and `c` are known at compile time or are loop-invariant, the second version will save one µop; if not, it increases the dependency chain. If you have SSE4, most of the time you should use `blendvps`/`blendvpd` or `pblendvb`. – chtz Dec 13 '21 at 15:18
  • @500-InternalServerError Someone else edited the question for clarity, but basically `&` has a higher precedence over `^` and `|`, so the parentheses aren't necessary. – xiver77 Dec 13 '21 at 18:01
  • Near duplicate of [How to use bitwise operations to assign specific bits of a byte according to two other bytes? (bit blend according to a mask)](https://stackoverflow.com/q/68048922), where I compare critical-path latencies. (I didn't consider possible `mov` instructions to keep original values around.) I think it actually is a duplicate of that and a couple of related Q&As about SSE and performance. – Peter Cordes Dec 13 '21 at 18:48
  • @PeterCordes Your link helped. I wouldn't have asked this question if I had found that question. Still, I posted some information not in your link as an answer to my own question. – xiver77 Dec 13 '21 at 18:59
  • Yeah, just saw your answer. The other duplicates I was going to use were [Merge bit sequences a and b according to a mask](https://stackoverflow.com/q/39282691) (for SSE4.1 blends) and [What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?](https://stackoverflow.com/q/51607391) (for how to evaluate performance of short sequences: latency of dep chains, front-end cost, and back-end uops). – Peter Cordes Dec 13 '21 at 19:05

1 Answer


I ended up using both. They have different advantages.

typedef __m128 p4f_t;

static inline p4f_t p4f_if(p4f_t a, p4f_t b, p4f_t c) {
    return p4f_or(p4f_and(a, b), p4f_andn(a, c));
}

static inline p4f_t p4f_if2(p4f_t a, p4f_t b, p4f_t c) {
    return p4f_xor(p4f_and(a, p4f_xor(b, c)), c);
}

I posted the wrapper code at the end. `negif` and `negif2` below do the same job but compile to different object code.

static inline p4f_t p4f_neg(p4f_t a) {
    p4f_t mask = p4f_fill_bitCopy(0x80000000);
    return p4f_xor(a, mask);
}

p4f_t negif(p4f_t a, p4f_t b) {
    return p4f_if(b, p4f_neg(a), a);
}

p4f_t negif2(p4f_t a, p4f_t b) {
    return p4f_if2(b, p4f_neg(a), a);
}

This is the disassembly.

<negif>:
movaps xmm3,XMMWORD PTR .LC0[rip]
movaps xmm2,xmm1
andnps xmm2,xmm0
xorps  xmm3,xmm0
andps  xmm1,xmm3
orps   xmm1,xmm2
movaps xmm0,xmm1
ret

<negif2>:
andps  xmm1,XMMWORD PTR .LC0[rip]
xorps  xmm0,xmm1
ret

Why is `negif2` optimized better?

neg(a) = xor(a, mask)
if2(a, b, c) = xor(and(a, xor(b, c)), c)
negif2(a, b) = if2(b, neg(a), a)
             = xor(and(b, xor(neg(a), a)), a)
             = xor(and(b, xor(xor(a, mask), a)), a)
             = xor(and(b, mask), a)

`negif2` can be simplified by trivial substitution while `negif` cannot, so at least the current version of GCC fails to optimize `negif` fully.

However, as @RaymondChen and @chtz mentioned in the comments, the `and` and `andn` in the first version don't depend on each other and can run in parallel, so in other cases `p4f_if` should be the choice over `p4f_if2`.

This is the wrapper code.

static inline p4f_t p4f_fill(float a) {
    return _mm_set1_ps(a);
}

static inline p4f_t p4f_fill_bitCopy(uint32_t a) {
    float a_;
    memcpy(&a_, &a, sizeof(a));
    return p4f_fill(a_);
}

static inline p4f_t p4f_and(p4f_t a, p4f_t b) {
    return _mm_and_ps(a, b);
}

static inline p4f_t p4f_andn(p4f_t a, p4f_t b) {
    return _mm_andnot_ps(a, b);
}

static inline p4f_t p4f_or(p4f_t a, p4f_t b) {
    return _mm_or_ps(a, b);
}

static inline p4f_t p4f_xor(p4f_t a, p4f_t b) {
    return _mm_xor_ps(a, b);
}
xiver77
  • Interesting missed-optimization in `negif`, both GCC and clang miss that simplification with the and/andn version: https://godbolt.org/z/Yhen46eK8. I wouldn't have written that function that way in the first place; I'd just have done `xor(a, and(b, signbit_mask))`. The only reason for using `p4f_if2` (masked xor) at all is that you already know that optimization, so you might as well simplify the one use-case instead of having another version of your bit-blend function. If you use `#ifdef __SSE4_1__` to use `blendvps` in both, that optimization will disappear on you. – Peter Cordes Dec 13 '21 at 19:03