Merge bit sequences a and b according to a mask

Question

According to the bit twiddling hacks website, the operation

unsigned int a;    // value to merge in non-masked bits
unsigned int b;    // value to merge in masked bits
unsigned int mask; // 1 where bits from b should be selected; 0 where from a.
unsigned int r;    // result of (a & ~mask) | (b & mask) goes here

r = a ^ ((a ^ b) & mask);

allows to merge two bit sequences a and b according to a mask. I was wondering:

Whether this operation had a specific/usual name?
Whether specific assembly instruction was existing for this operation on some instruction set?

*"Whether specific assembly instruction was existing for this operation on some instruction set?"* - I wouldn't know of any instruction set that **didn't** have AND and XOR operations. — IInspectable, Sep 02 '16 at 00:32
But maybe some architectures have this operation wired in hardware to do it in a single instruction instead of several ones. — Vincent, Sep 02 '16 at 00:38
That's a pretty standard calculation for [Ternary Raster Operations](https://msdn.microsoft.com/en-us/library/dd145130.aspx), so yes, I would assume that dedicated circuitry is available. — IInspectable, Sep 02 '16 at 00:50

Peter Cordes · Accepted Answer · 2021-12-13T18:18:12.480

I'd call this a bit-blend, using the masked-xor method. Related: this Q&A explains in detail how/why those boolean operations accomplish this.

In SSE/AVX programming, selective copying from one vector to another based on a mask is called a blend. SSE4.1 added instructions like PBLENDVB xmm1, xmm2/m128, <XMM0>, where the implicit operand XMM0 controls which bytes of the src overwrite corresponding bytes in the dst. (Without SSE4.1, you'd usually AND and ANDNOT the mask onto two vectors, and OR that together; the masked-xor trick has less instruction-level parallelism, and probably requires at least as many MOV instructions to copy registers as the OR method.)

There's also an immediate blend instruction, pblendw, where the mask is an 8-bit immediate instead of a register. And there are 32-bit and 64-bit immediate blends (blendps, blendpd, vpblendd) and variable blends (blendvps, blendvpd).

IDK if other SIMD instruction sets (NEON, AltiVec, whatever MIPS calls theirs, etc.) also call them "blends" or not.

SSE/AVX (or x86 integer instructions) don't provide anything better than the usual bitwise XOR/AND for doing bitwise (instead of element-wise) blends until AVX512F.

AVX512F can do the bitwise version of this (or any other bitwise ternary function) with a single vpternlogd or vpternlogq instruction. (The only difference between d and q element sizes is if you use a mask register for merge-masking or zero-masking the destination, but that didn't stop Intel from making separate intrinsics even for the no-mask case:

__m512i _mm512_ternarylogic_epi32 (__m512i a, __m512i b, __m512i c, int imm8) and the equivalent ..._epi64 version.

The imm8 immediate byte is a truth table. Every bit of the destination is determined independently, from the corresponding bits of a, b and c by using them as a 3-bit index into the truth table. i.e. as imm8[a:b:c].

AVX512 will be fun to play with when it eventually appears in mainstream desktop/laptop CPUs, but that's probably a couple years away still.

I don't see why the xor trick doesn't work for vectors? Yes you need to expand the mask to be the same size as the vectors you are blending, but that's true for the AND+ANDN+OR solution as well? In any case, the xor only saves a single `not` vs the usual approach so if you have ANDN the advantage disappears (since the not is folded into the and). — BeeOnRope, Sep 02 '16 at 15:29
@BeeOnRope: `(a ^ b)` is bitwise within elements, which isn't what you want for an element-granularity blend. Or does it all cancel out in the end? I didn't actually look very hard at the xor method! — Peter Cordes, Sep 02 '16 at 15:48
Yes, but I'm referring to the "without SSE 4.1" part where you said the usual way was to use ANDN because the xor trick doesn't apply. The xor method works with exactly the same type of mask as the AND approach (yes, both are bit granular). So yes, if you have a 16-bit mask and what to use that to control a 128-bit blend of byte elements you need to expand each bit in the mask to a full byte (0 or 255) - but that's true of both approaches. The xor trick seems equally applicable here (but in a tie, not a win, since ANDN exists). — BeeOnRope, Sep 02 '16 at 16:10
@BeeOnRope: derp, I was totally missing the point that the xor version works fine with the same mask, and the AND/ANDN/XOR method is also bitwise. — Peter Cordes, Sep 02 '16 at 16:18

Merge bit sequences a and b according to a mask

1 Answers1

Linked

Related