The standard bithack for this is "Merge bits from two values according to a mask" - I added your variable names for the inputs to the existing comments from Sean Anderson's collection of bithacks.
unsigned int a; // (ByteToAlter) value to merge in non-masked bits
unsigned int b; // (ValuesToAssign) value to merge in masked bits
unsigned int mask; // (BitsToAssign) 1 where bits from b should be selected; 0 where from a.
unsigned int r; // result of (a & ~mask) | (b & mask) goes here
r = a ^ ((a ^ b) & mask);
As the bithack comments note, the straight-forward way is (a & ~mask) | (b & mask)
- clear the bits you're not keeping in each input, then OR them together.
How a ^ ((a ^ b) & mask)
works:
bitdiff = a^b
is the bitwise difference between those inputs.
It has a 1
where they differ, and a 0
where they're the same. (By definition of XOR).
a ^ bitdiff
would flip every bit in a
that's different from b
. In fact, b == a ^ bitdiff
.
One way to show that's true is that XOR is associative, so a ^ (a^b)
= (a ^ a) ^ b
.
And x^x
= 0, just like x-x = 0
.
0 ^ x = x
, so (a^a) ^ b
= 0 ^ b
= b
.
But if we mask the bitdiff to only set bits of a
to bits from b
at certain positions, we've achieved our goal: bitwise blend according to a mask. blend = a ^ (bitdiff & mask);
Special cases of the (a & ~mask) | (b & mask)
simple version
If your inputs are arranged so ValuesToAssign
only has any 1
bits at positions selected by the mask, you can optimize away the b & mask
part, leaving just (a & ~mask) | b
. (Eraklon's answer). Clear the unselected bits, then OR in the new values to set any bits that should be set.
A further special case is when ValuesToAssign == BitsToAssign
, i.e. the modification only ever sets bits, never clearing them. That's what OR does, so of course this case optimizes to a | b
, with no need to clear bits in a
before ORing.
Efficiency:
r = a ^ ((a ^ b) & mask);
is only 3 boolean operations,
vs. 4 for (a & ~mask) | (b & mask)
if all three inputs are runtime-variables. (One bitwise NOT, two AND, plus an OR).
If mask
is a constant, then constant-propagation into ~mask
makes it a constant, and most machines can do AND-immediate with at least an 8-bit AND mask. So you'd still only need 3 instruction: two ANDs (with inverse constants) and an OR.
Some machines (like x86 with BMI1) even have an andn
instruction that does x & ~y
, allowing the (a & ~mask) | (b & mask)
to be done with 3 instructions.
For latency, (a & ~mask) | (b & mask)
has more instruction-level parallelism. If mask
and ~mask
are ready ahead of a
and b
, there are only two parallel AND operations, and an OR, from a
and b
inputs being ready to the output being ready. On a normal superscalar machine (that can do two independent AND ops in the same cycle), that's only 2 cycle latency from inputs to outputs. (Again, assuming mask
is ready early, or that an instruction like andn
exists to do a & ~mask
in a single operation).
If the critical path goes through mask
(i.e. it's not ready early), and you don't have an instruction like andn
to do ~
and &
as one operation, the latency from mask
to result
is 3 operations, ~
, &
, and |
. (Assuming the b & mask
can run in parallel without slowing down any of those three).
Latencies for (a & ~mask) | (b & mask)
on a modern OoO exec machine.
(Not the same thing as throughput cost)
- a -> result: 2 cycles
- b -> result: 2 cycles
- mask -> result: 3 cycles (or 2 on some machines)
But the bit-difference way doesn't have any ILP; it's a chain of 3 operations. a^b
requires both of those inputs to be ready for the first step, then mask
needs to be ready for the & mask
step. The final a ^ ...
step is using inputs that were already needed earlier. So it's only 3 operations, but they're serial.
Latencies for a ^ ((a ^ b) & mask)
on a modern OoO exec machine.
- a -> result: 3 cycles
- b -> result: 3 cycles
- mask -> result: 2 cycles
Related Q&As:
Merge bit sequences a and b according to a mask - this is called a blend in SIMD programming. IDK if anyone else uses the "bit-blend" term I like to use for this operation, but IMO that clearly describes it. x86's AVX-512 extension has a 3-input boolean operation vpternlog
with a truth-table from an 8-bit immediate, and thus can do it in a single instruction.
Swapping bits at a given point between two bytes - The same bithack idea, but applying the masked bit-difference to both inputs to exchange bits at the mask positions.
https://catonmat.net/low-level-bit-hacks - starts gently with an intro to the operators (like ^
being bitwise XOR). Includes bithacks that use + and - (and the carry propagation effects of hitting a 1 or 0, like x & (x-1)
to clear the right-most set bit.)
https://agner.org/optimize/ for more about tuning for modern CPUs.