How to use bitwise operations to assign specific bits of a byte according to two other bytes? (bit blend according to a mask)

Question

I have 3 bytes.

One byte determines which bits of the 3rd byte need to change (1 indicates bit needs to change 0 means no change should occur).
The 2nd byte determines whether the changing bits are assigned 1 or 0.
The 3rd byte is where changes take place.

Is there a way I can achieve this using bitwise operators? If so, how? A simple formula or program to achieve this would be nice (preferably in c).

Example:

BitsToAssign:   0b01101011
ValuesToAssign: 0b01000010
ByteToAlter:    0b11110001

ExpectedResult: 0b11010010

I tried reading this three times. I have no idea what you are asking, other than "can i achieve this with bit wise operations" the answer is yes. https://www.programiz.com/c-programming/bitwise-operators — Sorenp, Jun 19 '21 at 17:13
almost duplicate: [Merge bit sequences a and b according to a mask](https://stackoverflow.com/q/39282691). And [Swapping bits at a given point between two bytes](https://stackoverflow.com/q/3562347) goes farther, producing two outputs with certain bit-positions swapped. (A masked xor bit-difference.) — Peter Cordes, Jun 19 '21 at 17:35

Peter Cordes · Accepted Answer · 2021-06-19T18:35:15.190

The standard bithack for this is "Merge bits from two values according to a mask" - I added your variable names for the inputs to the existing comments from Sean Anderson's collection of bithacks.

unsigned int a;    // (ByteToAlter)    value to merge in non-masked bits
unsigned int b;    // (ValuesToAssign) value to merge in masked bits
unsigned int mask; // (BitsToAssign)   1 where bits from b should be selected; 0 where from a.

unsigned int r;    // result of (a & ~mask) | (b & mask) goes here

r = a ^ ((a ^ b) & mask);

As the bithack comments note, the straight-forward way is (a & ~mask) | (b & mask) - clear the bits you're not keeping in each input, then OR them together.

How `a ^ ((a ^ b) & mask)` works:

bitdiff = a^b is the bitwise difference between those inputs.
It has a 1 where they differ, and a 0 where they're the same. (By definition of XOR).

a ^ bitdiff would flip every bit in a that's different from b. In fact, b == a ^ bitdiff.
One way to show that's true is that XOR is associative, so a ^ (a^b) = (a ^ a) ^ b.
And x^x = 0, just like x-x = 0.
0 ^ x = x, so (a^a) ^ b = 0 ^ b = b.

But if we mask the bitdiff to only set bits of a to bits from b at certain positions, we've achieved our goal: bitwise blend according to a mask. blend = a ^ (bitdiff & mask);

Special cases of the `(a & ~mask) | (b & mask)` simple version

If your inputs are arranged so ValuesToAssign only has any 1 bits at positions selected by the mask, you can optimize away the b & mask part, leaving just (a & ~mask) | b. (Eraklon's answer). Clear the unselected bits, then OR in the new values to set any bits that should be set.

A further special case is when ValuesToAssign == BitsToAssign, i.e. the modification only ever sets bits, never clearing them. That's what OR does, so of course this case optimizes to a | b, with no need to clear bits in a before ORing.

Efficiency:

r = a ^ ((a ^ b) & mask); is only 3 boolean operations,
vs. 4 for (a & ~mask) | (b & mask) if all three inputs are runtime-variables. (One bitwise NOT, two AND, plus an OR).

If mask is a constant, then constant-propagation into ~mask makes it a constant, and most machines can do AND-immediate with at least an 8-bit AND mask. So you'd still only need 3 instruction: two ANDs (with inverse constants) and an OR.

Some machines (like x86 with BMI1) even have an andn instruction that does x & ~y, allowing the (a & ~mask) | (b & mask) to be done with 3 instructions.

For latency, (a & ~mask) | (b & mask) has more instruction-level parallelism. If mask and ~mask are ready ahead of a and b, there are only two parallel AND operations, and an OR, from a and b inputs being ready to the output being ready. On a normal superscalar machine (that can do two independent AND ops in the same cycle), that's only 2 cycle latency from inputs to outputs. (Again, assuming mask is ready early, or that an instruction like andn exists to do a & ~mask in a single operation).

If the critical path goes through mask (i.e. it's not ready early), and you don't have an instruction like andn to do ~ and & as one operation, the latency from mask to result is 3 operations, ~, &, and |. (Assuming the b & mask can run in parallel without slowing down any of those three).

Latencies for (a & ~mask) | (b & mask) on a modern OoO exec machine.
(Not the same thing as throughput cost)

a -> result: 2 cycles
b -> result: 2 cycles
mask -> result: 3 cycles (or 2 on some machines)

But the bit-difference way doesn't have any ILP; it's a chain of 3 operations. a^b requires both of those inputs to be ready for the first step, then mask needs to be ready for the & mask step. The final a ^ ... step is using inputs that were already needed earlier. So it's only 3 operations, but they're serial.

Latencies for a ^ ((a ^ b) & mask) on a modern OoO exec machine.

a -> result: 3 cycles
b -> result: 3 cycles
mask -> result: 2 cycles

Related Q&As:

Merge bit sequences a and b according to a mask - this is called a blend in SIMD programming. IDK if anyone else uses the "bit-blend" term I like to use for this operation, but IMO that clearly describes it. x86's AVX-512 extension has a 3-input boolean operation vpternlog with a truth-table from an 8-bit immediate, and thus can do it in a single instruction.
Swapping bits at a given point between two bytes - The same bithack idea, but applying the masked bit-difference to both inputs to exchange bits at the mask positions.
https://catonmat.net/low-level-bit-hacks - starts gently with an intro to the operators (like ^ being bitwise XOR). Includes bithacks that use + and - (and the carry propagation effects of hitting a 1 or 0, like x & (x-1) to clear the right-most set bit.)
https://agner.org/optimize/ for more about tuning for modern CPUs.

Actually modern compilers will generate the same code for both. No difference at all. That is the reason I always say that compilers are usually far better in micro-optimizations than humans https://godbolt.org/z/9ovvxrhvv — 0___________, Jun 19 '21 at 18:39
@0___________ thanks, usually I do check compiler output. (Especially since 2-operand ISAs like x86 may need extra `mov` instructions.) Interesting that GCC knows this optimization. Clang also knows it (https://godbolt.org/z/x57vKzfds), and with `-march=haswell` will use BMI1 `andn` to compile both versions to `andn` / `and` / `or` (with no MOV needed). But without BMI1, it chooses to compile both as written. (Requiring one MOV in each). — Peter Cordes, Jun 19 '21 at 18:46
@0___________: Unfortunately GCC `-O3 -march=haswell` applies this peephole even though andn/and/or would be more efficient :/ https://godbolt.org/z/hKrsTfGaT — Peter Cordes, Jun 19 '21 at 19:10
Thanks for the lengthy explanation:) it is really helpful even though I don't completely understand all of the parts. — Mr. 0099, Jun 21 '21 at 16:36
@Mr.0099: If want to see it in action to understand better, try it yourself with a 4-bit example. (Or a simple example in the low 4 bits or separate hex digits of a `char` so you can single-step it in a debugger. Especially if you assign each intermediate result to a separate C variable, if you want to do source-level debugging instead of looking at the asm which necessarily has to use separate machine instructions.) Or if you meant not understanding all the microarchitectural details, or latency / dependency chains, that's optional and can take some time to wrap your head around. — Peter Cordes, Jun 21 '21 at 16:40

booyaakaashaa · Answer 2 · 2021-06-19T17:28:19.553

2

BitsToAssign & ValuesToAssign | ~BitsToAssign & ByteToAlter Please try this.

Explanation: Select ValueToAssign if BitsToAssign is True else select ByteToAlter

edited Jun 19 '21 at 17:28

answered Jun 19 '21 at 17:20

booyaakaashaa

105
12

3

You want `~`, bitwise NOT, rather than `!` logical NOT. – Peter Cordes Jun 19 '21 at 17:21
1

Also, that's more operations than `a ^ ((a ^ b) & mask);`. [Merge bit sequences a and b according to a mask](https://stackoverflow.com/q/39282691) For latency, this way has more instruction-level parallelism, though, and doesn't require bittoalter and valuestoassign to both be ready before making any progress. Although if your BitsToAssign is a constant, masking with it directly and inverted can optimize away the `~`, and for single bytes most ISAs can do an AND-immediate with an 8-bit mask. – Peter Cordes Jun 19 '21 at 17:29
1

Your explanation part and using keywords for operators really helped me understand how your formula gets to my desired result. Thank you so much. – Mr. 0099 Jun 21 '21 at 16:34

score 1 · Answer 3 · answered Jun 19 '21 at 18:03

void printbin(unsigned char val)
{
    unsigned char mask = 1 << (sizeof(mask) * 8 - 1);
    while(mask)
    {
        printf("%c", val & mask ? '1' : '0');
        mask >>= 1;
    }
}

unsigned merge(unsigned ByteToAlter, unsigned ValuesToAssign, unsigned BitsToAssign)
{
    unsigned clearMask = ~BitsToAssign;

    return (ByteToAlter & clearMask) | (ValuesToAssign & BitsToAssign);
}

int main(void)
{
    printbin(merge(0b11110001, 0b01000010, 0b01101011));
    
}

How to use bitwise operations to assign specific bits of a byte according to two other bytes? (bit blend according to a mask)

3 Answers3

How `a ^ ((a ^ b) & mask)` works:

Special cases of the `(a & ~mask) | (b & mask)` simple version

Efficiency:

Linked

How to use bitwise operations to assign specific bits of a byte according to two other bytes? (bit blend according to a mask)

3 Answers3

How a ^ ((a ^ b) & mask) works:

Special cases of the (a & ~mask) | (b & mask) simple version

Efficiency:

Linked

How `a ^ ((a ^ b) & mask)` works:

Special cases of the `(a & ~mask) | (b & mask)` simple version