If a bit position holds the same value in both inputs, no change is needed in either. If the bits differ, both need to be inverted.
XOR with 1 flips a bit; XOR with 0 is a no-op.
So what we want is a value that has a 1 everywhere there's a bit-difference between the inputs, and a 0 everywhere else. That's exactly what a XOR b does.
Simply mask this bit-difference to only keep the differences in the bits we want to swap, and we have a bit-swap in 3 XORs + 1 AND.
Your mask is (1UL << position) - 1. One less than a power of 2 has all the bits below that bit set. Or more generally, with a high and low position for your bit-range: (1UL << highpos) - (1UL << lowpos). Whether a lookup-table is faster than bit-set / sub depends on the compiler and hardware. (See @PaxDiablo's answer for the LUT suggestion).
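As a rough sketch of the mask construction described above, the helper below builds a mask covering bits lowpos up to (but not including) highpos. The name bit_range_mask and the handling of a position equal to the full type width are assumptions added for illustration, not part of the original answer; shifting by the full width of the type is undefined behaviour in C, so that case is treated as wrapping to 0.

```c
#include <limits.h>

/* Hypothetical helper (not from the answer): returns a mask with
   bits [lowpos, highpos) set.  Assumes lowpos <= highpos <= width
   of unsigned long. */
static inline unsigned long bit_range_mask(unsigned lowpos, unsigned highpos)
{
    const unsigned width = sizeof(unsigned long) * CHAR_BIT;

    /* 1UL << width would be undefined behaviour, so substitute 0,
       which is what the power of 2 would wrap to anyway. */
    unsigned long high = (highpos < width) ? (1UL << highpos) : 0UL;
    unsigned long low  = (lowpos  < width) ? (1UL << lowpos)  : 0UL;

    /* Unsigned subtraction wraps, so 0 - low sets every bit from
       lowpos upward. */
    return high - low;
}
```

With these assumptions, bit_range_mask(0, 4) gives 0xF and bit_range_mask(4, 8) gives 0xF0.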
// Portable C:
//static inline
void swapBits_char(unsigned char *A, unsigned char *B)
{
    const unsigned highpos = 4, lowpos = 0;  // function args if you like
    const unsigned char mask = (1UL << highpos) - (1UL << lowpos);  // selects bits [lowpos, highpos)
    unsigned char tmpA = *A, tmpB = *B;      // read into locals in case A==B
    unsigned char bitdiff = tmpA ^ tmpB;
    bitdiff &= mask;                         // clear all but the selected bits
    *A = tmpA ^ bitdiff;                     // flip bits that differed
    *B = tmpB ^ bitdiff;
}
//static inline
void swapBit_uint(unsigned *A, unsigned *B, unsigned mask)
{
    unsigned tmpA = *A, tmpB = *B;
    unsigned bitdiff = tmpA ^ tmpB;
    bitdiff &= mask;       // clear all but the selected bits
    *A = tmpA ^ bitdiff;
    *B = tmpB ^ bitdiff;
}
(Godbolt compiler explorer with gcc for x86-64 and ARM)
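To make the arithmetic concrete, here is a small worked example. swap_masked_bits repeats the same technique as swapBit_uint so that the sketch compiles on its own; that name, demo_swap, and the sample values are chosen purely for illustration.

```c
#include <stdio.h>

/* Same technique as swapBit_uint, repeated so this sketch is self-contained. */
static void swap_masked_bits(unsigned *A, unsigned *B, unsigned mask)
{
    unsigned tmpA = *A, tmpB = *B;            /* locals, in case A == B */
    unsigned bitdiff = (tmpA ^ tmpB) & mask;  /* differing bits inside the mask */
    *A = tmpA ^ bitdiff;
    *B = tmpB ^ bitdiff;
}

/* Hypothetical demo: swap the low nibbles of two values. */
void demo_swap(void)
{
    unsigned a = 0xAB, b = 0xCD;

    /* 0xAB ^ 0xCD = 0x66; masked with 0x0F that leaves 0x06.
       0xAB ^ 0x06 = 0xAD and 0xCD ^ 0x06 = 0xCB. */
    swap_masked_bits(&a, &b, 0x0F);
    printf("a=0x%X b=0x%X\n", a, b);          /* prints a=0xAD b=0xCB */

    /* Aliased pointers: bitdiff is 0, so the value is left unchanged. */
    swap_masked_bits(&a, &a, 0xFF);
    printf("a=0x%X\n", a);                    /* prints a=0xAD */
}
```

Unlike a plain xor-swap, passing the same pointer for both arguments does not zero the value, because both inputs are read into locals before anything is written back.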
This is not an xor-swap. It does use temporary storage. As @chux's answer on a near-duplicate question demonstrates, a masked xor-swap requires 3 AND operations as well as 3 XOR. (And defeats the only benefit of XOR-swap by requiring a temporary register or other storage for the & results.) This answer is a modified copy of my answer on that other question.
This version only requires 1 AND. Also, the last two XORs are independent of each other, so total latency from inputs to both outputs is only 3 operations. (Typically 3 cycles).
For an x86 asm example of this, see this code-golf: "Exchange capitalization of two strings" in 14 bytes of x86-64 machine code (with commented asm source).
– Peter Cordes Jun 13 '18 at 14:10