How would you do that in C? (Example: 10110001 becomes 10001101 if we had to mirror 8 bits). Are there any instructions on certain processors that would simplify this task?
-
4"mirror" is an OK word but most folks would probably call it "bit reversal". – President James K. Polk Nov 22 '10 at 13:40
-
@GregS: Thanks, that explains why I had trouble googling it. – Thomas Nov 22 '10 at 13:42
-
3From a deleted link-only answer: http://graphics.stanford.edu/~seander/bithacks.html#BitReverseObvious (some more efficient methods are listed, too). On modern x86, you'd probably want to use SSSE3 `pshufb` as a parallel nibble LUT. (Same for any other SIMD ISA with a byte shuffle.) ARM has an `rbit` instruction that does the whole task in one efficient instruction. – Peter Cordes Jul 13 '18 at 10:11
-
Does this answer your question? [Efficient Algorithm for Bit Reversal (from MSB->LSB to LSB->MSB) in C](https://stackoverflow.com/questions/746171/efficient-algorithm-for-bit-reversal-from-msb-lsb-to-lsb-msb-in-c) – phuclv Apr 29 '22 at 15:09
-
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=50481 and https://clang.llvm.org/docs/LanguageExtensions.html#builtin-bitreverse – artless noise Oct 27 '22 at 19:29
12 Answers
It's actually called "bit reversal", and is commonly done in FFT scrambling. The O(log N) way is (for up to 32 bits):
uint32_t reverse(uint32_t x, int bits)
{
x = ((x & 0x55555555) << 1) | ((x & 0xAAAAAAAA) >> 1); // Swap _<>_
x = ((x & 0x33333333) << 2) | ((x & 0xCCCCCCCC) >> 2); // Swap __<>__
x = ((x & 0x0F0F0F0F) << 4) | ((x & 0xF0F0F0F0) >> 4); // Swap ____<>____
x = ((x & 0x00FF00FF) << 8) | ((x & 0xFF00FF00) >> 8); // Swap ...
x = ((x & 0x0000FFFF) << 16) | ((x & 0xFFFF0000) >> 16); // Swap ...
return x >> (32 - bits);
}
Maybe this small "visualization" helps:
An example of the first 3 assignment, with a uint8_t
example:
b7 b6 b5 b4 b3 b2 b1 b0
-> <- -> <- -> <- -> <-
----> <---- ----> <----
----------> <----------
Well, if we're doing ASCII art, here's mine:
7 6 5 4 3 2 1 0
X X X X
6 7 4 5 2 3 0 1
\ X / \ X /
X X X X
/ X \ / X \
4 5 6 7 0 1 2 3
\ \ \ X / / /
\ \ X X / /
\ X X X /
X X X X
/ X X X \
/ / X X \ \
/ / / X \ \ \
0 1 2 3 4 5 6 7
It kind of looks like FFT butterflies. Which is why it pops up with FFTs.

- 41,631
- 10
- 72
- 96
-
You're probably better off using a `uint_fast32_t` for `x` rather than a signed type (that may be less than 32 bits wide). Those shifts are UB on signed types. – Toby Speight Jul 13 '18 at 12:19
-
@TobySpeight Yeah, I copied this from C++ code where the type was `uint32_t` originally and there were overloads for different widths. It predates `uint_fast32_t` being available in the compiler. – Mike DeSimone Jul 14 '18 at 14:42
-
1@TobySpeight: right shift of a signed integer is implementation defined, *not* UB. (As long as the shift count is smaller than the type width, of course, same as for unsigned.) It's an arithmetic right shift on all implementations I'm aware of, but those are all 2's complement machines. C also allows logical right shift. But anyway, yes, `uint32_t` is a good choice, and arithmetic right shifts were a bug. – Peter Cordes Jul 14 '18 at 14:59
Per Rich Schroeppel in this MIT memo (if you can read past the assembler), the following will reverse the bits in an 8bit byte providing that you have 64bit arithmetic available:
byte = (byte * 0x0202020202ULL & 0x010884422010ULL) % 1023;
Which sort of fans the bits out (the multiply), selects them (the and) and then shrinks them back down (the modulus).
Is it actually an 8bit quantity that you have?

- 99,986
- 12
- 185
- 204
-
1While very clever, on many platforms divide (`/`) and modulo (`%`) are expensive, multi-cycle operations, especially if it isn't a power of 2 that the compiler can optimize into a bit mask operation. – Mike DeSimone Nov 22 '10 at 16:08
-
This is clever but must be at least 20 times slower than the obvious lookup table approach.. – R.. GitHub STOP HELPING ICE Nov 22 '10 at 16:41
-
2@R: depends on your CPU. I'll bet it's three cycles on a modern Intel, all of which are amenable to parallel parts of the pipeline, whereas a table based approach has the major disadvantage of, at best, occupying valuable cache and, at worst, causing a pipeline stall while memory is accessed. – Tommy Nov 22 '10 at 16:51
-
@R.: `% 1023` costs about 2 multiplies + a sub to do it in terms of `x - (x / 1023) * 1023`, using a fixed-point multiplicative inverse for the `/1023` ([Why does GCC use multiplication by a strange number in implementing integer division?](https://stackoverflow.com/q/41183935)). gcc and clang choose to do the multiply by 1023 with shift/sub because it's close to a power of 2. On a modern x86 (with 3-cycle multiply) it looks like about 13 cycle latency for the whole thing, following the dep chain through the `imul` and so on. Good ILP, but a table lookup would have better ILP. – Peter Cordes Jul 13 '18 at 10:09
-
([Godbolt link for previous comment](https://godbolt.org/g/iUYxnV)) Of course if you're tuning for modern x86, you'd use SSSE3 `pshufb` to do parallel 4-bit lookups, and bit-reverse a whole 32-bit int in a few shuffles. – Peter Cordes Jul 13 '18 at 10:10
Nearly a duplicate of Most Efficient Algorithm for Bit Reversal ( from MSB->LSB to LSB->MSB) in C (which has a lot of answers, including one AVX2 answer for reversing every 8-bit char in an array).
X86
On x86 with SSSE3 (Core2 and later, Bulldozer and later), pshufb
(_mm_shuffle_epi8
) can be used as a nibble LUT to do 16 lookups in parallel. You only need 8 lookups for the 8 nibbles in a single 32-bit integer, but the real problem is splitting the input bytes into separate nibbles (with their upper half zeroed). It's basically the same problem as for pshufb
-based popcount.
avx2 register bits reverse shows how to do this for a packed vector of 32-bit elements. The same code ported to 128-bit vectors would compile just fine with AVX.
It's still good for a single 32-bit int because x86 has very efficient round-trip between integer and vector regs: int bitrev = _mm_cvtsi128_si32 ( rbit32( _mm_cvtsi32_si128(input) ) );
. That only costs 2 extra movd
instructions to get an integer from an integer register into XMM and back. (Round trip latency = 3 cycles on an Intel CPU like Haswell.)
ARM:
rbit
has single-cycle latency, and does a whole 32-bit integer in one instruction.

- 328,167
- 45
- 605
- 847
The naive / slow / simple way is to extract the low bit of the input and shift it into another variable that accumulates a return value.
#include <stdint.h>
uint32_t mirror_u32(uint32_t input) {
uint32_t returnval = 0;
for (int i = 0; i < 32; ++i) {
int bit = input & 0x01;
returnval <<= 1;
returnval += bit; // Shift the isolated bit into returnval
input >>= 1;
}
return returnval;
}
For other types, the number of bits of storage is sizeof(input) * CHAR_BIT
, but that includes potential padding bits that aren't part of the value. The fixed-width types are a good idea here.
The +=
instead of |=
makes gcc compile it more efficiently for x86 (using x86's shift-and-add instruction, LEA). Of course, there are much faster ways to bit-reverse; see the other answers. This loop is good for small code size (no large masks), but otherwise pretty much no advantage.
Compilers unfortunately don't recognize this loop as a bit-reverse and optimize it to ARM rbit
or whatever. (See it on the Godbolt compiler explorer)

- 328,167
- 45
- 605
- 847

- 11,655
- 1
- 30
- 43
Fastest approach is almost sure to be a lookup table:
out[0]=lut[in[3]];
out[1]=lut[in[2]];
out[2]=lut[in[1]];
out[3]=lut[in[0]];
Or if you can afford 128k of table data (by afford, I mean cpu cache utilization, not main memory or virtual memory utilization), use 16-bit units:
out[0]=lut[in[1]];
out[1]=lut[in[0]];

- 208,859
- 35
- 376
- 711
If you are interested in a more embedded approach, when I worked with an armv7a
system, I found the RBIT
command.
So within a C function using GNU extended asm
I could use:
uint32_t bit_reverse32(uint32_t inp32)
{
uint32_t out = 0;
asm("RBIT %0, %1" : "=r" (out) : "r" (inp32));
return out;
}
There are compilers which expose intrinsic C wrappers like this. (armcc __rbit
) and gcc
also has some intrinsic via ACLE but with gcc-arm-linux-gnueabihf
I could not find __rbit
C so I came up with the upper code.
I didn't look, but I suppose on other platforms you could create similar solutions.

- 421
- 3
- 15
-
This is an ARM instruction. Most other ISAs don't have an equivalent instruction. Rust does have a builtin for it, [`return foo.reverse_bits()`](https://doc.rust-lang.org/std/primitive.u32.html#method.reverse_bits). But even C++20 `
` omitted it, only providing `std::byte_swap` (https://en.cppreference.com/w/cpp/header/bit) – Peter Cordes Aug 04 '22 at 09:29
I think I would make a lookup table of bitpatterns 0-255. Read each byte and with the lookup table reverse that byte and afterwards arrange the resulting bytes appropriately.

- 6,514
- 8
- 32
- 37
-
1The really cool thing is that an 8 bit table lookup can be done in a single instruction (XLAT) in Intel x86 assembly. Not one of the fastest instructions, but it does it in a single relatively quick instruction! :-) – Brian Knoblauch Nov 23 '10 at 19:58
quint64 mirror(quint64 a,quint8 l=64) {
quint64 b=0;
for(quint8 i=0;i<l;i++) {
b|=(a>>(l-i-1))&((quint64)1<<i);
}
return b;
}
This function mirroring less then 64 bits. For instance it can mirroring 12 bits.
quint64 and quint8 are defined in Qt. But it possible redefine it in anyway.

- 169,130
- 45
- 262
- 238

- 1
- 1
If you have been staring at Mike DeSimone's great answer (like me), here is a "visualization" on the first 3 assignment, with a uint8_t
example:
b7 b6 b5 b4 b3 b2 b1 b0
-> <- -> <- <- -> <- ->
----> <---- ----> <----
----------> <----------
So first, bitwise swap, then "two-bit-group" swap and so on.

- 421
- 3
- 15
-
This would be better as an edit to [the answer you're referencing](https://stackoverflow.com/questions/4245936/mirror-bits-of-a-32-bit-word/4245986#4245986); it doesn't really stand alone as an answer. – Peter Cordes Oct 16 '18 at 15:50
-
-
Thanks for helping make Stack Overflow better; that's a good addition to that answer. – Peter Cordes Oct 18 '18 at 14:51
For sure most people won't consider my approach neither as elegant nor efficient: it's aimed at being portable and somehow "straightforward".
#include <limits.h> // CHAR_BIT
unsigned bit_reverse( unsigned s ) {
unsigned d;
int i;
for( i=CHAR_BIT*sizeof( unsigned ),d=0; i; s>>=1,i-- ) {
d <<= 1;
d |= s&1;
}
return d;
}
This function pulls the least significant bit from the source bistring s
and pushes it as the most significant bit in the destination bitstring d
.
You can replace unsigned
data type with whatever suits your case, from unsigned char
(CHAR_BIT
bits, usually 8) to unsigned long long
(128 bits in modern 64-bit CPUs).
Of course, there can be CPU-specific instructions (or instruction sets) that could be used instead of my plain C code.
But than that wouldn't be "C language" but rather assembly instruction(s) in a C wrapper.

- 3,107
- 2
- 22
- 25
-
If you're aiming for portable, use `CHAR_BIT` instead of hard-coding `8`. Some modern DSPs are word-addressable and thus have 16, 24, or even 32-bit `char`, so it's not just a matter of legacy machines with 9-bit bytes or whatever. – Peter Cordes Apr 09 '21 at 20:47
int mirror (int input)
{// return bit mirror of 8 digit number
int tmp2;
int out=0;
for (int i=0; i<8; i++)
{
out = out << 1;
tmp2 = input & 0x01;
out = out | tmp2;
input = input >> 1;
}
return out;
}
-
-
Same loop a Simone's answer, but for 8-bit and with different var names. Actually, the different order of operations in Simone's answer is a bug. – Peter Cordes Jul 13 '18 at 23:41