mirror bits of a 32 bit word

Question

How would you do that in C? (Example: 10110001 becomes 10001101 if we had to mirror 8 bits). Are there any instructions on certain processors that would simplify this task?

"mirror" is an OK word but most folks would probably call it "bit reversal". — President James K. Polk, Nov 22 '10 at 13:40
@GregS: Thanks, that explains why I had trouble googling it. — Thomas, Nov 22 '10 at 13:42
From a deleted link-only answer: http://graphics.stanford.edu/~seander/bithacks.html#BitReverseObvious (some more efficient methods are listed, too). On modern x86, you'd probably want to use SSSE3 `pshufb` as a parallel nibble LUT. (Same for any other SIMD ISA with a byte shuffle.) ARM has an `rbit` instruction that does the whole task in one efficient instruction. — Peter Cordes, Jul 13 '18 at 10:11
Does this answer your question? [Efficient Algorithm for Bit Reversal (from MSB->LSB to LSB->MSB) in C](https://stackoverflow.com/questions/746171/efficient-algorithm-for-bit-reversal-from-msb-lsb-to-lsb-msb-in-c) — phuclv, Apr 29 '22 at 15:09
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=50481 and https://clang.llvm.org/docs/LanguageExtensions.html#builtin-bitreverse — artless noise, Oct 27 '22 at 19:29

Mike DeSimone · Answer 1 · 2018-12-06T04:50:00.727

15

It's actually called "bit reversal", and is commonly done in FFT scrambling. The O(log N) way is (for up to 32 bits):

uint32_t reverse(uint32_t x, int bits)
{
    x = ((x & 0x55555555) << 1) | ((x & 0xAAAAAAAA) >> 1); // Swap _<>_
    x = ((x & 0x33333333) << 2) | ((x & 0xCCCCCCCC) >> 2); // Swap __<>__
    x = ((x & 0x0F0F0F0F) << 4) | ((x & 0xF0F0F0F0) >> 4); // Swap ____<>____
    x = ((x & 0x00FF00FF) << 8) | ((x & 0xFF00FF00) >> 8); // Swap ...
    x = ((x & 0x0000FFFF) << 16) | ((x & 0xFFFF0000) >> 16); // Swap ...
    return x >> (32 - bits);
}

Maybe this small "visualization" helps:
An example of the first 3 assignment, with a uint8_t example:

b7 b6 b5 b4  b3 b2 b1 b0
-> <- -> <-  -> <- -> <-
----> <----  ----> <----
---------->  <----------

Well, if we're doing ASCII art, here's mine:

7 6 5 4 3 2 1 0
 X   X   X   X 
6 7 4 5 2 3 0 1
 \ X /   \ X /
  X X     X X
 / X \   / X \
4 5 6 7 0 1 2 3
 \ \ \ X / / /
  \ \ X X / /
   \ X X X /
    X X X X
   / X X X \
  / / X X \ \
 / / / X \ \ \
0 1 2 3 4 5 6 7

It kind of looks like FFT butterflies. Which is why it pops up with FFTs.

edited Dec 06 '18 at 04:50

answered Nov 22 '10 at 13:41

Mike DeSimone

41,631
10
72
96

You're probably better off using a `uint_fast32_t` for `x` rather than a signed type (that may be less than 32 bits wide). Those shifts are UB on signed types. – Toby Speight Jul 13 '18 at 12:19
@TobySpeight Yeah, I copied this from C++ code where the type was `uint32_t` originally and there were overloads for different widths. It predates `uint_fast32_t` being available in the compiler. – Mike DeSimone Jul 14 '18 at 14:42
1

@TobySpeight: right shift of a signed integer is implementation defined, *not* UB. (As long as the shift count is smaller than the type width, of course, same as for unsigned.) It's an arithmetic right shift on all implementations I'm aware of, but those are all 2's complement machines. C also allows logical right shift. But anyway, yes, `uint32_t` is a good choice, and arithmetic right shifts were a bug. – Peter Cordes Jul 14 '18 at 14:59

score 3 · Answer 2 · answered Nov 22 '10 at 13:42

3

Per Rich Schroeppel in this MIT memo (if you can read past the assembler), the following will reverse the bits in an 8bit byte providing that you have 64bit arithmetic available:

byte = (byte * 0x0202020202ULL & 0x010884422010ULL) % 1023;

Which sort of fans the bits out (the multiply), selects them (the and) and then shrinks them back down (the modulus).

Is it actually an 8bit quantity that you have?

answered Nov 22 '10 at 13:42

Tommy

99,986
12
185
204

1

While very clever, on many platforms divide (`/`) and modulo (`%`) are expensive, multi-cycle operations, especially if it isn't a power of 2 that the compiler can optimize into a bit mask operation. – Mike DeSimone Nov 22 '10 at 16:08
This is clever but must be at least 20 times slower than the obvious lookup table approach.. – R.. GitHub STOP HELPING ICE Nov 22 '10 at 16:41
2

@R: depends on your CPU. I'll bet it's three cycles on a modern Intel, all of which are amenable to parallel parts of the pipeline, whereas a table based approach has the major disadvantage of, at best, occupying valuable cache and, at worst, causing a pipeline stall while memory is accessed. – Tommy Nov 22 '10 at 16:51
@R.: `% 1023` costs about 2 multiplies + a sub to do it in terms of `x - (x / 1023) * 1023`, using a fixed-point multiplicative inverse for the `/1023` ([Why does GCC use multiplication by a strange number in implementing integer division?](https://stackoverflow.com/q/41183935)). gcc and clang choose to do the multiply by 1023 with shift/sub because it's close to a power of 2. On a modern x86 (with 3-cycle multiply) it looks like about 13 cycle latency for the whole thing, following the dep chain through the `imul` and so on. Good ILP, but a table lookup would have better ILP. – Peter Cordes Jul 13 '18 at 10:09
([Godbolt link for previous comment](https://godbolt.org/g/iUYxnV)) Of course if you're tuning for modern x86, you'd use SSSE3 `pshufb` to do parallel 4-bit lookups, and bit-reverse a whole 32-bit int in a few shuffles. – Peter Cordes Jul 13 '18 at 10:10

score 3 · Answer 3 · answered Jul 13 '18 at 13:20

Nearly a duplicate of Most Efficient Algorithm for Bit Reversal ( from MSB->LSB to LSB->MSB) in C (which has a lot of answers, including one AVX2 answer for reversing every 8-bit char in an array).

X86

On x86 with SSSE3 (Core2 and later, Bulldozer and later), pshufb (_mm_shuffle_epi8) can be used as a nibble LUT to do 16 lookups in parallel. You only need 8 lookups for the 8 nibbles in a single 32-bit integer, but the real problem is splitting the input bytes into separate nibbles (with their upper half zeroed). It's basically the same problem as for pshufb-based popcount.

avx2 register bits reverse shows how to do this for a packed vector of 32-bit elements. The same code ported to 128-bit vectors would compile just fine with AVX.

It's still good for a single 32-bit int because x86 has very efficient round-trip between integer and vector regs: int bitrev = _mm_cvtsi128_si32 ( rbit32( _mm_cvtsi32_si128(input) ) );. That only costs 2 extra movd instructions to get an integer from an integer register into XMM and back. (Round trip latency = 3 cycles on an Intel CPU like Haswell.)

ARM:

rbit has single-cycle latency, and does a whole 32-bit integer in one instruction.

score 2 · Answer 4 · edited Jul 14 '18 at 00:06

The naive / slow / simple way is to extract the low bit of the input and shift it into another variable that accumulates a return value.

#include <stdint.h>

uint32_t mirror_u32(uint32_t input) {
    uint32_t returnval = 0;
    for (int i = 0; i < 32; ++i) {
        int bit = input & 0x01;
        returnval <<= 1;
        returnval += bit;    // Shift the isolated bit into returnval
        input >>= 1;
    }
    return returnval;
}

For other types, the number of bits of storage is sizeof(input) * CHAR_BIT, but that includes potential padding bits that aren't part of the value. The fixed-width types are a good idea here.

The += instead of |= makes gcc compile it more efficiently for x86 (using x86's shift-and-add instruction, LEA). Of course, there are much faster ways to bit-reverse; see the other answers. This loop is good for small code size (no large masks), but otherwise pretty much no advantage.

Compilers unfortunately don't recognize this loop as a bit-reverse and optimize it to ARM rbit or whatever. (See it on the Godbolt compiler explorer)

Not going to return your result? – Mike DeSimone Nov 22 '10 at 13:42 — Mike DeSimone, Nov 22 '10 at 13:42
`void` function with `return`? :-) – Christopher Creutzig Nov 22 '10 at 13:49 — Christopher Creutzig, Nov 22 '10 at 13:49

score 2 · Answer 5 · answered Nov 22 '10 at 16:45

Fastest approach is almost sure to be a lookup table:

out[0]=lut[in[3]];
out[1]=lut[in[2]];
out[2]=lut[in[1]];
out[3]=lut[in[0]];

Or if you can afford 128k of table data (by afford, I mean cpu cache utilization, not main memory or virtual memory utilization), use 16-bit units:

out[0]=lut[in[1]];
out[1]=lut[in[0]];

score 2 · Answer 6 · answered Oct 18 '18 at 08:42

If you are interested in a more embedded approach, when I worked with an armv7a system, I found the RBIT command.

So within a C function using GNU extended asm I could use:

uint32_t bit_reverse32(uint32_t inp32)
{
    uint32_t out = 0;
    asm("RBIT %0, %1" : "=r" (out) : "r" (inp32));
    return out;
}

There are compilers which expose intrinsic C wrappers like this. (armcc __rbit) and gcc also has some intrinsic via ACLE but with gcc-arm-linux-gnueabihf I could not find __rbit C so I came up with the upper code.

I didn't look, but I suppose on other platforms you could create similar solutions.

This is an ARM instruction. Most other ISAs don't have an equivalent instruction. Rust does have a builtin for it, [`return foo.reverse_bits()`](https://doc.rust-lang.org/std/primitive.u32.html#method.reverse_bits). But even C++20 `` omitted it, only providing `std::byte_swap` (https://en.cppreference.com/w/cpp/header/bit) — Peter Cordes, Aug 04 '22 at 09:29

score 1 · Answer 7 · edited Sep 27 '12 at 16:57

1

I've also just figured out a minimal solution for mirroring 4 bits (a nibble) in only 16 bits temporary space.

mirr = ( (orig * 0x222) & 0x1284 ) % 63

edited Sep 27 '12 at 16:57

jonsca

10,218
26
54
62

answered Aug 31 '12 at 17:36

mastamac

21
1

score 0 · Answer 8 · answered Nov 22 '10 at 13:53

0

I think I would make a lookup table of bitpatterns 0-255. Read each byte and with the lookup table reverse that byte and afterwards arrange the resulting bytes appropriately.

answered Nov 22 '10 at 13:53

onemasse

6,514
8
32
37

1

The really cool thing is that an 8 bit table lookup can be done in a single instruction (XLAT) in Intel x86 assembly. Not one of the fastest instructions, but it does it in a single relatively quick instruction! :-) – Brian Knoblauch Nov 23 '10 at 19:58

score 0 · Answer 9 · edited Apr 30 '15 at 21:55

0

quint64 mirror(quint64 a,quint8 l=64) {
    quint64 b=0;
    for(quint8 i=0;i&lt;l;i++) {
        b|=(a>>(l-i-1))&((quint64)1<<i);
    }
return b;
}

This function mirroring less then 64 bits. For instance it can mirroring 12 bits.

quint64 and quint8 are defined in Qt. But it possible redefine it in anyway.

edited Apr 30 '15 at 21:55

Alex Riley

169,130
45
262
238

answered Nov 23 '10 at 20:06

porter

1
1

lnksz · Answer 10 · 2018-10-18T07:58:54.433

0

If you have been staring at Mike DeSimone's great answer (like me), here is a "visualization" on the first 3 assignment, with a uint8_t example:

b7 b6 b5 b4  b3 b2 b1 b0
-> <- -> <-  <- -> <- ->
----> <----  ----> <----
---------->  <----------

So first, bitwise swap, then "two-bit-group" swap and so on.

edited Oct 18 '18 at 07:58

answered Oct 16 '18 at 14:16

lnksz

421
3
15

This would be better as an edit to [the answer you're referencing](https://stackoverflow.com/questions/4245936/mirror-bits-of-a-32-bit-word/4245986#4245986); it doesn't really stand alone as an answer. – Peter Cordes Oct 16 '18 at 15:50
@PeterCordes Added as comment – lnksz Oct 18 '18 at 08:08
Thanks for helping make Stack Overflow better; that's a good addition to that answer. – Peter Cordes Oct 18 '18 at 14:51

EnzoR · Answer 11 · 2021-04-10T10:33:52.597

For sure most people won't consider my approach neither as elegant nor efficient: it's aimed at being portable and somehow "straightforward".

#include <limits.h> // CHAR_BIT

unsigned bit_reverse( unsigned s ) {
  unsigned d;
  int i;
  for( i=CHAR_BIT*sizeof( unsigned ),d=0; i; s>>=1,i-- ) {
    d <<= 1;
    d |= s&1;
  }
  return d;
}

This function pulls the least significant bit from the source bistring s and pushes it as the most significant bit in the destination bitstring d.

You can replace unsigned data type with whatever suits your case, from unsigned char (CHAR_BIT bits, usually 8) to unsigned long long (128 bits in modern 64-bit CPUs).

Of course, there can be CPU-specific instructions (or instruction sets) that could be used instead of my plain C code.

But than that wouldn't be "C language" but rather assembly instruction(s) in a C wrapper.

If you're aiming for portable, use `CHAR_BIT` instead of hard-coding `8`. Some modern DSPs are word-addressable and thus have 16, 24, or even 32-bit `char`, so it's not just a matter of legacy machines with 9-bit bytes or whatever. — Peter Cordes, Apr 09 '21 at 20:47

score -2 · Answer 12 · answered Jul 13 '18 at 09:12

-2

int mirror (int input)
{// return bit mirror of 8 digit number 
  int tmp2;
  int out=0;
  for (int i=0; i<8; i++)
    {
      out = out << 1;
      tmp2 = input & 0x01;
      out = out | tmp2;
      input = input >> 1;        
    }
   return out;
}

answered Jul 13 '18 at 09:12

Eyal Amer

1

Please add some explanation. – petezurich Jul 13 '18 at 11:18
Same loop a Simone's answer, but for 8-bit and with different var names. Actually, the different order of operations in Simone's answer is a bug. – Peter Cordes Jul 13 '18 at 23:41

mirror bits of a 32 bit word

12 Answers12

X86

ARM:

Linked