How can I shuffle bits efficiently?

Question

I need to shuffle a 16 bit unsigned integer in a way that the even indexes land in the lower byte, and the odd indexes land in the upper byte.

input:
fedcba98 76543210 (contiguously numbered)

output:
fdb97531 eca86420 (even and odd separated)

My code looks like this at the moment:

typedef unsigned short u16;

u16 segregate(u16 x)
{
    u16 g = (x & 0x0001);
    u16 h = (x & 0x0004) >> 1;
    u16 i = (x & 0x0010) >> 2;
    u16 j = (x & 0x0040) >> 3;
    u16 k = (x & 0x0100) >> 4;
    u16 l = (x & 0x0400) >> 5;
    u16 m = (x & 0x1000) >> 6;
    u16 n = (x & 0x4000) >> 7;

    u16 o = (x & 0x0002) << 7;
    u16 p = (x & 0x0008) << 6;
    u16 q = (x & 0x0020) << 5;
    u16 r = (x & 0x0080) << 4;
    u16 s = (x & 0x0200) << 3;
    u16 t = (x & 0x0800) << 2;
    u16 u = (x & 0x2000) << 1;
    u16 v = (x & 0x8000);

    return g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v;
}

I wonder if there is a more elegant solution than simply extracting and shifting each individual bit?

"looks very slow" Put a profiler on it. That will tell you if it is in fact slow. — Almo, Sep 09 '14 at 17:47
It looks slow, but is it *actually* too slow for your particular application? Measure twice, cut once. — Robert Harvey, Sep 09 '14 at 17:47
[Related](http://stackoverflow.com/questions/4909263/how-to-efficiently-de-interleave-bits-inverse-morton), I think. — jrok, Sep 09 '14 at 17:57
Just feed "0 2 4 6 8 10 12 14 1 3 5 7 9 11 13 15" to this page: [" Code generator for bit permutations"](http://programming.sirrida.de/calcperm.php). — Evgeny Kluev, Sep 09 '14 at 18:20
@EvgenyKluev I tried "0 2 4 6 8 a c e 1 3 5 7 9 b d f 10 12 14 16 18 1a 1c 1e 11 13 15 17 19 1b 1d 1f" to process two 16-bit words in parallel. How can I expand that to four 16-bit words? What additional mask and shift do I need? — fredoverflow, Sep 09 '14 at 19:46
Following the logic of generated 32-bit program, to transform four 16-bit words, you only have to extend all masks to 64 bits, but I haven't tested this. Actually you could do even 8 or 16 parallel transforms with SIMD. — Evgeny Kluev, Sep 09 '14 at 20:22
Or you could combine table lookup approach with part of the generated code (its third line, which exchanges 4-bit chunks). Instead of the first two generated lines do two table lookups in a single 256-byte table (to shuffle each source byte separately). This allows to keep lookup table small, uses only 2 memory accesses, needs not too much calculations (i.e. balances calculations and memory accesses), but does not allow parallel processing. — Evgeny Kluev, Sep 09 '14 at 21:16
@EvgenyKluev Awesome, the 64 bit shuffling works! Want to write that down in an answer? Since I ended up using that solution, I would accept your answer. — fredoverflow, Sep 10 '14 at 07:22

Nils Pipenbrinck · Answer 1 · 2014-09-09T19:30:50.083

14

The table approach shown by others is the most portable version and is probably quite fast.

If you want to take advantage of special instruction sets there are some other options as well. For Intel Haswell and later for example the following approach can be used (requires the BMI2 instruction set extension):

unsigned segregate_bmi (unsigned arg)
{
  unsigned oddBits  = _pext_u32(arg,0x5555);
  unsigned evenBits = _pext_u32(arg,0xaaaa);
  return (oddBits | (evenBits << 8));
}

edited Sep 09 '14 at 19:30

answered Sep 09 '14 at 19:23

Nils Pipenbrinck

83,631
31
151
221

2

Cool instruction! "For each bit set in the mask, the intrinsic extracts the corresponding bits from the first source operand and writes them into contiguous lower bits of the destination. The remaining upper bits of the destination are set to 0." (says [Intel](https://software.intel.com/sites/products/documentation/studio/composer/en-us/2011Update/compiler_c/intref_cls/common/intref_avx2_pext_u.htm)). I bet this is meant for some graphics processing. – Jongware Sep 09 '14 at 21:28
1

@Jongware Yup. It does all kinds of bit-field extraction. Together with it's brother instruction pdep you can do any kind of permutations and bit shuffles very fast. – Nils Pipenbrinck Sep 09 '14 at 21:32
Is there a `IsProcessorFeaturePresent` check for this? (`cpuid` is unreliable on multiprocessor) – user541686 Jun 14 '21 at 23:59

score 13 · Accepted Answer · edited May 23 '17 at 10:30

There is a very convenient web resource that helps solving many bit permutation problems: Code generator for bit permutations. In this particular case feeding "0 2 4 6 8 10 12 14 1 3 5 7 9 11 13 15" to this page produces pretty fast code.

Unfortunately this code generator cannot produce 64-bit code (though anyone could download sources and add this option). So if we need to perform 4 permutations in parallel using 64-bit instructions, we have to extend all involved bitmasks to 64 bits manually:

uint64_t bit_permute_step(uint64_t x, uint64_t m, unsigned shift) {
  uint64_t t;
  t = ((x >> shift) ^ x) & m;
  x = (x ^ t) ^ (t << shift);
  return x;
}

uint64_t segregate4(uint64_t x)
{ // generated by http://programming.sirrida.de/calcperm.php, extended to 64-bit
  x = bit_permute_step(x, 0x2222222222222222ull, 1);
  x = bit_permute_step(x, 0x0c0c0c0c0c0c0c0cull, 2);
  x = bit_permute_step(x, 0x00f000f000f000f0ull, 4);
  return x;
}

Level of parallelism could be increased even more (8 or 16 permutations at once) with SSE instructions. (And recent versions of gcc can vectorize this code automatically).

If parallelism is not required and data cache is not extensively used by other parts of the program, better alternative would be to use lookup table. Various LUT approacehes are already discussed in other answers, still some more could be said here:

The first and the last bits of 16-bit word are never permuted, we need to shuffle only bits 1..14. So (if we want to perform the task with single LUT access) it is enough to have a LUT with 16K entries which means 32K of memory.
We could combine table lookup and computation approaches. Two lookups in a single 256-byte table could shuffle each source byte separately. After this we only need to exchange two middle 4-bit nibbles. This allows to keep lookup table small, uses only 2 memory accesses, and needs not too much calculations (i.e. balances calculations and memory accesses).

Here is implementation of second approach:

#define B10(x)          x+0x00,      x+0x10,      x+0x01,      x+0x11
#define B32(x)      B10(x+0x00), B10(x+0x20), B10(x+0x02), B10(x+0x22)
#define B54(x)      B32(x+0x00), B32(x+0x40), B32(x+0x04), B32(x+0x44)
uint8_t lut[256] = {B54(  0x00), B54(  0x80), B54(  0x08), B54(  0x88)};
#undef B54
#undef B32
#undef B10

uint_fast16_t segregateLUT(uint_fast16_t x)
{
  uint_fast16_t low = lut[x & 0x00ff];
  low |= low << 4;
  uint_fast16_t high = lut[x >> 8] << 4;
  high |= high << 4;
  return (low & 0x0f0f) | (high & 0xf0f0);
}

But fastest approach (if portability is not an issue) is using pext instruction from BMI2 instruction set as noted by Nils Pipenbrinck. With a pair of 64-bit pext we could perform 4 16-bit shuffles in parallel. Since pext instruction is intended exactly for this kind of bit permutations, this approach easily outperforms all others.

score 12 · Answer 3 · answered Sep 09 '14 at 17:53

12

You could use a 256-byte table for each byte of your 16-bit number, crafted so that your even/odd condition is satisfied. Hand-code the table entries (or use the algorithm you already have) to create the tables, and then the shuffling will be done at compile time. That would essentially be a translation table concept.

answered Sep 09 '14 at 17:53

Logicrat

4,438
16
22

2

I agree. That's the fastest way to shuffle. You can use an array or a map and it will be an O(1) operation. – ventsyv Sep 09 '14 at 18:15
(Side note: One should always run benchmarks, particularly at such a low level: Using a lookup table instead of few OR/SHIFT instructions *might* have a negative impact on performance due to caching...) – Marco13 Sep 10 '14 at 10:06

score 6 · Answer 4 · answered Sep 09 '14 at 18:16

You could use a 256-byte table for each byte of your 16-bit number, crafted so that your even/odd condition is satisfied.

Ah yes, lookup tables to the rescue :) You can even do it with a single table and one extra shift:

u16 every_other[256] = {
0x00, 0x01, 0x00, 0x01, 0x02, 0x03, 0x02, 0x03, 
0x00, 0x01, 0x00, 0x01, 0x02, 0x03, 0x02, 0x03, 
0x04, 0x05, 0x04, 0x05, 0x06, 0x07, 0x06, 0x07, 
0x04, 0x05, 0x04, 0x05, 0x06, 0x07, 0x06, 0x07, 
0x00, 0x01, 0x00, 0x01, 0x02, 0x03, 0x02, 0x03, 
0x00, 0x01, 0x00, 0x01, 0x02, 0x03, 0x02, 0x03, 
0x04, 0x05, 0x04, 0x05, 0x06, 0x07, 0x06, 0x07, 
0x04, 0x05, 0x04, 0x05, 0x06, 0x07, 0x06, 0x07, 
0x08, 0x09, 0x08, 0x09, 0x0a, 0x0b, 0x0a, 0x0b, 
0x08, 0x09, 0x08, 0x09, 0x0a, 0x0b, 0x0a, 0x0b, 
0x0c, 0x0d, 0x0c, 0x0d, 0x0e, 0x0f, 0x0e, 0x0f, 
0x0c, 0x0d, 0x0c, 0x0d, 0x0e, 0x0f, 0x0e, 0x0f, 
0x08, 0x09, 0x08, 0x09, 0x0a, 0x0b, 0x0a, 0x0b, 
0x08, 0x09, 0x08, 0x09, 0x0a, 0x0b, 0x0a, 0x0b, 
0x0c, 0x0d, 0x0c, 0x0d, 0x0e, 0x0f, 0x0e, 0x0f, 
0x0c, 0x0d, 0x0c, 0x0d, 0x0e, 0x0f, 0x0e, 0x0f, 
0x00, 0x01, 0x00, 0x01, 0x02, 0x03, 0x02, 0x03, 
0x00, 0x01, 0x00, 0x01, 0x02, 0x03, 0x02, 0x03, 
0x04, 0x05, 0x04, 0x05, 0x06, 0x07, 0x06, 0x07, 
0x04, 0x05, 0x04, 0x05, 0x06, 0x07, 0x06, 0x07, 
0x00, 0x01, 0x00, 0x01, 0x02, 0x03, 0x02, 0x03, 
0x00, 0x01, 0x00, 0x01, 0x02, 0x03, 0x02, 0x03, 
0x04, 0x05, 0x04, 0x05, 0x06, 0x07, 0x06, 0x07, 
0x04, 0x05, 0x04, 0x05, 0x06, 0x07, 0x06, 0x07, 
0x08, 0x09, 0x08, 0x09, 0x0a, 0x0b, 0x0a, 0x0b, 
0x08, 0x09, 0x08, 0x09, 0x0a, 0x0b, 0x0a, 0x0b, 
0x0c, 0x0d, 0x0c, 0x0d, 0x0e, 0x0f, 0x0e, 0x0f, 
0x0c, 0x0d, 0x0c, 0x0d, 0x0e, 0x0f, 0x0e, 0x0f, 
0x08, 0x09, 0x08, 0x09, 0x0a, 0x0b, 0x0a, 0x0b, 
0x08, 0x09, 0x08, 0x09, 0x0a, 0x0b, 0x0a, 0x0b, 
0x0c, 0x0d, 0x0c, 0x0d, 0x0e, 0x0f, 0x0e, 0x0f, 
0x0c, 0x0d, 0x0c, 0x0d, 0x0e, 0x0f, 0x0e, 0x0f};

u16 segregate(u16 x)
{
    return every_other[x & 0xff]
         | every_other[(x >> 8)] << 4
         | every_other[(x >> 1) & 0xff] << 8
         | every_other[(x >> 9)] << 12;
}

Or you could make it a table of 256 uint16_t and `return every_other[x&0xff]|every_other[x>>8]<<4`. — rici, Sep 09 '14 at 19:18
@NickyC As the table maps bytes to nibbles, values are bound to repeat. — fredoverflow, Sep 10 '14 at 18:26
@FredOverflow OK, repeating with good reason, it is good enough. — , Sep 11 '14 at 01:02

Yakk - Adam Nevraumont · Answer 5 · 2014-09-09T20:39:03.637

Tables. But generate them at compile time!

namespace details {
  constexpr uint8_t bit( unsigned byte, unsigned n ) {
    return (byte>>n)&1;
  }
  constexpr uint8_t even_bits(uint8_t byte) {
    return bit(byte, 0) | (bit(byte, 2)<<1) | (bit(byte, 4)<<2) | (bit(byte, 6)<<3);
  }
  constexpr uint8_t odd_bits(uint8_t byte) {
    return even_bits(byte/2);
  }
  template<unsigned...>struct indexes{using type=indexes;};
  template<unsigned Max,unsigned...Is>struct make_indexes:make_indexes<Max-1,Max-1,Is...>{};
  template<unsigned...Is>struct make_indexes<0,Is...>:indexes<Is...>{};
  template<unsigned Max>using make_indexes_t=typename make_indexes<Max>::type;

  template<unsigned...Is>
  constexpr std::array< uint8_t, 256 > even_bit_table( indexes<Is...> ) {
    return { even_bits(Is)... };
  }
  template<unsigned...Is>
  constexpr std::array< uint8_t, 256 > odd_bit_table( indexes<Is...> ) {
    return { odd_bits(Is)... };
  }
  constexpr std::array< uint8_t, 256 > even_bit_table() {
    return even_bit_table( make_indexes_t<256>{} );
  }
  constexpr std::array< uint8_t, 256 > odd_bit_table() {
    return odd_bit_table( make_indexes_t<256>{} );
  }

  static constexpr auto etable = even_bit_table();
  static constexpr auto otable = odd_bit_table();
}

uint8_t constexpr even_bits( uint16_t in ) {
  return details::etable[(uint8_t)in] | ((details::etable[(uint8_t)(in>>8)])<<4);
}
uint8_t constexpr odd_bits( uint16_t in ) {
  return details::otable[(uint8_t)in] | ((details::otable[(uint8_t)(in>>8)])<<4);
}

live example

@dyp no reason. Well, `unsigned byte` is sort of funny, but it could be just as funny as a ... function? runtime? parameter. (what do you call non-template parameters?) — Yakk - Adam Nevraumont, Sep 09 '14 at 20:27
@dyp well, I rewrote the live example, and found a reason: as written, `odd_bits` would always run in `O(1)` in either the `uint16_t` or the `` version. Of course, the `` version is also bad to use. So I stuffed everything into `details`. — Yakk - Adam Nevraumont, Sep 09 '14 at 20:38
@dyp it can shift exactly 4 and 8 steps in O(1)! Now, if it took a different amount of time to do an 8 bit array lookup if the index was larger... (everything is O(1) if your input data is limited to 16 bits) — Yakk - Adam Nevraumont, Sep 09 '14 at 20:51

score 1 · Answer 6 · answered Oct 27 '15 at 02:29

your answer to the even and odd bits shuffle for 64 bits is not accurate. To extend the 16 bit solution to 64 bit solution, we need not only to extend the masks, but also cover the swapping interval from 1 all the way to 16:

x = bit_permute_step(x, 0x2222222222222222, 1);
x = bit_permute_step(x, 0x0c0c0c0c0c0c0c0c, 2);
x = bit_permute_step(x, 0x00f000f000f000f0, 4);
**x = bit_permute_step(x, 0x0000ff000000ff00, 8);
x = bit_permute_step(x, 0x00000000ffff0000, 16);**

score 0 · Answer 7 · 2015-10-30T14:49:27.150

0

In favour of being short:

unsigned short segregate(unsigned short x)
{
    x = (x & 0x9999) | (x >> 1 & 0x2222) | (x << 1 & 0x4444);
    x = (x & 0xC3C3) | (x >> 2 & 0x0C0C) | (x << 2 & 0x3030);
    x = (x & 0xF00F) | (x >> 4 & 0x00F0) | (x << 4 & 0x0F00);
    return x;
}

edited Oct 30 '15 at 14:49

answered Sep 10 '14 at 13:22

How can I shuffle bits efficiently?

7 Answers7

Linked