Fastest way to duplicate bits of an integer

Question

What is the fastest way to duplicate the bits of an integer.

For example,

17 -> 10001

after duplication: 1100000011

In what language? What's fast in C is not necessarily going to be the fastest solution in Python. — ShadowRanger, Mar 26 '19 at 00:13
@Kuais C/C++ is not a language. C and C++ are different languages; they are similar in a lot of ways, but they are fundamentally different things. For example, in C++, the best solution might involve a [`std::bitset`](https://en.cppreference.com/w/cpp/utility/bitset), which isn't available in C. — Daniel H, Mar 26 '19 at 00:21
Possible duplicate of [Bit duplication from 8-bit to 32-bit](https://stackoverflow.com/questions/55051490/bit-duplication-from-8-bit-to-32-bit) — , Mar 26 '19 at 03:51
@Kuais: you can accept one of the answers by clicking on the grey checkmark below its score. — chqrlie, Apr 13 '19 at 11:53

Amos · Accepted Answer · 2019-03-26T00:48:43.487

2

Looks like a variation of bit interleaving.

Interleave bits the obvious way

(modified from http://graphics.stanford.edu/~seander/bithacks.html)
unsigned int x = 17;
unsigned int z = 0; // z gets the resulting Morton Number.

for (int i = 0; i < sizeof(x) * CHAR_BIT; i++) // unroll for more speed...
{
  z |= (x & 1U << i) << i | (x & 1U << i) << (i + 1);
}

See http://graphics.stanford.edu/~seander/bithacks.html#InterleaveTableObvious for more approaches.

edited Mar 26 '19 at 00:48

answered Mar 26 '19 at 00:10

Amos

3,238
4
19
41

the return should be `771` but this returns `2313` – Kuais Mar 26 '19 at 00:19
2

@Kuais: The initial value of `y` should be the same as `x`, not `x << 1`. Do that (or just get rid of `y` entirely and use `x` everywhere `y` is used) and it works. The `<< 1` is already handled in the body of the loop. – ShadowRanger Mar 26 '19 at 00:33
2

The expression can be simplified as `z |= (x & 1U << i) * 3 << i;` and the loop test should be `i < (sizeof(x) * CHAR_BIT + 1) / 2`. The expression `(x & 1U << i) << (i + 1)` as undefined behavior for the largest value of `i`. – chqrlie Mar 26 '19 at 00:50

chqrlie · Answer 2 · 2019-03-26T20:36:42.733

2

If the integer size is 32-bits, a very fast way to achieve this involves an array of 256 16-bit words and 2 lookups:

uint16_t dup16[256] = { 0, 3, 12, 15, 48, ... };
uint32_t x = 17
uint32_t z = dup16[x & 255] | ((uint32_t)dup16[(x >> 8) & 255] << 16);

edited Mar 26 '19 at 20:36

answered Mar 26 '19 at 01:00

chqrlie

131,814
10
121
189

Arne Vogel · Answer 3 · 2019-03-28T09:32:51.450

2

For GCC with at least -march=haswell:

#include <cstddef>
#include <iomanip>
#include <iostream>

#include <immintrin.h>

std::uint64_t interleave_pdep(std::uint32_t input)  {
    return _pdep_u64(input, 0x5555555555555555) 
         | _pdep_u64(input, 0xaaaaaaaaaaaaaaaa);
}

int main()
{
    std::uint32_t i;
    std::cin >> std::hex;
    std::cout << std::hex;
    while (std::cin >> i)
        std::cout << interleave_pdep(i) << '\n';
}

(with credits to Daniel Lemire's blog)

E.g. 11 -> 303 (binary number I/O left as an exercise to the reader…)

edited Mar 28 '19 at 09:32

answered Mar 26 '19 at 13:07

Arne Vogel

6,346
2
18
31

1

Please do **not** use internals like `__builtin_ia32_pdep_di` directly, use the official `_pdep_u64` instead. – Marc Glisse Mar 26 '19 at 21:05
@MarcGlisse I tried using that, but couldn't get it to work with GCC 5.5 on Ubuntu 16.04 (`_pdep_u64` was not declared in this scope) and also there is no `pdep` under `/usr/include` except for in some LLVM headers. Documentation is also rather scarce. If you could clear this up, I'd appreciate it. – Arne Vogel Mar 27 '19 at 15:50
3

You need to include `immintrin.h` or `x86intrin.h`. The doc is somewhere on Intel's website. – Marc Glisse Mar 27 '19 at 18:09
I believe "_pdep_u64(input, 0x5555555555555555)*3" may be one cycle faster (6 vs 7) in most CPUs, and "auto d=_pdep_u64(input, 0x5555555555555555); return d|(d<<1);" maybe two (5 vs 7). – paperclip optimizer Jan 24 '22 at 03:48

Richard Hodges · Answer 4 · 2019-03-26T02:08:04.727

I should think the fastest way would be a lookup table.

You can compute it at compile time.

#include <array>
#include <cstdint>
#include <iostream>
#include <iomanip>

constexpr std::uint64_t dup_bit(bool b)
{
    unsigned result = 0;
    if (b) result = 0x3;
    return result;
}

constexpr std::uint64_t slow_dup_bits(std::uint64_t input)
{
    std::uint64_t result = 0;
    for(int i = 16 ; i-- ; )
    {
        result <<= 2;
        result |= dup_bit(input & (1u << i));
    }
    return result;
}

template<std::uint64_t Start, std::uint64_t...Is>
constexpr auto make_dup_table(std::integer_sequence<std::uint64_t, Is...>)
{
    return std::array<std::uint64_t, sizeof...(Is)>
    {{
        slow_dup_bits(Start + Is)...,
    }};
}


std::uint64_t dup_bits(std::uint64_t x)
{
    static const auto table = make_dup_table<0>(std::make_integer_sequence<std::uint64_t, 65536>());
    auto low32 = table[x & 65535];
    auto high32 = table[x >> 16];
    return (high32 << 32) + low32;
}

int main()
{
    for(std::uint64_t i = 0 ; i < 64 ; ++i)
        std::cout << std::setw(8) << std::setfill(' ') << i 
            << " -> " 
            << std::hex << std::setw(64) << std::setfill('0') << dup_bits(i) << '\n';
}

expected output:

   0 -> 0000000000000000000000000000000000000000000000000000000000000000
   1 -> 0000000000000000000000000000000000000000000000000000000000000003
   2 -> 000000000000000000000000000000000000000000000000000000000000000c
   3 -> 000000000000000000000000000000000000000000000000000000000000000f
   4 -> 0000000000000000000000000000000000000000000000000000000000000030
   5 -> 0000000000000000000000000000000000000000000000000000000000000033
   6 -> 000000000000000000000000000000000000000000000000000000000000003c
   7 -> 000000000000000000000000000000000000000000000000000000000000003f
   8 -> 00000000000000000000000000000000000000000000000000000000000000c0
   9 -> 00000000000000000000000000000000000000000000000000000000000000c3
   a -> 00000000000000000000000000000000000000000000000000000000000000cc
   b -> 00000000000000000000000000000000000000000000000000000000000000cf
   c -> 00000000000000000000000000000000000000000000000000000000000000f0
   d -> 00000000000000000000000000000000000000000000000000000000000000f3
   e -> 00000000000000000000000000000000000000000000000000000000000000fc
   f -> 00000000000000000000000000000000000000000000000000000000000000ff
  10 -> 0000000000000000000000000000000000000000000000000000000000000300
  11 -> 0000000000000000000000000000000000000000000000000000000000000303
  12 -> 000000000000000000000000000000000000000000000000000000000000030c
  13 -> 000000000000000000000000000000000000000000000000000000000000030f
  14 -> 0000000000000000000000000000000000000000000000000000000000000330
  15 -> 0000000000000000000000000000000000000000000000000000000000000333
  16 -> 000000000000000000000000000000000000000000000000000000000000033c
  17 -> 000000000000000000000000000000000000000000000000000000000000033f
  18 -> 00000000000000000000000000000000000000000000000000000000000003c0
  19 -> 00000000000000000000000000000000000000000000000000000000000003c3
  1a -> 00000000000000000000000000000000000000000000000000000000000003cc
  1b -> 00000000000000000000000000000000000000000000000000000000000003cf
  1c -> 00000000000000000000000000000000000000000000000000000000000003f0
  1d -> 00000000000000000000000000000000000000000000000000000000000003f3
  1e -> 00000000000000000000000000000000000000000000000000000000000003fc
  1f -> 00000000000000000000000000000000000000000000000000000000000003ff
  20 -> 0000000000000000000000000000000000000000000000000000000000000c00
  21 -> 0000000000000000000000000000000000000000000000000000000000000c03
  22 -> 0000000000000000000000000000000000000000000000000000000000000c0c
  23 -> 0000000000000000000000000000000000000000000000000000000000000c0f
  24 -> 0000000000000000000000000000000000000000000000000000000000000c30
  25 -> 0000000000000000000000000000000000000000000000000000000000000c33
  26 -> 0000000000000000000000000000000000000000000000000000000000000c3c
  27 -> 0000000000000000000000000000000000000000000000000000000000000c3f
  28 -> 0000000000000000000000000000000000000000000000000000000000000cc0
  29 -> 0000000000000000000000000000000000000000000000000000000000000cc3
  2a -> 0000000000000000000000000000000000000000000000000000000000000ccc
  2b -> 0000000000000000000000000000000000000000000000000000000000000ccf
  2c -> 0000000000000000000000000000000000000000000000000000000000000cf0
  2d -> 0000000000000000000000000000000000000000000000000000000000000cf3
  2e -> 0000000000000000000000000000000000000000000000000000000000000cfc
  2f -> 0000000000000000000000000000000000000000000000000000000000000cff
  30 -> 0000000000000000000000000000000000000000000000000000000000000f00
  31 -> 0000000000000000000000000000000000000000000000000000000000000f03
  32 -> 0000000000000000000000000000000000000000000000000000000000000f0c
  33 -> 0000000000000000000000000000000000000000000000000000000000000f0f
  34 -> 0000000000000000000000000000000000000000000000000000000000000f30
  35 -> 0000000000000000000000000000000000000000000000000000000000000f33
  36 -> 0000000000000000000000000000000000000000000000000000000000000f3c
  37 -> 0000000000000000000000000000000000000000000000000000000000000f3f
  38 -> 0000000000000000000000000000000000000000000000000000000000000fc0
  39 -> 0000000000000000000000000000000000000000000000000000000000000fc3
  3a -> 0000000000000000000000000000000000000000000000000000000000000fcc
  3b -> 0000000000000000000000000000000000000000000000000000000000000fcf
  3c -> 0000000000000000000000000000000000000000000000000000000000000ff0
  3d -> 0000000000000000000000000000000000000000000000000000000000000ff3
  3e -> 0000000000000000000000000000000000000000000000000000000000000ffc
  3f -> 0000000000000000000000000000000000000000000000000000000000000fff

http://coliru.stacked-crooked.com/a/7b92ad167bd54ae7

Alain Merigot · Answer 5 · 2019-03-26T22:29:35.587

A solution without loops or LUTs.

It works on 4 bits nibbles. First the nibble is duplicated and shifted by a multplication. Then interesting bits are masked and duplicated. Then all bits are gathered together by shifts and ors.

int duplicate4(unsigned char c){
  // c is a 4bits nibble. 
  // Assume c=2#dcba
  unsigned long long spread=(1+(1<<9)+(1<<18)+(1<<27));
       // to duplicate the nibble
       //=2#00001000.00000100.00000010.00000001
  unsigned long long ll=c*spread;
  // ll =   0dcba000.00dcba00.000dcba0.0000dcba
  unsigned long long mask=(1+(1<<10)+(1<<20)+(1<<30));
       // to extract bits
       //=2#01000000.00010000.00000100.00000001
  ll &= mask;
  // ll = 0d000000.000c0000.00000b00.0000000a
  ll |= (ll<<1) ;
  // ll = dd000000.00cc0000.0000bb00.000000aa
  ll |= (ll >> 16) ;
  // ll = dd000000.00cc0000.dd00bb00.00cc00aa
  ll |= (ll >> 8) ;
  // ll = dd000000.00cc0000.dd00bb00.ddccbbaa
  ll &= 0xff ;
  // ll = 00000000.00000000.00000000.ddccbbaa
  return ll;
}

int duplicate8(unsigned char c){
  return duplicate4(c&0xf) + 256*duplicate4((c&0xf0)>>4);
}

Fastest way to duplicate bits of an integer

5 Answers5

Interleave bits the obvious way