Bit duplication from 8-bit to 32-bit

Question

I'm trying to duplicate an 8-bit value to 32-bit and wanted to ask if it's possible to write a single line algorithm to duplicate the bit values.

For example:

1100 1011 -> 1111 1111 0000 0000 1111 0000 1111 1111

If it's possible, I would like to understand what's the logic behind it.

In other words, your goal is to translate every bit in an 8-bit byte into a nybble in O(1)? — Govind Parmar, Mar 07 '19 at 19:35
The answer is yes. The logic behind that solution is simply that you don't put linebreaks into it. — Ulrich Eckhardt, Mar 07 '19 at 19:35
Can you rely on any processor architecture (e.g. x86) or instruction set (e.g. BMI2)? — Iwillnotexist Idonotexist, Mar 07 '19 at 19:47
I'm trying to write it down on a PIC18 microchip, (PIC18F46J50). @harold it's not avaliable — Play38, Mar 07 '19 at 19:49
This is similar to a [question](https://stackoverflow.com/q/4597940/509868) I once asked. I accepted the answer which suggested using a lookup table, but this [other answer](https://stackoverflow.com/a/4600182/509868) gives an efficient bit-fiddling algorithm, which you could squeeze into a single line of code. — anatolyg, Mar 07 '19 at 20:04
Bit manipulation on PIC will be very slow in this case, because you don't have a barrel shifter and can only [shift by 1](https://en.wikipedia.org/wiki/PIC_instruction_listings#PIC18_high_end_core_devices_(16_bit)), which makes [shift by 1 faster than any other shift count](https://stackoverflow.com/q/3726326/995714) — phuclv, Mar 19 '19 at 16:04

score 6 · Answer 1 · answered Mar 07 '19 at 19:38

6

There are only 256 8-bit values, so a simple lookup table would occupy 1kb, and the lookup is trivial. It's hard to believe that any bithack would have superior performance.

answered Mar 07 '19 at 19:38

rici

234,347
28
237
341

1

That would be a damn long single line :D – Eugene Sh. Mar 07 '19 at 19:39
The sensible way (having read the question correctly the second time round). – Bathsheba Mar 07 '19 at 19:47
3

Or can be reduced to a lookup table of 16 entries and work nibble-wise. Will have more "code" of course. – Eugene Sh. Mar 07 '19 at 19:51
@eugenesh: since the target device seems to only have 8-bit datapaths and no 32-bit register of any form, I guess a 4-entry LUT with 2-bit indices might be appropriate. – rici Mar 07 '19 at 22:05

score 5 · Accepted Answer · 2020-12-08T09:34:30.947

It's simple - solve the simplest case, then do more complex ones.

Case 1: Duplicate 1 bit into an 4 bit value (the simplest).

+---+---------+
| 0 | _ _ _ A |
+---+---------+
| 1 | A A A A |
+---+---------+

This can be done as a simple set of shifts:

x = (x << 0) | (x << 1) | (x << 2) | (x << 3);

Or in a less obvious but faster way:

x = (x << 4) - x;

This step will be the last step in all following cases.

Case 2: Duplicate 2 bits into an 8 bit value.

+---+---------+---------+
| 0 | _ _ _ _ | _ _ A B |
+---+---------+---------+
| 1 | _ _ _ A | _ _ _ B |
+---+---------+---------+
| 2 | A A A A | B B B B |
+---+---------+---------+

Case 3: Duplicate 4 bits into a 16 bit value. How? Just move 2 bits to the upper part to turn it into the case 1! Divide and conquer!

+---+---------+---------+---------+---------+
| 0 | _ _ _ _ | _ _ _ _ | _ _ _ _ | A B C D |
+---+---------+---------+---------+---------+
| 1 | _ _ _ _ | _ _ A B | _ _ _ _ | _ _ C D |
+---+---------+---------+---------+---------+
| 2 | _ _ _ A | _ _ _ B | _ _ _ C | _ _ _ D |
+---+---------+---------+---------+---------+
| 3 | A A A A | B B B B | C C C C | D D D D |
+---+---------+---------+---------+---------+

Case 4: Duplicate 8 bits into a 32 bit value (the original).

+---+---------+---------+---------+---------+---------+---------+---------+---------+
| 0 | _ _ _ _ | _ _ _ _ | _ _ _ _ | _ _ _ _ | _ _ _ _ | _ _ _ _ | A B C D | E F G H |
+---+---------+---------+---------+---------+---------+---------+---------+---------+
| 1 | _ _ _ _ | _ _ _ _ | _ _ _ _ | A B C D | _ _ _ _ | _ _ _ _ | _ _ _ _ | E F G H |
+---+---------+---------+---------+---------+---------+---------+---------+---------+
| 2 | _ _ _ _ | _ _ A B | _ _ _ _ | _ _ C D | _ _ _ _ | _ _ E F | _ _ _ _ | _ _ G H |
+---+---------+---------+---------+---------+---------+---------+---------+---------+
| 3 | _ _ _ A | _ _ _ B | _ _ _ C | _ _ _ D | _ _ _ E | _ _ _ F | _ _ _ G | _ _ _ H |
+---+---------+---------+---------+---------+---------+---------+---------+---------+
| 4 | A A A A | B B B B | C C C C | D D D D | E E E E | F F F F | G G G G | H H H H |
+---+---------+---------+---------+---------+---------+---------+---------+---------+

Can be achieved by the code below:

uint32_t interleave(uint8_t value)
{
    uint32_t x = value;
    x = (x | (x << 12)) /* & 0x000F000F */; // GCC is not able to remove redundant & here
    x = (x | (x <<  6)) & 0x03030303;
    x = (x | (x <<  3)) & 0x11111111;
    x = (x << 4) - x;
    return x;
}

Some test cases to check that it works:

TEST_F(test, interleave)
{
    EXPECT_EQ(interleave(0x00), 0x00000000);
    EXPECT_EQ(interleave(0x11), 0x000F000F);
    EXPECT_EQ(interleave(0x22), 0x00F000F0);
    EXPECT_EQ(interleave(0x33), 0x00FF00FF);
    EXPECT_EQ(interleave(0x44), 0x0F000F00);
    EXPECT_EQ(interleave(0x55), 0x0F0F0F0F);
    EXPECT_EQ(interleave(0x66), 0x0FF00FF0);
    EXPECT_EQ(interleave(0x77), 0x0FFF0FFF);
    EXPECT_EQ(interleave(0x88), 0xF000F000);
    EXPECT_EQ(interleave(0x99), 0xF00FF00F);
    EXPECT_EQ(interleave(0xAA), 0xF0F0F0F0);
    EXPECT_EQ(interleave(0xBB), 0xF0FFF0FF);
    EXPECT_EQ(interleave(0xCC), 0xFF00FF00);
    EXPECT_EQ(interleave(0xDD), 0xFF0FFF0F);
    EXPECT_EQ(interleave(0xEE), 0xFFF0FFF0);
    EXPECT_EQ(interleave(0xFF), 0xFFFFFFFF);

    EXPECT_EQ(interleave(0x01), 0x0000000F);
    EXPECT_EQ(interleave(0x23), 0x00F000FF);
    EXPECT_EQ(interleave(0x45), 0x0F000F0F);
    EXPECT_EQ(interleave(0x67), 0x0FF00FFF);
    EXPECT_EQ(interleave(0x89), 0xF000F00F);
    EXPECT_EQ(interleave(0xAB), 0xF0F0F0FF);
    EXPECT_EQ(interleave(0xCD), 0xFF00FF0F);
    EXPECT_EQ(interleave(0xEF), 0xFFF0FFFF);
}

Clever code! It seems you can further simplify the last step: `x = (x << 4) - x;` — chqrlie, Mar 07 '19 at 22:36
@StaceyGirl The improved version boils down to just 12 instructions when compiled for a Pascal-family GPU as the compiler is able to use multiply-add in two places: `LOP32I.AND R0, R4, 0xff; SHL R3, R0, 0xc; LOP.OR R0, R3, R0; LOP32I.AND R3, R0, 0xc000c; SHL R3, R3, 0x6; LOP3.LUT R0, R3, 0x30003, R0, 0xf8; SHL R3, R0, 0x3; LOP3.LUT R0, R3, c[0x0][0x0], R0, 0xc8; XMAD R5, R0.reuse, 0x7, RZ; SHL R3, R0.reuse, 0x3; XMAD.PSL R0, R0.H1, 0x7, R5; LOP.OR R4, R0, R3;` — njuffa, Mar 07 '19 at 22:36
@chqrlie That modification reduces the code to *ten* instructions on a Pascal-family GPU: `LOP32I.AND R0, R4, 0xff; SHL R3, R0, 0xc; LOP.OR R0, R3, R0; LOP32I.AND R3, R0, 0xc000c; SHL R3, R3, 0x6; LOP3.LUT R0, R3, 0x30003, R0, 0xf8; SHL R3, R0, 0x3; LOP3.LUT R0, R3, c[0x0][0x0], R0, 0xc8; XMAD R3, R0.reuse, 0xf, RZ; XMAD.PSL R4, R0.H1, 0xf, R3;` — njuffa, Mar 07 '19 at 22:41
@StaceyGirl: after more analysis, it seems the mask in `x = (x | (x << 12)) & 0x000F000F;` is redundant too. `x |= x << 12;` should suffice. The second mask might also be redundant but I am not sure yet. — chqrlie, Mar 08 '19 at 17:11
@njuffa: I found further simplifications, see comment above. — chqrlie, Mar 08 '19 at 17:12
@chqrlie Yes indeed, the first masking is redundant, but the second one cannot be removed as valid bits will overwrite themselves. Seems like clang is good at optimizing this - it not only removes redundant operations, but also replaces bit-or with addition which allows it to use `lea` on x86 to perform shift and addition in a single instruction. GCC is lagging behind here. — , Mar 08 '19 at 17:17
@chqrlie With that modification (first mask eliminated), `interleave()` compiles to *nine* instructions for a Pascal-family GPU using CUDA 8.0 (latest toolchain is CUDA 10.0 which I have not installed, so can't say whether latest compiler is able to eliminate that mask automagically). — njuffa, Mar 08 '19 at 17:32
Great solution just one suggest in `x = (x << 0) | (x << 1) | (x << 2) | (x << 3);` `x<<0` is redundant. — , Mar 10 '19 at 15:05
@Gox It is there just for symmetry. In the final code it is replaced with `(x << 4) - x` — , Mar 10 '19 at 15:09
One line solution: `return ((((((val | (val << 12)) | ((val | (val << 12)) << 6)) & 0x03030303) | ((((val | (val << 12)) | ((val | (val << 12)) << 6)) & 0x03030303) << 3)) & 0x11111111) << 4) - (((((val | (val << 12)) | ((val | (val << 12)) << 6)) & 0x03030303) | ((((val | (val << 12)) | ((val | (val << 12)) << 6)) & 0x03030303) << 3)) & 0x11111111);` based this answer. :) — , Mar 10 '19 at 19:56

score 3 · Answer 3 · 2019-03-07T22:08:31.967

This would work:

unsigned int eToTW (unsigned char a) {
    unsigned int output = 0;

    output |= a & 0x80 ? ((unsigned) 0xf) << 28 : 0x0;
    output |= a & 0x40 ? 0xf << 24 : 0x0;
    output |= a & 0x20 ? 0xf << 20 : 0x0;
    output |= a & 0x10 ? 0xf << 16 : 0x0;

    output |= a & 0x8 ? 0xf << 12 : 0x0;
    output |= a & 0x4 ? 0xf << 8 : 0x0;
    output |= a & 0x2 ? 0xf << 4 : 0x0;
    output |= a & 0x1 ? 0xf : 0x0;

    return output;
}

or this:

unsigned int eToTW (unsigned char a) {
    unsigned int output = 0;

    output |= a & (1 << 7) ? ((unsigned) 0xf) << 28 : 0x0;
    output |= a & (1 << 6) ? 0xf << 24 : 0x0;
    output |= a & (1 << 5) ? 0xf << 20 : 0x0;
    output |= a & (1 << 4) ? 0xf << 16 : 0x0;

    output |= a & (1 << 3) ? 0xf << 12 : 0x0;
    output |= a & (1 << 2) ? 0xf << 8 : 0x0;
    output |= a & (1 << 1) ? 0xf << 4 : 0x0;
    output |= a & 1 ? 0xf : 0x0;

    return output;
}

yet another solution:

unsigned int eToTW (unsigned char a) {     
    return (a & 1 << 7 ? ((unsigned) 0xf) << 28 : 0x0) | 
           (a & 1 << 6 ? 0xf << 24 : 0x0) | 
           (a & 1 << 5 ? 0xf << 20 : 0x0) | 
           (a & 1 << 4 ? 0xf << 16 : 0x0) | 
           (a & 1 << 3 ? 0xf << 12 : 0x0) |
           (a & 1 << 2 ? 0xf << 8 : 0x0) |
           (a & 1 << 1 ? 0xf << 4 : 0x0) |
           (a & 1 ? 0xf : 0x0);
}

`0xf << 28` has undefined behavior: **C17 6.5.7 Bitwise shift operators** *The result of `E1 << E2` is `E1` left-shifted `E2` bit positions; vacated bits are filled with zeros. If `E1` has an unsigned type, the value of the result is `E1` × 2`E2`, reduced modulo one more than the maximum value representable in the result type. If `E1` has a signed type and nonnegative value, and `E1 × 2`E2` is representable in the result type, then that is the resulting value; otherwise, the behavior is undefined.* `0xf` has type `int`, `0xf << 28` is UB on 32-bit systems. Use `0xfU` to avoid this problem. — chqrlie, Mar 07 '19 at 21:37

njuffa · Answer 4 · 2019-03-08T09:27:34.883

A lookup table, as suggested in the answer by rici will provide the highest performance on most platforms. If you prefer a bit-twiddling approach, the optimal solution will depend on the hardware capabilities of your processor, e.g. how fast are shifts, does it have three-input logic operations (like my GPU), how many integer instructions can it execute in parallel? One solution is to transport each bit to the lsb of its target nibble, then fill in each nibble with its lsb value in a second step (a tip of the hat to chqrlie for suggesting the use of lsb instead of msb):

#include <stdint.h>
uint32_t expand_bits_to_nibbles (uint8_t x)
{
    uint32_t r;
    /* spread bits to lsb in each nibble */
    r = ((((uint32_t)x << (4*0-0)) & (1u << (4*0))) |
         (((uint32_t)x << (4*1-1)) & (1u << (4*1))) |
         (((uint32_t)x << (4*2-2)) & (1u << (4*2))) |
         (((uint32_t)x << (4*3-3)) & (1u << (4*3))) |
         (((uint32_t)x << (4*4-4)) & (1u << (4*4))) |
         (((uint32_t)x << (4*5-5)) & (1u << (4*5))) |
         (((uint32_t)x << (4*6-6)) & (1u << (4*6))) |
         (((uint32_t)x << (4*7-7)) & (1u << (4*7))));
    /* fill in nibbles */
    r = (r << 4) - r;
    return r;
}

Some quick experiments with Compiler Explorer show that this leads to a particularly efficient code on PowerPC64, for example.

If the processor has a fast integer multiplier, we could use it to shift multiple bits into place at the same time. Here, we would want to use groups of three source bits to avoid collisions:

#include <stdint.h>
uint32_t expand_bits_to_nibbles_mul (uint8_t x)
{
    const uint32_t spread3 = (1u <<  6) | (1u <<  3) | (1u <<  0);
    const uint8_t bits_lo3 = (1u <<  2) | (1u <<  1) | (1u <<  0);
    const uint8_t bits_md3 = (1u <<  5) | (1u <<  4) | (1u <<  3);
    const uint8_t bits_hi2 = (1u <<  7) | (1u <<  6);
    const uint32_t nib_lsb = (1u << 28) | (1u << 24) | (1u << 20) | (1u << 16) | 
                             (1u << 12) | (1u <<  8) | (1u <<  4) | (1u <<  0);
    uint32_t r;
    /* spread bits to lsb in each nibble */
    r = (((uint32_t)(x & bits_lo3) * (spread3 <<  0)) +
         ((uint32_t)(x & bits_md3) * (spread3 <<  9)) +
         ((uint32_t)(x & bits_hi2) * (spread3 << 18))) & nib_lsb;
    /* fill in nibbles */
    r = (r << 4) - r;
    return r;
}

Another variant using integer multiply, which is potentially faster on some platforms, uses an idea from this answer. We use a multiply to spread four bits at a time, such that they land within their target nibble. However, we then have to move the bit within the nibble to the nibble's lsb before we can expand the lsb to cover the nibble. We potentially save a multiplication at the expense of additional housekeeping.

#include <stdint.h>
uint32_t expand_bits_to_nibbles_mul2 (uint8_t x)
{
    const uint32_t spread4 = (1u << 12) | (1u <<  8) | (1u <<  4) | (1u <<  0);
    const uint32_t extract = (1u << (3*4+3+16)) | (1u << (2*4+2+16)) | 
                             (1u << (1*4+1+16)) | (1u << (0*4+0+16)) |
                             (1u << (3*4+3+ 0)) | (1u << (2*4+2+ 0)) | 
                             (1u << (1*4+1+ 0)) | (1u << (0*4+0+ 0));
    const uint32_t nib_lsb = (1u << 28) | (1u << 24) | (1u << 20) | (1u << 16) |
                             (1u << 12) | (1u <<  8) | (1u <<  4) | (1u <<  0);
    const uint32_t nib_msb = (nib_lsb << 3);
    const uint8_t bits_lo4 = (1u <<  3) | (1u <<  2) | (1u <<  1) | (1u <<  0);
    const uint8_t bits_hi4 = (1u <<  7) | (1u <<  6) | (1u <<  5) | (1u <<  4);
    uint32_t r;
    /* spread bits to their target nibbles */
    r = (((uint32_t)(x & bits_lo4) * (spread4 <<  0)) +  
         ((uint32_t)(x & bits_hi4) * (spread4 << 12)));
    /* extract appropriate bit in each nibble and move it into nibble's lsb */
    r = (((r & extract) + (nib_msb - extract)) >> 3) & nib_lsb;
    /* fill in each nibble with its lsb */
    r = (r << 4) - r;
    return r;
}

[PIC18 doesn't have a barrel shifter](https://en.wikipedia.org/wiki/PIC_instruction_listings#PIC18_high_end_core_devices_(16_bit)), so bit twiddling will be much worse than using a lookup table, as [shift by 4 will be slower than shift by 1](https://stackoverflow.com/q/3726326/995714) — phuclv, Mar 19 '19 at 16:02

Bit duplication from 8-bit to 32-bit

4 Answers4

Linked