5

I have the following code to turn a bit into a byte.

__device__ UINT64 bitToByte(const UINT8 input) {
    UINT64 b = ((0x8040201008040201ULL * input) >> 7) & 0x0101010101010101ULL; 
    //reverse the byte order <<-- this step is missing
    return b;
}

However the bytes are in the wrong order, the endianness is reversed. On the CPU I can simply to a bswap reg,reg to fix this, but what do I do on the GPU?

Alternatively, what similar trick can I use so that the bytes are put the right way round, i.e. the Most Significant bit goes to the Most Significant Byte, such that I don't need a bswap trick.

Johan
  • 74,508
  • 24
  • 191
  • 319
  • A naive thought maybe, but can you not do the `bswap reg,reg` step first, so that the endianness is correct and then do the bit-to-byte conversion. – Duck Dodgers Jan 28 '19 at 09:22
  • @JoeyMallone, I don't have a bswap instruction, all my code runs on the GPU. Hence the `CUDA` tag and the `__device__` annotation on the function. – Johan Jan 28 '19 at 09:31
  • 1
    The prmt PTX instruction allows doing this. I don't have time to find you the links now, but I've used it in [this answer](https://stackoverflow.com/a/40198552/1662425), which might give you just enough info to work it out yourself. – tera Jan 28 '19 at 10:19
  • 1
    @tera, ah yess, `prmt` maps to the intrinsic `__byte_perm(a32,b32,swap)` instruction. That will work. – Johan Jan 28 '19 at 10:24

3 Answers3

3

Thanks to @tera, here is the answer:

//Expand every bit into a byte
__device__ static UINT64 Add012(const UINT8 input) {
    const UINT64 b = ((0x8040201008040201ULL * input) >> 7) & 0x0101010101010101ULL; //extract every bit into a byte
    //unfortunatly this returns the wrong byte order
    UINT32* const b2 = (UINT32*)&b;
    UINT64 Result;
    UINT32* const Result2 = (UINT32*)&Result;
    Result2[0] = __byte_perm(b2[0]/*LSB*/, b2[1], 0x4567);  //swap the bytes around, the MSB's go into the LSB in reverse order
    Result2[1] = __byte_perm(b2[0]/*LSB*/, b2[1], 0x0123);  //and the LSB -> MSB reversed.
    return Result;
}

The __byte_perm replaces the bswap instruction.

Alternatively the input can be reversed using the __brev (bit-reverse) intrinsic:

//Expand every bit into a byte
__device__ static UINT64 Add012(const UINT8 input) {
    const UINT32 reversed = (__brev(input) >> 24);
    return ((0x8040201008040201ULL * reversed) >> 7) & 0x0101010101010101ULL; //extract every bit into a byte
}

The second version looks easier.

Johan
  • 74,508
  • 24
  • 191
  • 319
2

Instead of reverse the result, you can reverse input, with any of the tricks explained here. For example, using the approach of this answer:

static UINT8 lookup[16] = {
    0x0, 0x8, 0x4, 0xc, 0x2, 0xa, 0x6, 0xe,
    0x1, 0x9, 0x5, 0xd, 0x3, 0xb, 0x7, 0xf, };

UINT8 reverse(UINT8 n) {
    return (lookup[n & 0xF] << 4) | lookup[n >> 4];
}

__device__ UINT64 bitToByte(const UINT8 input) {
    UINT64 b = ((0x8040201008040201ULL * reverse(input)) >> 7) & 0x0101010101010101ULL; 
    return b;
}
Giovanni Cerretani
  • 1,693
  • 1
  • 16
  • 30
2

To reverse byte order, the bit extraction can be done with the same trick, but by swapping the coefficients that perform the shift in the multiplication. However, to avoid clashes in the multiplication, it must be done in two steps, for the even and odd bits. This way, 2 bytes are free to hold the result of the every multiplication which is sufficient to ensure integrity of the result.

__device__ UINT64 bitToByte(const UINT8 input) {
  UINT64 b = ( ((0x0002000800200080ULL * input) >> 7) & 0x0001000100010001ULL) 
          |  ( ((0x0100040010004000ULL * input) >> 7) & 0x0100010001000100ULL);

    return b;
}

As spotted in the comments, to optimize, the shifts can be factorized.

__device__ UINT64 bitToByte(const UINT8 input) {
  UINT64 b =  ( ((0x0002000800200080ULL * input) & 0x0080008000800080ULL) 
              | ((0x0100040010004000ULL * input) & 0x8000800080008000ULL) )
                >> 7 ;
    return b;
}
talonmies
  • 70,661
  • 34
  • 192
  • 269
Alain Merigot
  • 10,667
  • 3
  • 18
  • 31
  • Nice! I did try this, but ran into clashes with the multiplicands overflowing into the adjacent byte. This solves that. – Johan Jan 28 '19 at 11:41
  • Makes sense. In `0x8040201008040201` the 1 bits are 9 apart, but in `0x0102040810204080` they're only 7 apart. In this way you keep them 14 bits apart. – MSalters Jan 28 '19 at 13:16
  • You can eliminate two operations by factoring the ` >>7` into the constants here. But the optimizer might spot that too. – MSalters Jan 28 '19 at 13:19
  • @MSalters Right! But the masks must be changed and I doubt the optimizer can do that. Updated the answer with this suggestions. – Alain Merigot Jan 28 '19 at 13:50