I have an array of bytes where each byte is either 0 or 1. Now I want to pack these values into bits, so that 8 original bytes occupy 1 target byte, with original byte 0 going into bit 0, byte 1 into bit 1, etc. So far I have the following in the kernel:

const uint16_t tid = threadIdx.x;
__shared__ uint8_t packing[cBlockSize];

// ... Computation of the original bytes in packing[tid]
__syncthreads();

// Tree reduction in shared memory: three steps pack 8 bytes into 1.
if ((tid & 4) == 0)
{
    // Fold in the bit from the thread 4 slots away, placing it at bit 4
    packing[tid] |= packing[tid | 4] << 4;
}
if ((tid & 6) == 0)
{
    // Fold in bits 0 and 4 of the thread 2 slots away, at bits 2 and 6
    packing[tid] |= packing[tid | 2] << 2;
}
if ((tid & 7) == 0)
{
    // Combine the even bits with the odd neighbour's bits shifted up by 1
    // and write the fully packed byte to global memory
    pOutput[(tid + blockDim.x*blockIdx.x)>>3] = packing[tid] | (packing[tid | 1] << 1);
}

Is this correct and efficient?
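
For reference, the intended mapping expressed as a scalar host-side sketch (`packReference` is a hypothetical helper, not part of the kernel):

#include <cstdint>
#include <cstddef>

// Hypothetical scalar reference: byte j of each group of 8 input bytes
// becomes bit j of the corresponding output byte.
void packReference(const uint8_t *in, uint8_t *out, size_t n /* multiple of 8 */)
{
    for (size_t i = 0; i < n / 8; ++i)
    {
        uint8_t b = 0;
        for (int j = 0; j < 8; ++j)
            b |= (in[8 * i + j] & 1) << j;
        out[i] = b;
    }
}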

Serge Rogatch

2 Answers

The `__ballot()` warp-voting function comes in quite handy for this. Assuming that you can redefine `pOutput` to be of `uint32_t` type, and that your block size is a multiple of the warp size (32):

unsigned int target = __ballot(packing[tid]);
if (tid % warpSize == 0) {
    pOutput[(tid + blockDim.x*blockIdx.x) / warpSize] = target;
}

Strictly speaking, the if conditional isn't even necessary, as all threads of the warp will write the same data to the same address. So a highly optimized version would just be

pOutput[(tid + blockDim.x*blockIdx.x) / warpSize] = __ballot(packing[tid]);
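
Note that on CUDA 9 and later the bare `__ballot()` is deprecated in favour of `__ballot_sync()`, which takes an explicit mask of participating threads. A minimal sketch of the same idea, assuming the whole warp is active:

unsigned int target = __ballot_sync(0xffffffffu, packing[tid]);
pOutput[(tid + blockDim.x*blockIdx.x) / warpSize] = target;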
tera
    Great solution, thanks. Then I don't need shared memory and `__syncthreads()`. – Serge Rogatch Sep 14 '16 at 11:26
  • As I understand it, this doesn't scale to packing 2-bit values? Even though we can use 2 `__ballot` calls to get the lower and higher bits in 2 separate 32-bit variables, interleaving the bits afterwards (e.g. with http://stackoverflow.com/questions/39490345/interleave-bits-efficiently ) is more expensive on CUDA than the algorithm I gave in the question. For packing 2-bit values, that algorithm would need to write the output inside `if ((tid & 6) == 0)`. – Serge Rogatch Sep 15 '16 at 06:30
  • You can use [shuffle instructions](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-shuffle-functions) to move the data into position before the ballot. – tera Sep 15 '16 at 08:39
  • How is the data stored in each thread? Two bytes, one for each bit? One value where you want to take the lowest two bits? – tera Sep 15 '16 at 08:41
  • @tera, I can store it either way, the conversion is easy. Initially it is 1 value where I want to take lower 2 bits. – Serge Rogatch Sep 15 '16 at 11:11
  • I've put my take on it in a separate answer, so I can format the code. – tera Sep 15 '16 at 15:12
  • Maxwell and Pascal have [faster shared memory atomics](https://devblogs.nvidia.com/parallelforall/gpu-pro-tip-fast-histograms-using-shared-atomics-maxwell/), so a shared [`atomicOr`](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomicor) approach should be able to flexibly handle an arbitrary number (or arrangement) of bits per thread (see the sketch below). – Robert Crovella Sep 15 '16 at 20:22
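
A minimal sketch of the shared-memory `atomicOr` approach from the last comment, assuming a hypothetical compile-time constant `cBitsPerThread`, reusing `packing[threadIdx.x]` as the per-thread value, and a hypothetical `uint32_t` output pointer `pPacked` (an illustration, not code from either answer):

constexpr unsigned cBitsPerThread = 2;  // assumption; cBlockSize * cBitsPerThread assumed a multiple of 32
__shared__ unsigned packed[cBlockSize * cBitsPerThread / 32];

// Clear the shared words before OR-ing into them.
for (unsigned i = threadIdx.x; i < cBlockSize * cBitsPerThread / 32; i += blockDim.x)
    packed[i] = 0;
__syncthreads();

const unsigned value = packing[threadIdx.x];          // lower cBitsPerThread bits valid
const unsigned bitPos = threadIdx.x * cBitsPerThread;
atomicOr(&packed[bitPos / 32], (value & ((1u << cBitsPerThread) - 1)) << (bitPos % 32));
__syncthreads();

// Copy the packed words out to global memory.
for (unsigned i = threadIdx.x; i < cBlockSize * cBitsPerThread / 32; i += blockDim.x)
    pPacked[blockIdx.x * (cBlockSize * cBitsPerThread / 32) + i] = packed[i];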

For two bits per thread, using uint2 *pOutput:

int lane = tid % warpSize;
uint2 target;
// Each pair of lanes fetches one thread's 2-bit value via shuffle;
// even lanes test bit 0 ((lane & 1) + 1 == 1), odd lanes test bit 1 (== 2).
target.x = __ballot(__shfl(packing[tid], lane / 2)                & ((lane & 1) + 1));
target.y = __ballot(__shfl(packing[tid], lane / 2 + warpSize / 2) & ((lane & 1) + 1));
pOutput[(tid + blockDim.x*blockIdx.x) / warpSize] = target;

You'll have to benchmark whether this is still faster than your conventional solution.
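
To check the resulting layout (thread k's two bits land at bit positions 2k and 2k+1 of `target.x` for k < 16, and of `target.y` for k >= 16), a hypothetical host-side reference could look like:

#include <cstdint>
#include <vector_types.h>   // CUDA's uint2, usable in host code

// Hypothetical reference implementation of the packing above.
void pack2Reference(const uint8_t values[32], uint2 &out)
{
    out.x = out.y = 0;
    for (int k = 0; k < 16; ++k)
    {
        out.x |= uint32_t(values[k]      & 3u) << (2 * k);
        out.y |= uint32_t(values[k + 16] & 3u) << (2 * k);
    }
}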

tera