1

I've got a packed bit array stored as a 32 bit word. I'd like to expand it into an array of bytes, where each byte corresponds to one of the bits of the array. Here's an example to illustrate what I mean (showing only 8 bits for brevity):

int packedArray = 0b01101010;
char outputArray[8] = {0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0xFF, 0x00};

As you can see, outputArray's ith entry corresponds to packedArray's ith bit. I'd like to use SIMD instructions to accomplish this efficiently for a 32 bit packed array. I've got the following code currently:

// 0th byte has only lowest bit, 1st byte has 2nd lowest bit, 2nd byte has 3rd lowest bit, etc
__m256i maskBits = _mm256_set1_epi64x(0x8040201008040201LL);
__m256i zeros = _mm256_set1_epi8(0x00);
__m256i ones = _mm256_set1_epi8(0xFF);

__m128i bitArray = _mm_loadu_si128(&packedArray);

__m256i A = _mm256_broadcastb_epi8(bitArray);   // spread the byte across the entire 256 bit vector
bitArray = _mm_srli_epi32(bitArray, 8);         // shift over to get the next byte
__m256i B = _mm256_broadcastb_epi8(bitArray);
bitArray = _mm_srli_epi32(bitArray, 8);
__m256i C = _mm256_broadcastb_epi8(bitArray);
bitArray = _mm_srli_epi32(bitArray, 8);
__m256i D = _mm256_broadcastb_epi8(bitArray);

A = _mm256_blend_epi32(A, B, 0b11001100);   // put A in the lowest 64 bits and B in the next lowest 64 bits
C =  _mm256_blend_epi32(C, D, 0b11001100);
__m256i fullBytes = _mm256_blend_epi32(A, C, 0b11110000);   // put A/B in the bottom bits and C/D in the top bits

fullBytes = _mm256_and_si256(fullBytes, maskBits);  // extract each packed bit to be the only one present in the byte
fullBytes = _mm256_cmpeq_epi8(fullBytes, zeros);    // check which bytes have a 1, filling with zeros if it does (1s otherwise)
fullBytes = _mm256_xor_si256(fullBytes, ones);      // bitwise not

This feels pretty convoluted. I first have to copy each byte across the vector, and then use a mask to extract each bit. Is there a better way to do all of this? It feels like this should be a pretty common task. It's also possible that I shouldn't even try to use SIMD for this, but I would assume it will be faster than a sequential approach. Any ideas?

multitaskPro
  • 569
  • 4
  • 14
  • 2
    `vpshufb` can do a lot of that broadcasting in fewer steps. BTW what do you use the unpacked array for? Just checking because many things that are traditionally done with them are nowadays also feasible with packed bits. – harold Jan 28 '23 at 20:49
  • @harold I'm writing the output directly to a buffer as pixels. I don't think OpenGL has a packed bit format, although that would be very cool if it did – multitaskPro Jan 28 '23 at 21:17

0 Answers0