I've got a packed bit array stored as a 32 bit word. I'd like to expand it into an array of bytes, where each byte corresponds to one of the bits of the array. Here's an example to illustrate what I mean (showing only 8 bits for brevity):
int packedArray = 0b01101010;
char outputArray[8] = {0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0xFF, 0x00};
As you can see, outputArray
's i
th entry corresponds to packedArray
's i
th bit. I'd like to use SIMD instructions to accomplish this efficiently for a 32 bit packed array. I've got the following code currently:
// 0th byte has only lowest bit, 1st byte has 2nd lowest bit, 2nd byte has 3rd lowest bit, etc
__m256i maskBits = _mm256_set1_epi64x(0x8040201008040201LL);
__m256i zeros = _mm256_set1_epi8(0x00);
__m256i ones = _mm256_set1_epi8(0xFF);
__m128i bitArray = _mm_loadu_si128(&packedArray);
__m256i A = _mm256_broadcastb_epi8(bitArray); // spread the byte across the entire 256 bit vector
bitArray = _mm_srli_epi32(bitArray, 8); // shift over to get the next byte
__m256i B = _mm256_broadcastb_epi8(bitArray);
bitArray = _mm_srli_epi32(bitArray, 8);
__m256i C = _mm256_broadcastb_epi8(bitArray);
bitArray = _mm_srli_epi32(bitArray, 8);
__m256i D = _mm256_broadcastb_epi8(bitArray);
A = _mm256_blend_epi32(A, B, 0b11001100); // put A in the lowest 64 bits and B in the next lowest 64 bits
C = _mm256_blend_epi32(C, D, 0b11001100);
__m256i fullBytes = _mm256_blend_epi32(A, C, 0b11110000); // put A/B in the bottom bits and C/D in the top bits
fullBytes = _mm256_and_si256(fullBytes, maskBits); // extract each packed bit to be the only one present in the byte
fullBytes = _mm256_cmpeq_epi8(fullBytes, zeros); // check which bytes have a 1, filling with zeros if it does (1s otherwise)
fullBytes = _mm256_xor_si256(fullBytes, ones); // bitwise not
This feels pretty convoluted. I first have to copy each byte across the vector, and then use a mask to extract each bit. Is there a better way to do all of this? It feels like this should be a pretty common task. It's also possible that I shouldn't even try to use SIMD for this, but I would assume it will be faster than a sequential approach. Any ideas?