I have a use case where I have an array of bits, each bit represented as an 8-bit integer, for example uint8_t data[] = {0,1,0,1,0,1,0,1};
I want to create a single integer by extracting only the LSB of each value. I know that with int _mm_movemask_pi8 (__m64 a) I can create a mask, but this intrinsic only takes the MSB of each byte, not the LSB. Is there a similar intrinsic or efficient method to extract the LSBs into a single 8-bit integer?
- Can you show an example output for your input? Do you want to transform `{0,1,0,1}` into `5ULL`? Usually the best you can do is express it in a simple for loop with known boundaries. The compiler will be able to vectorize it for the architecture you compile to (using Intel intrinsics won't be useful if you later want to port your program to ARM, for example). – Jorge Bellon Aug 30 '18 at 12:01
- Yes, for the above example I want to transform {0,1,0,1,0,1,0,1} to 85. – yadhu Aug 30 '18 at 12:03
- I don't want to port to an ARM platform. I want to convert it to an integer, then use that integer as an index into a look-up table. – yadhu Aug 30 '18 at 12:05
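For reference, the plain-loop approach from Jorge Bellon's comment might look like the sketch below (my own code; the function name is made up). A fixed-count loop like this is something compilers can often auto-vectorize for whatever target you build for. Note it puts byte i's LSB into bit i of the result, the same bit order as the intrinsic answers below.

#include <stdint.h>

// Scalar sketch: byte i's LSB -> bit i of the result.
uint8_t lsb_mask8_scalar(const uint8_t *bits) {
    uint8_t mask = 0;
    for (int i = 0; i < 8; i++)
        mask |= (uint8_t)((bits[i] & 1) << i);
    return mask;
}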
2 Answers
There is no direct way to do it, but obviously you can simply shift the lsb into the msb and then extract it:
_mm_movemask_pi8(_mm_slli_si64(x, 7))
Using MMX these days is strange and should probably be avoided.
Here is an SSE2 version, still reading only 8 bytes:
int lsb_mask8(uint8_t* bits) {
    __m128i x = _mm_loadl_epi64((__m128i*)bits);     // load 8 bytes into the low half, upper half zeroed
    return _mm_movemask_epi8(_mm_slli_epi64(x, 7));  // shift each byte's LSB up to its MSB, then collect the MSBs
}
Using SSE2 instead of MMX avoids the need for EMMS.
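A minimal, self-contained way to try this out (my own test harness, repeating the function above so the snippet compiles on its own; the sample input is made up):

#include <stdint.h>
#include <stdio.h>
#include <emmintrin.h>   // SSE2

int lsb_mask8(uint8_t* bits) {
    __m128i x = _mm_loadl_epi64((__m128i*)bits);
    return _mm_movemask_epi8(_mm_slli_epi64(x, 7));
}

int main(void) {
    uint8_t data[8] = {1, 1, 1, 1, 0, 0, 0, 0};
    printf("0x%02x\n", lsb_mask8(data));   // byte i's LSB -> bit i, so this prints 0x0f
    return 0;
}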

- Isn't the shift expensive? In the perf report I see this shift shown as expensive, although the AMD optimization manual says it has one-cycle latency. – yadhu Aug 30 '18 at 12:27
- @yadhu the shift shouldn't be slow itself; can you confirm that it is actually the shift and not the usual "misattribution of cost" that shifts the blame from a slow load to the operation that uses the result of the load? – harold Aug 30 '18 at 12:33
- I was also suspecting the same. Is there a way to avoid the "misattribution of cost" which shifts the blame? – yadhu Aug 30 '18 at 12:36
- @yadhu not that I know of, but there are experiments you could do that distinguish between different sources of the cost, such as looking at the effect of removing the shift. – harold Aug 30 '18 at 12:38
If you have efficient BMI2 pext (e.g. Haswell and newer, same as AVX2), then use the inverse of @wim's answer on your question about going the other direction (How to efficiently convert an 8-bit bitmap to array of 0/1 integers with x86 SIMD).
unsigned extract8LSB(uint8_t *arr) {
    uint64_t bytes;
    memcpy(&bytes, arr, 8);                               // load the 8 bytes as one 64-bit integer
    unsigned LSBs = _pext_u64(bytes, 0x0101010101010101); // gather bit 0 of each byte into the low 8 bits
    return LSBs;
}
This compiles like you'd expect to a qword load + a pext instruction. Compilers will hoist the 0x01... constant setup out of a loop after inlining.
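To illustrate that, here's a hypothetical caller (my own sketch, not from the answer) that converts many contiguous 8-byte groups; after inlining, compilers keep the mask constant in a register across iterations instead of re-materializing it.

#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <immintrin.h>   // BMI2: _pext_u64

// extract8LSB as defined above, repeated so the sketch stands alone.
static unsigned extract8LSB(uint8_t *arr) {
    uint64_t bytes;
    memcpy(&bytes, arr, 8);
    return _pext_u64(bytes, 0x0101010101010101);
}

// Hypothetical caller: one output byte per 8-byte group of 0/1 bytes.
// After inlining, the 0x0101010101010101 setup is hoisted out of this loop.
void extract_many(uint8_t *arr, uint8_t *out, size_t n_groups) {
    for (size_t i = 0; i < n_groups; i++)
        out[i] = (uint8_t)extract8LSB(arr + 8 * i);
}

With GCC/Clang this needs -mbmi2 (or e.g. -march=haswell).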
pext / pdep are efficient on Intel CPUs that support them (3 cycle latency / 1c throughput, 1 uop, same as a multiply). But they're not efficient on AMD, like 18c latency and throughput (https://agner.org/optimize/). If you care about AMD, you should definitely use @harold's pmovmskb answer.
Or if you have multiple contiguous blocks of 8 bytes, do them with a single wide vector and get a 32-bit bitmap. You can split that up if needed, or unroll the loop by 4, right-shifting the bitmap to get all 4 single-byte results.
If you're just storing this to memory right away, then you should probably have done this extraction in the loop that wrote the source data, instead of a separate loop, so it would still be hot in cache. AVX2 _mm256_movemask_epi8 is a single uop (on Intel CPUs) with low latency, so if your data isn't hot in L1d cache, a loop that just does this would not be keeping its execution units busy while waiting for memory.
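Here's a hedged sketch of that wide-vector variant (my own code and naming, not from the answer): load 32 contiguous 0/1 bytes, shift each byte's LSB up to its MSB as in the pmovmskb answer, and collect all 32 MSBs with _mm256_movemask_epi8.

#include <stdint.h>
#include <immintrin.h>   // AVX2

// 32 bytes in, 32-bit bitmap out: bit i of the result is the LSB of bytes[i].
uint32_t lsb_mask32(const uint8_t *bytes) {
    __m256i v = _mm256_loadu_si256((const __m256i *)bytes); // load 32 bytes (unaligned load is fine)
    v = _mm256_slli_epi64(v, 7);                             // move each byte's LSB into its MSB
    return (uint32_t)_mm256_movemask_epi8(v);                // collect the 32 MSBs into one integer
}

From there you can use the low 8 bits directly and right-shift the bitmap by 8 for each subsequent single-byte result, as described above. With GCC/Clang this needs -mavx2 (or e.g. -march=haswell).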
