First of all, you wouldn't want to use it in a loop even if it were possible, and you wouldn't want to fully unroll a loop with 16x `pextrb`. That instruction costs 2 uops on Intel and AMD CPUs, and will bottleneck on the shuffle port (and port 0 for vec->int data transfer).
The `_mm_extract_epi8` intrinsic requires a compile-time constant index because the `pextrb r32/m8, xmm, imm8` instruction is only available with the index as an immediate (embedded into the machine code of the instruction).
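For example (a minimal sketch; `third_byte` / `nth_byte` are just illustrative names), a constant index compiles to one `pextrb` instruction, while a runtime-variable index isn't something the intrinsic can express at all:

```c++
#include <immintrin.h>

// OK: the index 3 is a compile-time constant, so it can be encoded as the imm8.
int third_byte(__m128i v) {
    return _mm_extract_epi8(v, 3);   // pextrb eax, xmm0, 3  (SSE4.1)
}

// Not OK: there is no pextrb with a register index, so a runtime i can't be
// passed through; compilers typically reject this or emit a slow fallback.
// int nth_byte(__m128i v, int i) { return _mm_extract_epi8(v, i); }
```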
If you want to give up on SIMD and write a scalar loop over vector elements, then for this many elements you should store/reload. So write it that way in C++:
```c++
alignas(16) int8_t bytes[16];   // or uint8_t
_mm_store_si128((__m128i*)bytes, vec);
for (int i = 0; i < 16; i++) {
    foo(bytes[i]);
}
```
The cost of one store (and the store-forwarding latency) is amortized over 16 reloads, which each only cost something like 1x `movsx eax, byte ptr [rsp+16]` (1 uop on Intel and Ryzen). Or use `uint8_t` to get `movzx` zero-extension to 32-bit in the reloads. Modern CPUs can run 2 load uops per clock, and vector-store -> scalar-reload store forwarding is efficient (~6 or 7 cycle latency).
With 64-bit elements, `movq` + `pextrq` is almost certainly your best bet. Store + reloads are comparable cost for the front-end and worse latency than extract.
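A minimal sketch of that pattern (assumes SSE4.1 for `_mm_extract_epi64`; summing the halves just stands in for whatever you'd actually do with them):

```c++
#include <immintrin.h>
#include <cstdint>

int64_t sum_halves(__m128i v) {
    int64_t lo = _mm_cvtsi128_si64(v);      // movq: low 64 bits, no shuffle uop
    int64_t hi = _mm_extract_epi64(v, 1);   // pextrq with immediate index 1
    return lo + hi;
}
```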
With 32-bit elements, it's closer to break-even depending on your loop. An unrolled ALU extract could be good if the loop body is small. Or you might store/reload but do the first element with `_mm_cvtsi128_si32` (`movd`) for low latency on that element, so the CPU can be working on it while the store-forwarding latency for the higher elements happens.
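A sketch of that mixed strategy, where `process()` is a hypothetical stand-in for your per-element work:

```c++
#include <immintrin.h>
#include <cstdint>

static void process(int32_t x) { /* hypothetical per-element work */ (void)x; }

void foreach_epi32(__m128i v) {
    alignas(16) int32_t tmp[4];
    _mm_store_si128((__m128i*)tmp, v);   // one vector store covers elements 1..3
    process(_mm_cvtsi128_si32(v));       // movd: element 0 straight from the register
    for (int i = 1; i < 4; i++)
        process(tmp[i]);                 // store-forwarded reloads for the rest
}
```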
With 16-bit or 8-bit elements, it's almost certainly better to store/reload if you need to loop over all 8 or 16 elements.
If your loop makes a non-inline function call for each element, the Windows x64 calling convention has some call-preserved XMM registers, but x86-64 System V doesn't. So if your XMM reg would need to be spilled/reloaded around a function call, it's much better to just do scalar loads since the compiler will have it in memory anyway. (Hopefully it can optimize away the 2nd copy of it, or you could declare a union.)
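A sketch of the union idea, reusing the memory copy the calls force anyway (type-punning through a union is a widely supported idiom on GCC/Clang/MSVC, though strictly it's the store-to-array approach that standard C++ blesses); `foo` is the per-element function from the earlier loop:

```c++
#include <immintrin.h>
#include <cstdint>

union VecBytes {
    __m128i v;
    uint8_t b[16];
};

void call_foo_per_byte(__m128i vec, void (*foo)(uint8_t)) {
    VecBytes u;
    u.v = vec;                   // one store; the data has to be in memory across the calls anyway
    for (int i = 0; i < 16; i++)
        foo(u.b[i]);             // scalar loads; no XMM register needs to survive the calls
}
```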
See print a `__m128i` variable for working store + scalar loops for all element sizes.
If you actually want a horizontal sum, or min or max, you can do it with shuffles in O(log n) steps rather than n scalar loop iterations: see Fastest way to do horizontal float vector sum on x86 (which also covers 32-bit integer).
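For example, a sketch of an SSE2 O(log n) reduction for four 32-bit ints (the same shuffle-and-add pattern the linked answer uses for floats):

```c++
#include <immintrin.h>
#include <cstdint>

int32_t hsum_epi32(__m128i v) {
    __m128i hi64  = _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2));       // swap the 64-bit halves
    __m128i sum64 = _mm_add_epi32(v, hi64);
    __m128i hi32  = _mm_shufflelo_epi16(sum64, _MM_SHUFFLE(1, 0, 3, 2)); // swap the low two 32-bit elements
    __m128i sum32 = _mm_add_epi32(sum64, hi32);
    return _mm_cvtsi128_si32(sum32);                                     // movd the final sum
}
```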
And for summing byte elements, SSE2 has a special case of `_mm_sad_epu8(vec, _mm_setzero_si128())`: see Sum reduction of unsigned bytes without overflow, using SSE2 on Intel.
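A sketch of that `psadbw` trick: the SAD against zero produces two 64-bit lanes, each holding the sum of 8 bytes, which you then combine:

```c++
#include <immintrin.h>
#include <cstdint>

uint32_t sum_epu8(__m128i v) {
    __m128i sad = _mm_sad_epu8(v, _mm_setzero_si128());   // psadbw: two partial sums of 8 bytes each
    __m128i hi  = _mm_unpackhi_epi64(sad, sad);            // bring the high partial sum down
    return (uint32_t)_mm_cvtsi128_si32(_mm_add_epi32(sad, hi));
}
```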
You can also use that for signed bytes by range-shifting to unsigned and then subtracting 16 * 0x80 from the sum: https://github.com/pcordes/vectorclass/commit/630ca802bb1abefd096907f8457d090c28c8327b
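A sketch of that signed version: XOR with 0x80 range-shifts each byte to unsigned, and subtracting 16 * 0x80 afterwards removes the bias:

```c++
#include <immintrin.h>
#include <cstdint>

int32_t sum_epi8(__m128i v) {
    __m128i flipped = _mm_xor_si128(v, _mm_set1_epi8((char)0x80));   // range-shift signed -> unsigned
    __m128i sad     = _mm_sad_epu8(flipped, _mm_setzero_si128());
    __m128i hi      = _mm_unpackhi_epi64(sad, sad);
    int32_t biased  = _mm_cvtsi128_si32(_mm_add_epi32(sad, hi));
    return biased - 16 * 0x80;                                       // undo the 16 per-byte offsets
}
```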