First of all, you wouldn't want to use it in a loop even if it were possible, and you wouldn't want to fully unroll a loop with 16x `pextrb`. That instruction costs 2 uops on Intel and AMD CPUs, and will bottleneck on the shuffle port (and port 0 for vec->int data transfer).
The `_mm_extract_epi8` intrinsic requires a compile-time constant index because the `pextrb r32/m8, xmm, imm8` instruction is only available with the index as an immediate (embedded into the machine code of the instruction).
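For example (a minimal sketch; `third_byte` / `nth_byte` are just illustrative names), a constant index compiles to one `pextrb` instruction, while a runtime-variable index isn't something the intrinsic can express at all:

```c++
#include <immintrin.h>

// OK: the index 3 is a compile-time constant, so it can be encoded as the imm8.
int third_byte(__m128i v) {
    return _mm_extract_epi8(v, 3);   // pextrb eax, xmm0, 3  (SSE4.1)
}

// Not OK: there is no pextrb with a register index, so a runtime i can't be
// passed through; compilers typically reject this or emit a slow fallback.
// int nth_byte(__m128i v, int i) { return _mm_extract_epi8(v, i); }
```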
If you want to give up on SIMD and write a scalar loop over vector elements, then for this many elements you should store/reload. So write it that way in C++:
```c++
alignas(16) int8_t bytes[16];   // or uint8_t
_mm_store_si128((__m128i*)bytes, vec);
for (int i = 0; i < 16; i++) {
    foo(bytes[i]);
}
```
The cost of one store (and the store-forwarding latency) is amortized over 16 reloads, which each only cost something like 1x `movsx eax, byte ptr [rsp+16]` (1 uop on Intel and Ryzen). Or use `uint8_t` to get `movzx` zero-extension to 32-bit in the reloads. Modern CPUs can run 2 load uops per clock, and vector-store -> scalar-reload store forwarding is efficient (~6 or 7 cycle latency).
With 64-bit elements, `movq` + `pextrq` is almost certainly your best bet. Store + reloads are comparable cost for the front-end and worse latency than extract.
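A minimal sketch of that pattern (assumes SSE4.1 for `_mm_extract_epi64`; summing the halves just stands in for whatever you'd actually do with them):

```c++
#include <immintrin.h>
#include <cstdint>

int64_t sum_halves(__m128i v) {
    int64_t lo = _mm_cvtsi128_si64(v);      // movq: low 64 bits, no shuffle uop
    int64_t hi = _mm_extract_epi64(v, 1);   // pextrq with immediate index 1
    return lo + hi;
}
```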
With 32-bit elements, it's closer to break-even depending on your loop. An unrolled ALU extract could be good if the loop body is small. Or you might store/reload but do the first element with `_mm_cvtsi128_si32` (`movd`) for low latency on that element, so the CPU can be working on it while the store-forwarding latency for the higher elements happens.
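A sketch of that mixed strategy, where `process()` is a hypothetical stand-in for your per-element work:

```c++
#include <immintrin.h>
#include <cstdint>

static void process(int32_t x) { /* hypothetical per-element work */ (void)x; }

void foreach_epi32(__m128i v) {
    alignas(16) int32_t tmp[4];
    _mm_store_si128((__m128i*)tmp, v);   // one vector store covers elements 1..3
    process(_mm_cvtsi128_si32(v));       // movd: element 0 straight from the register
    for (int i = 1; i < 4; i++)
        process(tmp[i]);                 // store-forwarded reloads for the rest
}
```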
With 16-bit or 8-bit elements, it's almost certainly better to store/reload if you need to loop over all 8 or 16 elements.
If your loop makes a non-inline function call for each element, the Windows x64 calling convention has some call-preserved XMM registers, but x86-64 System V doesn't. So if your XMM reg would need to be spilled/reloaded around a function call, it's much better to just do scalar loads since the compiler will have it in memory anyway. (Hopefully it can optimize away the 2nd copy of it, or you could declare a union.)
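A sketch of the union idea, reusing the memory copy the calls force anyway (type-punning through a union is a widely supported idiom on GCC/Clang/MSVC, though strictly it's the store-to-array approach that standard C++ blesses); `foo` is the per-element function from the earlier loop:

```c++
#include <immintrin.h>
#include <cstdint>

union VecBytes {
    __m128i v;
    uint8_t b[16];
};

void call_foo_per_byte(__m128i vec, void (*foo)(uint8_t)) {
    VecBytes u;
    u.v = vec;                   // one store; the data has to be in memory across the calls anyway
    for (int i = 0; i < 16; i++)
        foo(u.b[i]);             // scalar loads; no XMM register needs to survive the calls
}
```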
See print a `__m128i` variable for working store + scalar loops for all element sizes.
If you actually want a horizontal sum, or min or max, you can do it with shuffles in O(log n) steps rather than n scalar loop iterations: see Fastest way to do horizontal float vector sum on x86 (which also covers 32-bit integer).
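For example, a sketch of an SSE2 O(log n) reduction for four 32-bit ints (the same shuffle-and-add pattern the linked answer uses for floats):

```c++
#include <immintrin.h>
#include <cstdint>

int32_t hsum_epi32(__m128i v) {
    __m128i hi64  = _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2));       // swap the 64-bit halves
    __m128i sum64 = _mm_add_epi32(v, hi64);
    __m128i hi32  = _mm_shufflelo_epi16(sum64, _MM_SHUFFLE(1, 0, 3, 2)); // swap the low two 32-bit elements
    __m128i sum32 = _mm_add_epi32(sum64, hi32);
    return _mm_cvtsi128_si32(sum32);                                     // movd the final sum
}
```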
And for summing byte elements, SSE2 has a special case of `_mm_sad_epu8(vec, _mm_setzero_si128())`: see Sum reduction of unsigned bytes without overflow, using SSE2 on Intel.
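A sketch of that `psadbw` trick: the SAD against zero produces two 64-bit lanes, each holding the sum of 8 bytes, which you then combine:

```c++
#include <immintrin.h>
#include <cstdint>

uint32_t sum_epu8(__m128i v) {
    __m128i sad = _mm_sad_epu8(v, _mm_setzero_si128());   // psadbw: two partial sums of 8 bytes each
    __m128i hi  = _mm_unpackhi_epi64(sad, sad);            // bring the high partial sum down
    return (uint32_t)_mm_cvtsi128_si32(_mm_add_epi32(sad, hi));
}
```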
You can also use that for signed bytes by range-shifting to unsigned and then subtracting 16 * 0x80 from the sum: https://github.com/pcordes/vectorclass/commit/630ca802bb1abefd096907f8457d090c28c8327b
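A sketch of that signed version: XOR with 0x80 range-shifts each byte to unsigned, and subtracting 16 * 0x80 afterwards removes the bias:

```c++
#include <immintrin.h>
#include <cstdint>

int32_t sum_epi8(__m128i v) {
    __m128i flipped = _mm_xor_si128(v, _mm_set1_epi8((char)0x80));   // range-shift signed -> unsigned
    __m128i sad     = _mm_sad_epu8(flipped, _mm_setzero_si128());
    __m128i hi      = _mm_unpackhi_epi64(sad, sad);
    int32_t biased  = _mm_cvtsi128_si32(_mm_add_epi32(sad, hi));
    return biased - 16 * 0x80;                                       // undo the 16 per-byte offsets
}
```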