7

I am using the AVX intrinsic _mm256_extract_epi32().

I am not entirely sure if I am using it correctly, though, because gcc doesn't like my code, whereas clang compiles it and runs it without issue.

I am extracting the lane based on the value of an integer variable, as opposed to using a constant.

When compiling the following snippet with clang 3.8 (or clang 4) for AVX2, it generates code and uses the vpermd instruction.

#include <stdlib.h>
#include <immintrin.h>
#include <stdint.h>

uint32_t foo( int a, __m256i vec )
{
    uint32_t e = _mm256_extract_epi32( vec, a );
    return e*e;
}

Now, if I use gcc instead, say gcc 7.2, the compiler fails to generate code, with the errors:

In file included from /opt/compiler-explorer/gcc-7.2.0/lib/gcc/x86_64-linux-gnu/7.2.0/include/immintrin.h:41:0,
                 from <source>:2:
/opt/compiler-explorer/gcc-7.2.0/lib/gcc/x86_64-linux-gnu/7.2.0/include/avxintrin.h: In function 'foo':
/opt/compiler-explorer/gcc-7.2.0/lib/gcc/x86_64-linux-gnu/7.2.0/include/avxintrin.h:524:20: error: the last argument must be a 1-bit immediate
   return (__m128i) __builtin_ia32_vextractf128_si256 ((__v8si)__X, __N);
                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /opt/compiler-explorer/gcc-7.2.0/lib/gcc/x86_64-linux-gnu/7.2.0/include/immintrin.h:37:0,
                 from <source>:2:
/opt/compiler-explorer/gcc-7.2.0/lib/gcc/x86_64-linux-gnu/7.2.0/include/smmintrin.h:449:11: error: selector must be an integer constant in the range 0..3
    return __builtin_ia32_vec_ext_v4si ((__v4si)__X, __N);
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I have two issues with this:

  1. Why is clang fine with using a variable, and does gcc want a constant?
  2. Why can't gcc make up its mind? First it demands a 1-bit immediate value, and later it wants an integer constant in the range 0..3, and those are different things.

Intel's Intrinsics Guide doesn't specify constraints on the index value for _mm256_extract_epi32(), by the way, so who's right here, gcc or clang?

Bram
  • Note: clang5 also compiles the code, but makes a round trip to memory to extract the lane. – Bram Feb 10 '18 at 22:09

3 Answers

7

Apparently GCC and Clang made different choices here. IMHO GCC made the right one by not implementing this for variable indices: _mm256_extract_epi32 doesn't translate to a single instruction, and with a variable index it might lead to inefficient code if used in a performance-critical loop.

For example, Clang 3.8 needs 4 instructions to implement _mm256_extract_epi32 with a variable index. GCC forces the programmer to think about more efficient code that avoids _mm256_extract_epi32 with variable indices.

Nevertheless, sometimes it is useful to have a portable (gcc, clang, icc) function which emulates _mm256_extract_epi32 with a variable index:

uint32_t mm256_extract_epi32_var_indx(const __m256i vec, const unsigned int i) {
    __m128i indx = _mm_cvtsi32_si128(i);          /* move index into low dword of an xmm register */
    __m256i val  = _mm256_permutevar8x32_epi32(vec,
                       _mm256_castsi128_si256(indx)); /* vpermd: move lane i of vec to lane 0      */
    return _mm_cvtsi128_si32(_mm256_castsi256_si128(val)); /* extract lane 0                       */
}

This should compile to three instructions after inlining: two vmovds and a vpermd (gcc 8.2 with -m64 -march=skylake -O3):

mm256_extract_epi32_var_indx:
  vmovd xmm1, edi
  vpermd ymm0, ymm1, ymm0
  vmovd eax, xmm0
  vzeroupper
  ret

Note that the intrinsics guide describes the result as 0 for indices >=8 (which is an unusual case anyway). With Clang 3.8, and with mm256_extract_epi32_var_indx, the index is reduced modulo 8. In other words: only the 3 least significant bits of the index are used. Note that Clang 5.0's round trip to memory isn't very efficient either, see this Godbolt link. Clang 7.0 fails to compile _mm256_extract_epi32 with variable indices.

As @Peter Cordes commented: with a fixed index 0, 1, 2, or 3, only a single pextrd instruction is needed to extract the integer from the xmm register. With a fixed index 4, 5, 6, or 7, two instructions are required. Unfortunately, a vpextrd instruction working on 256-bit ymm registers doesn't exist.


The next example illustrates my answer:

A naive programmer starting with SIMD intrinsics might write the following code to sum the elements 0, 1, ..., j-1 (with j<8) of vec.

#include <stdlib.h>
#include <immintrin.h>
#include <stdint.h>

uint32_t foo( __m256i vec , int j)
{   
    uint32_t sum=0;
    for (int i = 0; i < j; i++){
        sum = sum + (uint32_t)_mm256_extract_epi32( vec, i );
    }
    return sum;
}

With Clang 3.8 this compiles to about 50 instructions with branches and loops. GCC fails to compile this code. An efficient way to sum these elements would likely:

  1. mask out the elements j, j+1, ..., 7, and
  2. compute the horizontal sum.
wim
    It would be nice if there was more portable syntax for indexing vector elements, for debug-print functions. `_mm256_extract_epi32` seems like a reasonable place to put that overload, if compilers agreed with each other, because [the `vpextrd` machine instruction](https://github.com/HJLebbink/asm-dude/wiki/PEXTRB_PEXTRD_PEXTRQ) is only available with XMM operands anyway, not 256 bit YMM. `_mm256_extract_epi32` can't be a single instruction for indices >= 4. – Peter Cordes Feb 10 '18 at 23:42
  • Thanks, but not quite following you: if we ignore the >=8 index issue, surely an AND with 0x7, followed by VPERMD is pretty efficient (like clang outputs) ? I tried adding assert(a<8) to hint the compiler to leave out the AND, but it didn't. (btw, in my code, index is always less than 8) – Bram Feb 11 '18 at 01:40
  • In the old off-line intrinsics guide there was a warning displayed for the composite intrinsics: _Note: This intrinsic creates a sequence of two or more instructions, and may perform worse than a native instruction. Consider the performance impact of this intrinsic._ . I think the same warning applies here. Of course, the performance impact is negligible if this intrinsic isn't used too much. On Skylake the latency of the 2 `vmovd`-s and the `vpermd` is 7 cycles – wim Feb 11 '18 at 03:40
6

The __N that gcc says must be a 1-bit immediate is not the 2nd arg to _mm256_extract_epi32; it's some function of that argument, used as an arg to __builtin_ia32_vextractf128_si256 (presumably the 3rd bit of the index). Then later it wants an integer constant in the range 0..3 for vpextrd, giving you a total of 3 bits of index.

_mm256_extract_epi32 is a composite intrinsic, not directly defined in terms of a single builtin function.

vpextrd r32, ymm, imm8 doesn't exist, only the xmm version, so _mm256_extract_epi32 is a wrapper around vextracti/f128 / vpextrd. GCC chooses to make it work only for compile-time constants, so it always compiles to at most 2 instructions.

If you want runtime-variable vector indexing, you need to use different syntax; e.g. store to an array and load a scalar, and hope gcc optimizes that into a shuffle / extract.

Or define a GNU C native vector type with the right element width, and use foo[i] to index it like an array.

typedef int v8si __attribute__ ((vector_size (32)));
v8si tmp = (v8si)foo;   // cast needed: __m256i has long long elements
int element_i = tmp[i];

__m256i in gcc/clang is defined as a vector of long long elements, so if you index it directly with [], you'll get qword elements. (And your code won't compile with MSVC, which doesn't define __m256i that way at all.)


I haven't checked the asm for any of these recently: if you care about efficiency, you might want to manually design a shuffle using your runtime-variable index, like @Wim's answer suggests that clang does.

Peter Cordes
  • Related: [print a \_\_m128i variable](https://stackoverflow.com/q/13257166) shows the store/reload method with a temp array for portable runtime-variable indexing. Compilers may optimize it into a shuffle anyway, like perhaps `vmovd xmm0, edi` / `vpermd ymm2, ymm1, ymm0` / `vmovd eax,xmm2` to extract `ymm1[edi]` with 32-bit elements. Or not, in which case store-forwarding is not bad for latency, and cheap. – Peter Cordes Jul 29 '23 at 18:52
-1

The reason you had compilation trouble with GCC, is that it made a rather, umm, unusual decision when implementing AVX.

In GCC, there is a built-in vector type used to implement the SIMD types like __m256, __m128i, etc., instead of using a union like in MSVC. This is usually not too big of a deal for compatibility.

But for some reason, when going to 256 bits, gcc implemented __m256i as 4 64-bit ints, unlike __m128i which is 4 32-bit ints. For floats, GCC did the right thing and made __m256 be 8 32-bit floats. Portable AVX SIMD wrappers that work with GCC+MSVC will have a fair amount of casting and hackery in the integer ops because of this :-(

chris green
  • GCC's current definition for `__m128i` is 2x 64-bit, `typedef long long __m128i __attribute__ ((__vector_size__ (16), __may_alias__));` in `emmintrin.h`, and hasn't changed for years. Did some much older version of GCC implement `__m128i` as a vector of `int` or `int32_t`? Or `long` on 32-bit systems? https://github.com/gcc-mirror/gcc/blob/30fb3231107d372c2e9df88e18714baae783870e/gcc/config/i386/emmintrin.h from 2003 shows `__m128i` defined as `__v2di` (vector of two double-width integers, i.e. 2x 64-bit since single width `int` is 32-bit on x86.) That's the oldest in that repo. – Peter Cordes Jul 12 '23 at 15:42
  • If you need to access single elements of a vector, the portable way is to store to a tmp array, as shown in my answers. Some compilers will optimize that to a shuffle. A runtime-variable index is inherently hard to deal with, given the limited choices of shuffles that exist. – Peter Cordes Jul 12 '23 at 15:45
  • Hmm you are right, my code is also full of casts for m128i also. The big issue is that msvc, _and_ the intel docs don't do it as 64 bit elements. Plus 32 bit ones are infinitely more common (at least in my game code). – chris green Jul 29 '23 at 16:10
  • The Intel intrinsics docs don't document any operator to index a `__m128i` directly, only intrinsic functions. The internal definition is opaque as far as the portably-documented intrinsic API is concerned. – Peter Cordes Jul 29 '23 at 19:04