6

How do I write a portable GNU C builtin vectors version of this, which doesn't depend on the x86 set1 intrinsic?

typedef uint16_t v8su __attribute__((vector_size(16)));

v8su set1_u16_x86(uint16_t scalar) {
    return (v8su)_mm_set1_epi16(scalar);   // cast needed for gcc
}

Surely there must be a better way than

v8su set1_u16(uint16_t s) {
    return (v8su){s,s,s,s,  s,s,s,s};
}

I don't want to write an AVX2 version of that for broadcasting a single byte!

Even a gcc-only or clang-only answer to this part would be interesting, for cases where you want to assign to a variable instead of only using as an operand to a binary operator (which works well with gcc, see below).


If I want to use a broadcast-scalar as one operand of a binary operator, this works with gcc (as documented in the manual), but not with clang:

v8su vecdiv10(v8su v) { return v / 10; }   // doesn't compile with clang

With clang, if I'm targeting only x86 and just using native vector syntax to get the compiler to generate modular multiplicative inverse constants and instructions for me, I can write:

v8su vecdiv_set1(v8su v) {
    return v / (v8su)_mm_set1_epi16(10);   // gcc needs the cast
}

But then I have to change the intrinsic if I widen the vector (to _mm256_set1_epi16), instead of converting the whole code to AVX2 by changing to vector_size(32) in one place (for pure-vertical SIMD that doesn't need shuffling). It also defeats part of the purpose of native vectors, since that won't compile for ARM or any non-x86 target.

The ugly cast is required because gcc, unlike clang, doesn't consider v8us {aka __vector(8) short unsigned int} compatible with __m128i {aka __vector(2) long long int}.

BTW, all of this compiles to good asm with gcc and clang (see it on Godbolt). This is just a question of how to write elegantly, with readable syntax that doesn't repeat the scalar N times. e.g. v / 10 is compact enough that there's no need to even put it in its own function.

Compiling efficiently with ICC is a bonus, but not required. GNU C native vectors are clearly an afterthought for ICC, and even simple stuff like this doesn't compile efficiently. set1_u16 compiles to 8 scalar stores and a vector load, instead of MOVD / VPBROADCASTW (with -xHOST enabled, because it doesn't recognize -march=haswell, but Godbolt runs on a server with AVX2 support). Purely casting the results of _mm_ intrinsics is ok, but the division calls an SVML function!

Community
  • 1
  • 1
Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • You can't reasonably use gcc vector intrinsics in clang *anyway*, since they oh-so-wisely decided to implement totally different `__bultin_shuffle()` semantics. – EOF Nov 21 '16 at 22:52
  • 1
    I just found some old code I wrote where I worked around the missing gcc vector intrinsic broadcast by doing `vectype v = {0}; v += scalartype;`. gcc optimizes this to a broadcast. It's not pretty (because it can't be `const`), but it's fairly short. – EOF Nov 23 '16 at 18:46

1 Answers1

4

A generic broadcast solution can be found for GCC and Clang using two observations

  1. Clang's OpenCL vector extensions and GCC's vector extensions support scalar - vector operations.
  2. x - 0 = x (but x + 0 does not work due to signed zero).

Here is a solution for a vector of four floats.

#if defined (__clang__)
typedef float v4sf __attribute__((ext_vector_type(4)));
#else
typedef float v4sf __attribute__ ((vector_size (16)));
#endif

v4sf broadcast4f(float x) {
  return x - (v4sf){};
}

https://godbolt.org/g/PXr3Xb

The same generic solution can be used for different vectors. Here is an example for a vector of eight unsigned shorts.

#if defined (__clang__)
typedef unsigned short v8su __attribute__((ext_vector_type(8)));
#else
typedef unsigned short v8su __attribute__((vector_size(16)));
#endif

v8su broadcast8us(short x) {
  return x - (v8su){};
}

ICC (17) supports a subset of the GCC vector extensions but does not support either vector + scalar or vector*scalar yet so intrinsics are still necessary for broadcasts. MSVC does not support any vector extensions.

Community
  • 1
  • 1
Z boson
  • 32,619
  • 11
  • 123
  • 226
  • 1
    Adding to zero can work well for integer, but not for float without `-ffast-math`. Signed-zero behaviour (and possibly raising exceptions) means`x + 0.0` can't be optimized to `x`, so the non-clang ifdef branch (`zero + x`) doesn't optimize away: [gcc has a `vaddss` from a scalar constant before the shuffle](https://godbolt.org/g/US7TSZ). – Peter Cordes May 05 '17 at 23:27
  • 1
    Upvoted because this answer works for clang (avoids adding to zero), but I don't think I'm ready to accept this answer. (AFAIK, there *isn't* an answer other than manually typing out the same variable up to 8 times in an initializer, for v8sf. Presumably integer types will still optimize `0 + x` to `x` (like EOF mentions in a comment), so we can use this for integer vectors and avoid typing it out 32 times for v32sc). – Peter Cordes May 05 '17 at 23:31
  • 1
    @PeterCordes, I just fixed it for GCC! Instead of doing `zero + x` you can do `one*x`! https://godbolt.org/g/1a39Kv. See my updated answer. – Z boson May 08 '17 at 13:31
  • 1
    Nice, that's exactly the kind of trick that answers this question. You should make that version the first version in your answer, since there's no downside. You can initialize `one` without typing out a comma-separate list of ones by doing `v4sf one = ((v4sf){}) + 1.0;`. 0.0 + 1.0 does optimize away at compile time because both operands are constants. https://godbolt.org/g/Xek2hV. This can get rid of gcc vs. clang `#ifdef`s other than in the `typedef`, but it seems ICC still needs custom `_mm_set1_...` intrinsics. – Peter Cordes May 09 '17 at 06:26
  • So I think you could make a generic broadcast macro or template for gcc/clang (which accepts a typename and a scalar, and returns the scalar broadcast to a 16-byte vector), but not for ICC. – Peter Cordes May 09 '17 at 06:28
  • I made a nice table at the end of [this answer](http://stackoverflow.com/a/43778723/2542702) that may interest you. – Z boson May 09 '17 at 10:17
  • 1
    @PeterCordes. A simpler solution `x - (v4sf){}`. https://godbolt.org/g/2NHhAo . See [this answer](http://stackoverflow.com/a/21732581/2542702). Apparently `x - 0` is okay as well. It's just `x + 0` that is the problem. https://godbolt.org/g/nXfIoH – Z boson May 10 '17 at 09:17
  • 1
    @PeterCordes apparently `((v4sf){} + 1)*x;` may still be useful if `-frounding-math` is used https://stackoverflow.com/questions/21727331/implict-simd-sse-avx-broadcasts-with-gcc/21732581?noredirect=1#comment74815992_21732581 – Z boson May 10 '17 at 12:10