
I'm looking for a way to overload operator[] (within a broader SIMD class) to facilitate reading and writing individual elements within a SIMD word (e.g. __m512i). A couple constraints:

  • Compliant with C++11 (or later)
  • Compatible with additional intrinsics based code
  • Not OpenCL/SYCL (which I could use, but I can't *sigh*)
  • Mostly portable across g++, icpc, clang++
  • Preferably applicable to other SIMD beyond Intel (ARM, IBM, etc...)
  • (edit) Performance isn't really an issue (not generally used in places where performance matters)

(This rules out things like type punning through pointer casting, and GCC vector types.)

Based heavily on Scott Meyers' "More Effective C++" (Item 30) and other code, I've come up with the following MCVE that seems "right", seems to work, but also seems overcomplicated. (The "proxy" approach is meant to deal with left-hand/right-hand operator[] usage, and the "memcpy" is meant to deal with the type-punning / C++ standard issue.)

I'm wondering if someone has a better solution (and can explain it so I learn something ;^))

#include <iostream>
#include <cstring>
#include "immintrin.h"

using T = __m256i;           // SIMD type
using Te = unsigned int;     // SIMD element type

class SIMD {

    class SIMDProxy;

  public :

    const SIMDProxy operator[](int index) const {
      std::cout << "SIMD::operator[] const" << std::endl;
      return SIMDProxy(const_cast<SIMD&>(*this), index);
    }
    SIMDProxy operator[](int index){
      std::cout << "SIMD::operator[]" << std::endl;
      return SIMDProxy(*this, index);
    }
    Te get(int index) {
      std::cout << "SIMD::get" << std::endl;
      alignas(T) Te tmp[8];
      std::memcpy(tmp, &value, sizeof(T));  // _mm256_store_si256(reinterpret_cast<__m256i *>(tmp), c.value);
      return tmp[index];
    }
    void set(int index, Te x) {
      std::cout << "SIMD::set" << std::endl;
      alignas(T) Te tmp[8];
      std::memcpy(tmp, &value, sizeof(T));  // _mm256_store_si256(reinterpret_cast<__m256i *>(tmp), c.value);
      tmp[index] = x;
      std::memcpy(&value, tmp, sizeof(T));  // c.value = _mm256_load_si256(reinterpret_cast<__m256i const *>(tmp));
    }

    void splat(Te x) {
      alignas(T) Te tmp[8];
      std::memcpy(tmp, &value, sizeof(T));
      for (int i=0; i<8; i++) tmp[i] = x;
      std::memcpy(&value, tmp, sizeof(T));
    }
    void print() {
      alignas(T) Te tmp[8];
      std::memcpy(tmp, &value, sizeof(T));
      for (int i=0; i<8; i++) std::cout << tmp[i] << " ";
      std::cout << std::endl;
    }

  protected :

  private :

    T value;

    class SIMDProxy {
      public :
        SIMDProxy(SIMD & c_, int index_) : c(c_), index(index_) {};
        // lvalue access
        SIMDProxy& operator=(const SIMDProxy& rhs) {
          std::cout << "SIMDProxy::=SIMDProxy" << std::endl;
          c.set(index, rhs.c.get(rhs.index));  // write this proxy's element, read from rhs
          return *this;
        }
        SIMDProxy& operator=(Te x) {
          std::cout << "SIMDProxy::=T" << std::endl;
          c.set(index,x);
          return *this;
        }
        // rvalue access
        operator Te() const {
          std::cout << "SIMDProxy::()" << std::endl;
          return c.get(index);
        }
      private:
        SIMD& c;       // SIMD this proxy refers to
        int index;      // index of element we want
    };
    friend class SIMDProxy;   // give SIMDProxy access into SIMD


};

/** a little main to exercise things **/
int
main(int argc, char *argv[])
{

  SIMD x, y;
  Te a = 3;

  x.splat(1);
  x.print();

  y.splat(2);
  y.print();

  x[0] = a;
  x.print();

  y[1] = a;
  y.print();

  x[1] = y[1]; 
  x.print();
}
justapony
  • I had considered an (unnamed?) union with a std::array. – Surt Oct 09 '20 at 15:27
  • It's my understanding that the pointer aliasing and union approaches, while they may often work, and _are_ valid for C are actually not valid under the C++ standard. (Hence my question). I'm happy if someone wants to prove me wrong. – justapony Oct 09 '20 at 15:47
  • Since vector types are not strictly compliant to C or C++, you are bound to use some sort of a non-standard compiler extension. For example, you can't load or store integer vectors without a `reinterpret_cast`, which is another way of type punning. Further, if you are concerned with performance (and I assume you are, given that you're writing SIMD code), you'd better use target-specific intrinsics to perform the required operations efficiently. – Andrey Semashev Oct 09 '20 at 16:05
  • Lastly, specifically for x86 SSE/AVX, there is no instruction to extract or insert arbitrary element of a vector. There are `pextrw`/`pinsrw` and similar instructions, but (a) they are not available for all element sizes and for some sizes require more advanced ISA extensions and (b) they require a constant index of the element (i.e. you can't use a runtime value to specify the index). There are workarounds, but in general this would require a sequence of intrinsics/instructions, which may be inefficient, depending on the particular use case where you need it. – Andrey Semashev Oct 09 '20 at 16:10
  • Bottom line, avoid extracting/inserting individual elements, if at all possible. In the generic library you write, consider *not* providing these operations to avoid encouraging users to use constructs involving these operations. If you must, provide them with a big warning about inefficiencies and also provide more performant alternatives for common use cases. – Andrey Semashev Oct 09 '20 at 16:13
  • @AndreySemashev “For example, you can't load or store integer vectors without a `reinterpret_cast`, which is another way of type punning.” — No, you can `memcpy`, just as OP does. This is always well-defined, and compilers optimise it out. Furthermore, there’s a (vast) difference between UB and implementation-defined. And while you can’t avoid the latter with SIMD, you can mostly avoid the former. The one exception is when aliasing into an array, which is currently a standard defect that’s scheduled to be fixed in the next revision. – Konrad Rudolph Oct 09 '20 at 17:00
  • @KonradRudolph Ok, I phrased that sentence inaccurately, and indeed you can use `memcpy`. However, the intended way to do this is to use intrinsics, like `_mm_loadu_si128`/`_mm_storeu_si128`, and these normally require `reinterpret_cast`. What allows this to work is the fact that compilers allow type aliasing with vector types and scalar types. This is clearly not pure C/C++, which is why I said you have to rely on some compiler extensions anyway, and I stand by this. – Andrey Semashev Oct 09 '20 at 17:35
  • You may find [this GitHub project](https://github.com/microsoft/DirectXMath) a useful reference. – Chuck Walbourn Oct 09 '20 at 19:16
  • @ChuckWalbourn: Agner Fog's VCL [(github)](https://github.com/vectorclass/version2) is now Apache licensed (formerly GPL), and has similar `operator[]` overloads. – Peter Cordes Oct 10 '20 at 01:26
  • @AndreySemashev: fun fact: AVX512 intrinsics finally switched to `void*` for load/store. (Maybe only for 512-bit vectors, I forget). But finally no more ridiculous casts to `__mXXXi*`, for some types at least. – Peter Cordes Oct 10 '20 at 01:30
  • @KonradRudolph and AndreySemashev: `__m128i*` is a may-alias type. It's well-defined to point it at anything else, just like with `char*`, but it's still undefined to cast an `uint32_t*` to point into a `__m128i` object. (Except with MSVC). Compilers that support Intel intrinsics support union type-punning but *not* pointer-cast type punning. See [print a \_\_m128i variable](https://stackoverflow.com/a/46752535) and [Is \`reinterpret\_cast\`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?](https://stackoverflow.com/q/52112605) – Peter Cordes Oct 10 '20 at 01:31
  • @justapony: there's no reason to use `memcpy` here, just use `_mm256_store_si256` which is more idiomatic for SIMD vectors. Although really compilers can optimize either way into a `vpextrd` or whatever, especially if you only access a single element. Don't manually `splat`, though; use `_mm_set1_epi32(v[0])` or whatever, or ideally a `_mm_shuffle_epi32` or `_mm256_`something. I'm a lot less confident that compilers will optimize that 8x `memcpy` in a loop into `vpbroadcastd ymm0, xmm0` (when AVX2 is available). Even extracting to scalar and then `set1` in the C++ source isn't great. :/ – Peter Cordes Oct 10 '20 at 01:42
  • @PeterCordes > Compilers that support Intel intrinsics support union type-punning but not pointer-cast type punning. -- The compilers in question support union type punning as an extension, and do so regardless whether the involved types are allowed to alias. The types that are allowed to alias other types (including `char` and vector types) can participate in any language constructs that result in aliasing, including casts. The two features are independent. So casting `uint32_t*` to `__m128i*` and loading/storing through the pointer is supported via the latter compiler extension. – Andrey Semashev Oct 10 '20 at 07:49
  • @AndreySemashev: Yes, that's what I said. Union type-punning is safe in general on compilers that support Intel intrinsics, as per C99 and as an extension to C++. So is pointing a `__m128i*` at an array of `uint32_t[]`. But type-punning via pointer-casting in general is *not* safe in GNU C (gcc/clang), only MSVC. e.g. the behaviour of `__m128i v;` / `*(int*)&v;` is *not* defined by GCC. So you can't safely use it to implement `operator[]`. Just like it's not safe to point `int*` at a `char[]`. See the linked answers in my previous comment. Only `may_alias` types can be used like `char*`. – Peter Cordes Oct 10 '20 at 07:55
  • @AndreySemashev: It seemed like your comments were implying that you could do `(int*)&vec` as part of implementing `operator[]`. If that wasn't what you meant, then we fully agree on everything. But if not, remember `int` isn't a may-alias type. You could `typedef int aliasing_int __attribute((may_alias))` and cast to `(aliasing_int*)`... (Also fun fact, some scalar FP load/store intrinsics also treat their `float*` or `double*` arg as a may-alias and maybe unaligned pointer.) – Peter Cordes Oct 10 '20 at 07:57
  • @PeterCordes I was referring to the case when you have to `reinterpret_cast` a pointer to a scalar type to `__m128i*` and then load/store through the resulting pointer. It seemed like you were saying that this is not allowed, and I objected. If I misunderstood you, then sorry, and we are probably in agreement. :) – Andrey Semashev Oct 10 '20 at 10:41
  • @AndreySemashev: Ok, then yes we agree. I think it was misleading to point that out in this case, because you were replying to the OP's comment that *It's my understanding that the pointer aliasing and union approaches, while they may often work, and are valid for C are actually not valid under the C++ standard.* - I thought it was clear they're talking about the kind of pointer-casting you'd use to implement element access, not to do full vector loads from arrays. (Also, no they're not valid for C; C and C++ strict-aliasing pointers rules are the same. ISO C does allow union type punning.) – Peter Cordes Oct 10 '20 at 10:45
  • @AndreySemashev: So it only helps in terms of being able to use `_mm_store_si128((__m128*)tmp, value)` to an int array instead of memcpy. Or using a union with a `__m128i` and an array, which unlike arbitrary pointer-casting *is* safe in compilers that support intrinsics. But that's not what you said. The rules around what's safely portable vs. what isn't are somewhat obscure, so it's important to be precise here about what's safe vs. not. – Peter Cordes Oct 10 '20 at 10:49
  • @PeterCordes > Or using a union with a `__m128i` and an array, which unlike arbitrary pointer-casting is safe in compilers that support intrinsics. -- This is another point where I don't quite agree, as this implies that the support for intrinsics is the indication for support for union-based type punning. It is not, even if all current compilers (probably) support both features. The support for union-based type punning is a separate extension, which may or may not be supported by a compiler that supports intrinsics. One generally has to consult with the compiler docs to know that. – Andrey Semashev Oct 10 '20 at 10:59
  • @AndreySemashev: Ok yes, that's a good point. I don't know if there are any less-mainstream compilers, like maybe SunCC or whatever other x86 compilers that support intrinsics but not (in C++ mode) union type punning. It's not 100% implied by the interface. (MSVC's implementation does use unions, and some people write code that uses `vec.f32[]` or whatever. And any compiler that supports the GNU C++ dialect of C++ must support it. So in practice that covers almost everything, even though it's not because of intrinsics support.) – Peter Cordes Oct 10 '20 at 11:27
  • I think your approach is reasonable, especially when you don't care too much about performance. I do it this way. It's simple, generalizes to types of any width, works the same across vector ISAs (since _every_ ISA is going to have loads and stores but they might vary a lot in the shuffles they offer). Performance isn't even _that_ bad. – BeeOnRope Oct 11 '20 at 00:17
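
For reference, here is a sketch of the union-based element access discussed in the comments above (the type and function names are placeholders). As the comments note, reading the inactive union member is a compiler extension (GNU C/C++, MSVC), not something guaranteed by ISO C++, so it works in practice on compilers that support Intel intrinsics but fails the strict-standard constraint:

#include <immintrin.h>

union SimdUnion {
    __m256i vec;
    unsigned int elems[8];
};

unsigned int get_element(__m256i v, int index) {
    SimdUnion u;
    u.vec = v;
    return u.elems[index];   // read through the inactive member: compiler extension
}

__m256i set_element(__m256i v, int index, unsigned int x) {
    SimdUnion u;
    u.vec = v;
    u.elems[index] = x;
    return u.vec;
}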

3 Answers


Your code is very inefficient. Normally these SIMD types are not present anywhere in memory: they are hardware registers, they don't have addresses, and you can't pass them to memcpy(). Compilers pretend very hard that they're normal variables, which is why your code compiles and probably works, but it's slow: you're doing round trips from registers to memory and back all the time.

Here’s how I would do that, assuming AVX2 and integer lanes.

#include <immintrin.h>
#include <array>
#include <cassert>
#include <cstddef>

class SimdVector
{
    __m256i val;

    alignas( 64 ) static const std::array<int, 8 + 7> s_blendMaskSource;

public:

    int operator[]( size_t lane ) const
    {
        assert( lane < 8 );
        // Move lane index into lowest lane of vector register
        const __m128i shuff = _mm_cvtsi32_si128( (int)lane );
        // Permute the vector so the lane we need is moved to the lowest lane
        // _mm256_castsi128_si256 says "the upper 128 bits of the result are undefined",
        // and we don't care indeed.
        const __m256i tmp = _mm256_permutevar8x32_epi32( val, _mm256_castsi128_si256( shuff ) );
        // Return the lowest lane of the result
        return _mm_cvtsi128_si32( _mm256_castsi256_si128( tmp ) );
    }

    void setLane( size_t lane, int value )
    {
        assert( lane < 8 );
        // Load the blending mask
        const int* const maskLoadPointer = s_blendMaskSource.data() + 7 - lane;
        const __m256i mask = _mm256_loadu_si256( ( const __m256i* )maskLoadPointer );
        // Broadcast the source value into all lanes.
        // The compiler will do equivalent of _mm_cvtsi32_si128 + _mm256_broadcastd_epi32
        const __m256i broadcasted = _mm256_set1_epi32( value );
        // Use vector blending instruction to set the desired lane
        val = _mm256_blendv_epi8( val, broadcasted, mask );
    }

    template<size_t lane>
    int getLane() const
    {
        static_assert( lane < 8 );
        // That thing is not an instruction;
        // compilers emit different ones based on the index
        return _mm256_extract_epi32( val, (int)lane );
    }

    template<size_t lane>
    void setLane( int value )
    {
        static_assert( lane < 8 );
        val = _mm256_insert_epi32( val, value, (int)lane );
    }
};

// Align by 64 bytes to guarantee it's contained within a cache line
alignas( 64 ) const std::array<int, 8 + 7> SimdVector::s_blendMaskSource
{
    0, 0, 0, 0, 0, 0, 0, -1,  0, 0, 0, 0, 0, 0, 0
};

For ARM it's different. If the lane index is known at compile time, see the vgetq_lane_s32 and vsetq_lane_s32 intrinsics.

For setting lanes on ARM you can use the same broadcast + blend trick. Broadcast is vdupq_n_s32. An approximate equivalent of vector blend is vbslq_s32; it handles every bit independently, but for this use case it's equally suitable because -1 has all 32 bits set.

For extracting with a runtime index, either write a switch or store the complete vector into memory; I'm not sure which of the two is more efficient.
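
Going back to lane setting: a minimal sketch of the broadcast + blend approach with NEON intrinsics, for reference (the function name and mask table are placeholders, untested):

#include <arm_neon.h>
#include <cassert>
#include <cstdint>
#include <cstddef>

// A 4-element window of this table contains a single all-ones lane,
// analogous to s_blendMaskSource above.
static const uint32_t neonBlendMask[4 + 3] = { 0, 0, 0, 0xFFFFFFFFu, 0, 0, 0 };

int32x4_t setLaneNeon( int32x4_t v, size_t lane, int32_t value )
{
    assert( lane < 4 );
    // Mask with all bits set in the target lane, zero elsewhere
    const uint32x4_t mask = vld1q_u32( neonBlendMask + 3 - lane );
    // Broadcast the scalar into every lane
    const int32x4_t broadcasted = vdupq_n_s32( value );
    // Bitwise select: take 'broadcasted' where the mask is set, 'v' elsewhere
    return vbslq_s32( mask, broadcasted, v );
}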

Soonts
  • Upvoted, but the OP did mention they wanted slow but portable, presumably for use-cases like debug-prints. Of course in practice someone's going to write code that uses per-element access, maybe even in a loop (in which case storing to memory is typically worth it). Also, if the element index is a compile-time constant, your way could end up worse, defeating optimization into `vpbroadcastd` / `vpblendd` for insert. (Unfortunately there is no `vpinsrd ymm, ymm, r/m32, imm`, only an xmm destination version that would zero the upper lane of an `__m256i`) – Peter Cordes Oct 10 '20 at 12:33
  • @PeterCordes Intrinsics are very portable across compilers. Added methods for indices known at compile-time. – Soonts Oct 10 '20 at 12:48
  • They meant portable across ISAs. *Preferably applicable to other SIMD beyond Intel (ARM, IBM, etc...)*. Note that with GNU C++, you can maybe use `if(__builtin_constant_p(lane)) return getLane<lane>();`. (The arg might have to be declared `constexpr` for you to use it as a template param when it is a constant.) – Peter Cordes Oct 10 '20 at 12:52
  • @PeterCordes Added a note about NEON. Not sure what they meant by IBM, VMX128 from 2013? Anyway, I don’t have any experience with that, only programmed SIMD on Intel/AMD and ARM. – Soonts Oct 10 '20 at 13:06
  • I assume IBM = PowerPC / POWER SIMD, i.e. AltiVec (https://en.wikipedia.org/wiki/AltiVec), which is still part of current POWER architectures. https://wiki.raptorcs.com/wiki/Power_ISA/Vector_Operations. Ok yes, a 2020 doc mentions alternate names, apparently VMX is the "most official" name, but Altivec is still used. http://cdn.openpowerfoundation.org/wp-content/uploads/resources/Intrinsics-Reference/Intrinsics-Reference-20200520.pdf – Peter Cordes Oct 10 '20 at 13:14

Of the original approaches (memcpy, intrinsic load/store) and the additional suggestions (user-defined union punning, user-defined vector type), it seems like the intrinsic approach may have a small advantage. This is based on some quick examples I attempted to code up on Godbolt (https://godbolt.org/z/5zdbKe).

The "best" for writing to an element looks something like this.

__m256i foo2(__m256i x, unsigned int a, int index)
{
    alignas(__m256i) unsigned int tmp[8];
    _mm256_store_si256(reinterpret_cast<__m256i *>(tmp), x);
    tmp[index] = a;
    __m256i z = _mm256_load_si256(reinterpret_cast<__m256i const *>(tmp));
    return z;
}
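
For completeness, the read direction follows the same store-then-index pattern; a quick sketch (the function name is a placeholder, not taken from the Godbolt link):

unsigned int foo2_read(__m256i x, int index)
{
    alignas(__m256i) unsigned int tmp[8];
    _mm256_store_si256(reinterpret_cast<__m256i *>(tmp), x);
    return tmp[index];
}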
justapony
  • Posting an actual answer to your own question is good, and explicitly encouraged. But it should focus on being an answer, not a reply to other answers. (Remove the first couple paragraphs of editorializing / chat, or if you want to keep any of that, move it to the bottom.) Also better if you actually include a code block of part of the C++ and corresponding, not *just* the Godbolt link. The Godbolt link can have more code you didn't include, but including at least an example of the way you find best would be good. – Peter Cordes Oct 12 '20 at 21:30
  • And yes, using intrinsic load/store is idiomatic for dealing with intrinsics. In theory memcpy should be just as efficient, but not shocked if there are cases where there's a missed optimization. `foo1` is surprisingly bad in a way I didn't expect, though, and `-march=haswell` doesn't help ([related](https://stackoverflow.com/questions/52626726/why-doesnt-gcc-resolve-mm256-loadu-pd-as-single-vmovupd) re: split unaligned load/store, but yours are aligned). Possibly worth trying a 4-byte `memcpy` to a byte offset inside the vector, to replace one element *with* memcpy, which is aliasing safe. – Peter Cordes Oct 12 '20 at 21:34
  • `bit_cast` is just type-pun for values (instead of reinterpret cast for pointers); you could bit-cast to `unsigned __int128` and right-shift, but otherwise it's no help with element access. – Peter Cordes Oct 12 '20 at 21:35
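
A sketch of the single-element `memcpy` idea from the comment above, for comparison (the function name is a placeholder; whether compilers optimize it as well as the store/load version is untested here). Copying through the object representation with `memcpy` is aliasing-safe:

#include <immintrin.h>
#include <cstring>

__m256i foo3(__m256i x, unsigned int a, int index)
{
    // Overwrite one 4-byte element at the right byte offset inside the vector object
    std::memcpy(reinterpret_cast<char *>(&x) + index * sizeof(a), &a, sizeof(a));
    return x;
}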

If you only care about g++/clang++/icc compatibility, you can just use the __attribute__ which these compilers use internally to define their intrinsic vector types:

typedef int32_t int32x16_t __attribute__((vector_size(16*sizeof(int32_t)))) __attribute__((aligned(16*sizeof(int32_t))));

When it makes sense (and is possible on the given architecture), variables will be stored in vector registers. Also, the compilers provide a readable/writable operator[] for this typedef (which should get optimized if the index is known at compile time).
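
A small usage sketch of that typedef (the accessor names are placeholders), assuming g++/clang++:

#include <cstdint>

typedef int32_t int32x16_t __attribute__((vector_size(16*sizeof(int32_t)))) __attribute__((aligned(16*sizeof(int32_t))));

int32_t getLane(int32x16_t v, int i)
{
    return v[i];            // built-in read access, runtime index is fine
}

int32x16_t setLane(int32x16_t v, int i, int32_t x)
{
    v[i] = x;               // built-in write access
    return v;
}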

chtz
  • Normally you want to define vectors of fixed byte-width, like `vector_size(64)` for the 16x 4-byte AVX-512 vector you're defining. Also, `aligned()` = size is implicit, and you can put multiple attributes in one comma-separated list. Like `__attribute__((vector_size(16), aligned(1)))` to make an unaligned version that can also alias anything, like char* can. (like GCC uses for `_mm_loadu_si128`) – Peter Cordes Oct 10 '20 at 13:56