
I want to vectorize the following code using gather instructions in VCL. Some operations should be performed on the elements of vSource at indexes defined by another vector, vInd:

vector<int> vSource;
vector<int> vInd;
for (size_t i = 0; i < vInd.size(); i++) {
    vSource[ vInd[i] ]; // some work on this element
}

vInd contains completely random indexes, so I cannot shuffle them or use any other cheap workaround. Desired output example:

vector<int> vSource = {1,2,3,4,5,6,7,8,9,10,11,12,13};
vector<int> vInd = {2,1,5,3,10,5,8,2,10,2,5,3};
3   2   6   4   11  6   9   3   

I can vectorize my code using AVX2.

void intrinGather(vector<int> &vSource, vector<int> &vInd) {
    __m256i ind = _mm256_loadu_si256((__m256i*) &vInd[0]);
    __m256i vec = _mm256_i32gather_epi32(&vSource[0], ind, 4); // gather 8 ints, scale = sizeof(int)
}

However, the VCL version compiles only if I use compile-time indexes. How can I pass arbitrary run-time indexes as a parameter to VCL?

void VCLGather(vector<int> & vSource, vector<int> ind) {
    Vec8i vec;
    vec=gather8i<2,1,5,3,10,5,8,2>(&vSource[0]); //compiles
    //vec=gather8i<ind[0],ind[3],ind[2],ind[10],ind[6],ind[8],ind[7],ind[1]>(&vSource[0]); //doesn't compile
}

I'm perfectly fine with the intrinGather function, but I want to keep the code in the same VCL style and keep features like multi-architecture code. Is that possible?

Vladislav Kogan

3 Answers


The VCL template function lookup<n>(index, table) is indeed intended for this purpose.

VCL will search for the optimal implementation of your function. It will use a permute instruction rather than a gather instruction if n is not too big, because permute instructions are much faster than gather instructions. The n parameter is there to enable this optimization.

The lookup<n> templates limit each index to the interval 0 ≤ i < n for security reasons. If you don't want this check, you may set n = INT_MAX. I will change the code to make sure the interval check is optimized away in this case.
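For illustration, a minimal sketch of what that call could look like with run-time indices (my own example, assuming at least 8 valid entries in vInd):

#include <climits>
Vec8i idx, result;
idx.load(&vInd[0]);                          // run-time indices
result = lookup<INT_MAX>(idx, &vSource[0]);  // n = INT_MAX: behaves like a plain gather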

A Fog

VCL types can implicitly convert to/from __m256i (thanks to overloaded cast operators), so you can just use _mm256_i32gather_epi32.
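For instance (a minimal sketch of that idea, assuming at least 8 valid entries in vInd):

Vec8i ind, vec;
ind.load(&vInd[0]);                                 // Vec8i converts implicitly to __m256i here
vec = _mm256_i32gather_epi32(&vSource[0], ind, 4);  // and the __m256i result converts back to Vec8i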

Since you know you have run-time variable indices, you know they can't be template parameters; that template is I think for letting template metaprogramming optimize a fixed gather into maybe some loads + shuffles, e.g. if multiple elements come from near each other.

If you search for gather in https://github.com/vectorclass/version2/blob/master/vectori256.h, you'll find that there's a wrapper function template<int n> Vec8i lookup(Vec8i const index, void const * table), but that tries to emulate shuffles which just use the low few bits of the index: it clamps or modulos the vector of indices before using it with _mm256_i32gather_epi32.

And the template functions you found for fixed indices, like gather8i.


So there don't appear to be any wrappers for just _mm256_i32gather_epi32. That's not surprising: VCL isn't trying to hide the Intel intrinsics, just add convenience on top of them, like operator overloads. When a raw intrinsic does exactly what you want, just use it, especially if a quick search of the header file doesn't find another function that uses it without stuff you don't want.


If you want to write code that's adaptable to different vector widths the way you can with VCL wrapper functions and operators, you could write your own overloaded wrappers.

#include <immintrin.h>
#include <vectorclass.h>

// works with GCC with -O2 or higher.
// clang, or gcc -O0, would need hard-coded or template-parameter scale

#ifdef __AVX512F__
// VCL should define Vec16i if AVX-512 is available.
inline __attribute__((always_inline))  // because scale needs to be a compile-time constant
Vec16i vpgatherdd(Vec16i idx, const void *base, int scale){
   // __m512i version, intrinsic takes void* and this arg order
   return _mm512_i32gather_epi32(idx, base, scale);
}
#endif

// AVX2
inline __attribute__((always_inline))
Vec8i vpgatherdd(Vec8i idx, const void *base, int scale){
   // __m256i version introduced with AVX2, intrinsic takes int* and other arg order
   return _mm256_i32gather_epi32((const int*)base, idx, scale);
}

inline __attribute__((always_inline))
Vec4i vpgatherdd(Vec4i idx, const void *base, int scale){
   // __m128i version, same as __m256i version
   return _mm_i32gather_epi32((const int*)base, idx, scale);
}

If you always use it with scale=4 you might omit that function arg and hard-code it into the definition, like I did on Godbolt to check that this would compile. (scale has to be an immediate, so a constant expression for the intrinsic, at least after inlining + constant propagation with optimization enabled. GCC allows this, but clang still complains even with optimization enabled, so you'd have to use a template parameter, perhaps with a default of 4. Or of course just hard-code the 4 into the wrapper functions if you don't need to use it any other way.)
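For example, the template-parameter version might look like this (my own sketch following that suggestion; the default of 4 and reusing the vpgatherdd name are assumptions):

// scale as a template parameter: a compile-time constant even without optimization, so clang accepts it.
// The 2-arg call doesn't collide with the 3-arg overloads above.
template <int scale = 4>
inline __attribute__((always_inline))
Vec8i vpgatherdd(Vec8i idx, const void *base){
   return _mm256_i32gather_epi32((const int*)base, idx, scale);
}

A call like vpgatherdd<8>(idx, base) then bakes the scale in at the call site, and vpgatherdd(idx, base) defaults to 4.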

Taking void* for the base makes it easy to use with any pointer, although you might want to take int* to prevent accidentally passing it the address of a std::vector control block, like &vec instead of vec.data(), especially if you fold the scale=4 into the function.

As is, this is a pure wrapper for exactly what the asm instruction can do, nothing more, nothing less, just like the intrinsic. You can use it with base=0 and scale=1 to dereference 32-bit pointers, instead of indexing an array. Or with scale=8 to grab an int from 2-element structs, or with scale=1 or 2 to do potentially unaligned loads, or use byte offsets.
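For example (a usage sketch with made-up data, assuming optimization is enabled so the scale argument constant-propagates):

struct Pair { int a, b; };
Pair pairs[16] = {};
int  ints[16]  = {};
Vec8i idx(2, 1, 5, 3, 10, 5, 8, 2);
Vec8i a_vals = vpgatherdd(idx, pairs, 8);  // scale 8: the first int of each 2-int struct
Vec8i i_vals = vpgatherdd(idx, ints, 4);   // plain int gather, byte offset = index * 4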

(Well, the asm instruction also takes a mask, _mm256_mask_i32gather_epi32, but mostly that's about being able to make partial progress on a page fault on one element. You can of course start with a mask that's not all-ones. The instruction isn't faster in that case, so it's not great if your masks are often sparse.)

You might want to name your wrapper function something more generic that doesn't include the element size, but C++ overloads only work based on args, not return value, so a generic gather(Vec8i) function couldn't distinguish vpgatherdd from vpgatherdq using 32-bit indices to access 64-bit elements.

You could I guess template on the destination type and make template overloads, as a way to let you write code like gather<T>(vec, base, sizeof (dst[0])). Maybe you'd want to bake scale into the overloads / template specializations instead of having the caller need to come up with it.
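A hypothetical sketch of that idea (my own names, dispatching with C++17 if constexpr instead of separate overloads; note that vpgatherdq with 32-bit indices only consumes the low 4 of the 8 index lanes):

// gather<T>: T is the element type being loaded; the scale is baked in per element size
template <typename T>
inline __attribute__((always_inline))
auto gather(Vec8i idx, const void *base){
   if constexpr (sizeof(T) == 4)
       return Vec8i(_mm256_i32gather_epi32((const int*)base, idx, 4));         // vpgatherdd
   else
       return Vec4q(_mm256_i32gather_epi64((const long long*)base,
                                           _mm256_castsi256_si128(idx), 8));   // vpgatherdq, low 4 indices
}

Usage would then be something like gather<int64_t>(idx, table), and the right instruction and scale fall out of the element type.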

Peter Cordes
  • Yes, but this way there will be a drawback. One of the advantages of VCL is that it can split one _mm512/__m512 operation/data type into two _mm256/__m256 ones. That's harder if you're writing pure intrinsics. I also checked the lookup function and it doesn't work either despite having _mm256_i32gather_epi32 in its source code. Thanks anyway. – Vladislav Kogan Mar 24 '23 at 20:40
  • @VladislavKogan: Perhaps `lookup` could optimize away since the clamping would be a no-op? You'd have to check the asm; you might get correct results but inefficient asm. `lookup< 1L<<32 >` of course can't work because 2^32 doesn't fit in an `int`. You're right, using raw intrinsics does lose one of the advantages of VCL, which is that you can change or template just the vector width without changing the rest of the code. You can always write your own overloaded wrapper functions for the few intrinsics that are a problem for code reuse. – Peter Cordes Mar 24 '23 at 21:06
  • @VladislavKogan: Updated with an example overloaded wrapper function. – Peter Cordes Mar 26 '23 at 17:30
  • Your example with __m512i is right, however __m256i and __m128i won't compile. The gather for __m512i takes (vindex, base_addr, scale), but the _mm256 gather uses the (base_addr, vindex, scale) argument order instead, and same for the _mm128 gather. Intel is kinda weird here. Also I couldn't pass scale as an argument ([here](https://godbolt.org/z/5n13qoMbd)), so I completely removed it. – Vladislav Kogan Mar 27 '23 at 17:56
  • @VladislavKogan: Thanks for the reminder, updated. The 128 and 256-bit intrinsics were new with AVX2, and used `const int*`. The 512-bit intrinsic was new with AVX-512, when Intel started using `void*` for their load/store intrinsics; IDK why they also changed the arg order. Also, `__AVX512F__` is the right macro, not `__AVX512__`, oops. :P GCC *does* allow constant-propagation after inlining to get a constant scale to the underlying `__builtin`, but only if you enable optimization. Otherwise you'd have to make it a `#define` macro like `immintrin.h` does. https://godbolt.org/z/xscPzjWxM works – Peter Cordes Mar 27 '23 at 18:27
  • @VladislavKogan: But yes, if you want debug builds to be possible, you can't pass scale as an arg. GCC allows it with optimization, clang doesn't allow it even *with* optimization enabled, at least in C++. https://godbolt.org/z/sM3zGn5ET. `inline __attribute__((always_inline))` isn't enough to keep it happy. Perhaps a template parameter would be good, perhaps with a default of 4. Or if you don't need that, for sure just put the constant `4` into the wrapper function. – Peter Cordes Mar 27 '23 at 18:33
  • I checked performance (not the generated asm); in fact lookup optimized very well, giving almost the same results as the inlined intrinsics (https://godbolt.org/z/GaE59b9zq). I guess the (if constexpr) branches in the source code are optimized away by the compiler, so the asm is probably efficient enough. – Vladislav Kogan Apr 08 '23 at 15:18
  • @VladislavKogan: Yeah, clang optimizes away the clamping so only the gather remains. GCC doesn't; it still includes a `vpminud`. https://godbolt.org/z/W6xM48hsf (And MSVC definitely won't optimize away intrinsics.) That normally won't matter when you're bottlenecked on gather throughput on an Intel CPU (where the front-end and ALU ports aren't part of the bottleneck), but it's not as efficient and could be a slowdown in different code that did a lot of computation surrounding the gathers. – Peter Cordes Apr 08 '23 at 20:47

Answer 1: use VCL's lookup<>() function (best answer)

VCL has a lookup<>() function. If given INT_MAX as the template parameter, it performs almost the same as the raw intrinsics. No need to reinvent the wheel.

#include <climits>
Vec8i lookingup, idx;
for (size_t i = 0; i < vInd.size(); i += 8) {   // assumes the size is a multiple of 8
    idx.load(&vInd[i]);
    lookingup = lookup<INT_MAX>(idx, &vSource[0]);
    lookingup.store(&vDest[i]);                 // vDest: an output vector of the same size
}

Answer 2: write a raw intrinsics function

VCL doesn't have a direct equivalent of _mm256_i32gather_epi32. The best way is to conditionally call the AVX2 intrinsic directly when AVX2 is enabled, using VCL's INSTRSET macro.

#if INSTRSET >= 8   // INSTRSET is VCL's compile-time instruction-set level; 8 means AVX2 (add an _mm512 gather branch if you want as well)
    intrinGather(vSource, vInd);
#else
    loadScalar(vSource, vInd);  // scalar fallback
#endif

Answer 3: write custom wrappers for VCL

You can wrap Vec16i, Vec8i and Vec4i in the same fashion. Note that Intel's argument order for the _mm512 gather is slightly different from the _mm256 and _mm128 versions.

#include <immintrin.h>
#include <vectorclass.h>
#include <vector>
using std::vector;

#ifdef __AVX512F__
__attribute__((always_inline))
inline Vec16i vpgatherdd(const int *base, Vec16i idx){
   // AVX-512 argument order: (vindex, base_addr, scale)
   return _mm512_i32gather_epi32(idx, base, sizeof(int));
}
#endif
#ifdef __AVX2__
__attribute__((always_inline))
inline Vec8i vpgatherdd(const int *base, Vec8i idx){
   // AVX2 argument order: (base_addr, vindex, scale)
   return _mm256_i32gather_epi32(base, idx, sizeof(int));
}
#endif

int main()
{
    vector<int> vSource = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16};
    vector<int> vInd = {2,1,5,3,10,5,8,2,10,2,5,3,10,12,2,14,11,5,8};
    // Overloaded function wrapper example: gathers the first 8 indexed elements
    Vec8i ind;
    ind.load(&vInd[0]);
    Vec8i gathered = vpgatherdd(&vSource[0], ind);
}
Vladislav Kogan