VCL types can implicitly convert to/from __m256i (thanks to overloaded cast operators), so you can just use _mm256_i32gather_epi32.
Since you have run-time-variable indices, you know they can't be template parameters; that template is, I think, for letting template metaprogramming optimize a fixed gather into maybe some loads + shuffles, e.g. if multiple elements come from near each other.
If you search for gather in https://github.com/vectorclass/version2/blob/master/vectori256.h, you'll find that there's a wrapper function, template<int n> Vec8i lookup(Vec8i const index, void const * table), but that tries to emulate shuffles which just use the low few bits of the index: it clamps or modulos the vector of indices before using it with _mm256_i32gather_epi32.
And there are the template functions you found for fixed indices, like gather8i.
So there don't appear to be any wrappers for just _mm256_i32gather_epi32. That's not surprising: VCL isn't trying to hide the Intel intrinsics, just add convenience on top of them, like operator overloads. When a raw intrinsic does exactly what you want, just use it, especially if a quick search of the header file doesn't find another function that uses it without stuff you don't want.
If you want to write code that's adaptable to different vector widths the way you can with VCL wrapper functions and operators, you could write your own overloaded wrappers.
#include <immintrin.h>
#include <vectorclass.h>
// works with GCC with -O2 or higher.
// clang, or gcc -O0, would need hard-coded or template-parameter scale
#ifdef __AVX512F__
// VCL should define Vec16i if AVX-512 is available.
inline __attribute__((always_inline)) // because scale needs to be a compile-time constant
Vec16i vpgatherdd(Vec16i idx, const void *base, int scale){
// __m512i version, intrinsic takes void* and this arg order
return _mm512_i32gather_epi32(idx, base, scale);
}
#endif
// AVX2
inline __attribute__((always_inline))
Vec8i vpgatherdd(Vec8i idx, const void *base, int scale){
// __m256i version introduced with AVX2, intrinsic takes int* and other arg order
return _mm256_i32gather_epi32((const int*)base, idx, scale);
}
inline __attribute__((always_inline))
Vec4i vpgatherdd(Vec4i idx, const void *base, int scale){
// __m128i version, same as __m256i version
return _mm_i32gather_epi32((const int*)base, idx, scale);
}
If you always use it with scale=4, you might omit that function arg and hard-code it into the definition, like I did on Godbolt to check that this would compile. (scale has to be an immediate, so a constant expression for the intrinsic, at least after inlining + constant propagation with optimization enabled. GCC allows this, but clang still complains even with optimization enabled, so you'd have to use a template parameter, perhaps with a default of 4. Or of course just hard-code the 4 into the wrapper functions if you don't need to use it any other way.)
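To sketch the clang-friendly version: here's what the template-parameter approach could look like, using raw __m256i instead of Vec8i so it's self-contained. The names (vpgatherdd_t, demo_first_elem) are mine, not VCL's, and I've used a target("avx2") attribute instead of assuming -mavx2 compile flags.

```cpp
#include <immintrin.h>

// scale as a template parameter: a constant expression even at -O0 and on
// clang, which won't constant-propagate a plain int arg into the immediate.
template<int scale = 4>
__attribute__((target("avx2")))
inline __m256i vpgatherdd_t(__m256i idx, const void *base){
    return _mm256_i32gather_epi32((const int*)base, idx, scale);
}

// Demo: gather arr[] in reverse order with the default scale of 4 bytes.
__attribute__((target("avx2")))
inline int demo_first_elem(){
    int arr[8] = {10,11,12,13,14,15,16,17};
    __m256i idx = _mm256_setr_epi32(7,6,5,4,3,2,1,0);
    __m256i v = vpgatherdd_t<>(idx, arr);   // <> uses the default scale=4
    return _mm256_extract_epi32(v, 0);      // lane 0 loaded arr[7]
}
```

Calling it as vpgatherdd_t<1> or vpgatherdd_t<8> gets you the other scales without any reliance on constant propagation.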
Taking void* for the base makes it easy to use with any pointer, although you might want to take int* to prevent accidentally passing it the address of a std::vector control block, like &vec instead of vec.data(), especially if you fold the scale=4 into the function.
As is, this is a pure wrapper for exactly what the asm instruction can do, nothing more, nothing less, just like the intrinsic. You can use it with base=0 and scale=1 to dereference 32-bit pointers, instead of indexing an array. Or with scale=8 to grab an int from 2-element structs, or with scale=1 or 2 to do potentially unaligned loads, or use byte offsets.
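For example, the scale=8 trick for 2-element structs could look like this, as a hedged sketch with raw intrinsics (Pair and gather_pair_a are hypothetical names, not anything from VCL):

```cpp
#include <immintrin.h>

struct Pair { int a, b; };   // 8 bytes per element

// Gather pairs[idx[i]].a for each lane: base points at the first 'a',
// and scale=8 strides over whole Pairs instead of ints.
__attribute__((target("avx2")))
inline __m256i gather_pair_a(const Pair *pairs, __m256i idx){
    return _mm256_i32gather_epi32(&pairs->a, idx, 8);
}

__attribute__((target("avx2")))
inline int demo_pair(){
    Pair ps[4] = {{1,-1},{2,-2},{3,-3},{4,-4}};
    __m256i idx = _mm256_setr_epi32(3,2,1,0, 0,0,0,0);
    return _mm256_extract_epi32(gather_pair_a(ps, idx), 0);  // ps[3].a
}
```

Pointing base at &pairs->b instead would grab the other member, with no change to the indices.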
(Well, the asm instruction also takes a mask, _mm256_mask_i32gather_epi32, but mostly that's about being able to make partial progress on a page fault on one element. You can of course start with a mask that's not all-ones. The instruction isn't faster in that case, so it's not great if your masks are often sparse.)
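If you do want a non-trivial starting mask, it works like this: lanes whose mask element has its top bit set load from memory, and the others keep the corresponding value from the src operand. A small sketch (function names are mine):

```cpp
#include <immintrin.h>

// Gather only the even lanes; odd lanes pass through from 'src'.
__attribute__((target("avx2")))
inline __m256i gather_even_lanes(const int *base, __m256i idx, __m256i src){
    __m256i mask = _mm256_setr_epi32(-1,0,-1,0,-1,0,-1,0);  // top bit selects
    return _mm256_mask_i32gather_epi32(src, base, idx, mask, 4);
}

__attribute__((target("avx2")))
inline int demo_masked(int lane){
    int arr[8] = {100,101,102,103,104,105,106,107};
    __m256i idx = _mm256_setr_epi32(0,1,2,3,4,5,6,7);
    __m256i src = _mm256_set1_epi32(-1);
    alignas(32) int out[8];
    _mm256_store_si256((__m256i*)out, gather_even_lanes(arr, idx, src));
    return out[lane];   // even lanes: arr[lane]; odd lanes: -1 from src
}
```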
You might want to name your wrapper function something more generic that doesn't include the element size, but C++ overloads only work based on args, not return value, so a generic gather(Vec8i) function couldn't distinguish vpgatherdd from vpgatherdq using 32-bit indices to access 64-bit elements.
You could I guess template on the destination type and make template overloads, as a way to let you write code like gather<T>(vec, base, sizeof(dst[0])). Maybe you'd want to bake scale into the overloads / template specializations instead of having the caller need to come up with it.
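One possible shape for that, as a hypothetical sketch (all names mine, not VCL's): specialize on the destination element type and bake scale = sizeof(T) into each specialization. Note the index vector type differs, since vpgatherdq takes 4 dword indices to produce 4 qwords:

```cpp
#include <immintrin.h>

template<typename T> struct Gather;   // T selects element width and scale

template<> struct Gather<int> {       // vpgatherdd: 8 dword elements
    __attribute__((target("avx2")))
    static __m256i from(const void *base, __m256i idx){
        return _mm256_i32gather_epi32((const int*)base, idx, sizeof(int));
    }
};
template<> struct Gather<long long> { // vpgatherdq: 4 dword indices -> 4 qwords
    __attribute__((target("avx2")))
    static __m256i from(const void *base, __m128i idx){
        return _mm256_i32gather_epi64((const long long*)base, idx, sizeof(long long));
    }
};

__attribute__((target("avx2")))
inline long long demo_q(){
    long long q[4] = {100,200,300,400};
    __m256i v = Gather<long long>::from(q, _mm_setr_epi32(3,0,1,2));
    return _mm256_extract_epi64(v, 0);   // lane 0 loaded q[3]
}
```

The caller never writes a scale at all; the type parameter implies it, which also removes the &vec vs. vec.data() footgun for the common case.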