Intel's Intrinsics Guide says:
__m128i _mm_i32gather_epi32 (int const* base_addr, __m128i vindex, const int scale)
And:
Description
Gather 32-bit integers from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.
Operation
FOR j := 0 to 3
    i := j*32
    dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale]
ENDFOR
dst[MAX:128] := 0
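To make my reading of the Operation pseudocode concrete, here is a minimal scalar sketch of it. It assumes MEM[...] is byte-addressed, so that index*scale is a byte offset from base_addr. gather_model is a hypothetical helper name of mine, not part of any Intel header, and the sketch ignores dst[MAX:128].

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Scalar model of the Operation pseudocode for _mm_i32gather_epi32.
// Assumption: MEM[...] is byte-addressed, so vindex[j] * scale is a byte offset.
// gather_model is a hypothetical helper, not an Intel API.
std::array<std::int32_t, 4> gather_model(const void* base_addr,
                                         const std::array<std::int32_t, 4>& vindex,
                                         int scale)
{
    // The guide says scale should be 1, 2, 4 or 8.
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);

    const unsigned char* base = static_cast<const unsigned char*>(base_addr);
    std::array<std::int32_t, 4> dst{};
    for (int j = 0; j < 4; ++j) {
        // SignExtend(vindex[j]) * scale, computed in 64 bits.
        const std::int64_t offset = static_cast<std::int64_t>(vindex[j]) * scale;
        // memcpy performs the (possibly unaligned) 32-bit load into lane j.
        std::memcpy(&dst[j], base + offset, sizeof(std::int32_t));
    }
    return dst;  // dst[0] is bits 31:0 of the result, dst[3] is bits 127:96
}
```

Under this model, an index of 20 with a scale of 1 reads the four bytes starting at byte 20 of the array, which for 4-byte elements is arr[5]. The caller has to keep every offset inside the array.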
If I am parsing things correctly, then vindex (with scale) gives the indexes into base_addr used to create the __m128i result.
Below I am trying to create val = arr[1] << 96 | arr[5] << 64 | arr[9] << 32 | arr[13] << 0. That is, starting at index 1, take every 4th element.
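To spell out the layout I am after, here is a scalar sketch of the expected result. It assumes 32-bit lanes numbered from the low end of the register, so arr[13] lands in bits 31:0 and arr[1] in bits 127:96. expected_lanes and as_halves are hypothetical helper names used only for illustration.

```cpp
#include <array>
#include <cstdint>

// Expected 32-bit lanes, listed lowest-first:
// lanes[0] is bits 31:0 of val, lanes[3] is bits 127:96.
// Note that _mm_set_epi32(e3, e2, e1, e0) takes its arguments highest-lane-first.
std::array<std::uint32_t, 4> expected_lanes(const std::uint32_t* arr)
{
    return std::array<std::uint32_t, 4>{{arr[13], arr[9], arr[5], arr[1]}};
}

// Pack the four lanes into two 64-bit halves, low half first,
// which is the form a debugger typically uses when printing an __m128i.
std::array<std::uint64_t, 2> as_halves(const std::array<std::uint32_t, 4>& lanes)
{
    const std::uint64_t low  = (static_cast<std::uint64_t>(lanes[1]) << 32) | lanes[0];
    const std::uint64_t high = (static_cast<std::uint64_t>(lanes[3]) << 32) | lanes[2];
    return std::array<std::uint64_t, 2>{{low, high}};
}
```

For arr = {0, 1, ..., 15} this gives the halves 0x000000090000000d and 0x0000000100000005, which is what I would expect val to contain.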
$ cat gather.cxx
#include <immintrin.h>
typedef unsigned int u32;
int main(int argc, char* argv[])
{
    u32 arr[16] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
    __m128i idx = _mm_set_epi32(1,5,9,13);
    __m128i val = _mm_i32gather_epi32(arr, idx, 1);
    return 0;
}
But when I examine val:
(gdb) n
6 __m128i idx = _mm_set_epi32(1,5,9,13);
(gdb) n
7 __m128i val = _mm_i32gather_epi32(arr, idx, 1);
(gdb) n
8 return 0;
(gdb) p val
$1 = {0x300000004000000, 0x100000002000000}
It appears I am using vindex incorrectly: I seem to be selecting indices 1,2,3,4.
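To read the gdb output lane by lane, the two 64-bit values can be split into four 32-bit lanes. A minimal sketch, assuming gdb printed the low 64 bits first. split_lanes is a hypothetical helper name.

```cpp
#include <array>
#include <cstdint>

// Split two 64-bit halves of a 128-bit value into four 32-bit lanes,
// lowest lane first. Assumption: `low` holds bits 63:0 and `high` bits 127:64.
std::array<std::uint32_t, 4> split_lanes(std::uint64_t low, std::uint64_t high)
{
    return std::array<std::uint32_t, 4>{{
        static_cast<std::uint32_t>(low & 0xffffffffu),   // bits 31:0
        static_cast<std::uint32_t>(low >> 32),           // bits 63:32
        static_cast<std::uint32_t>(high & 0xffffffffu),  // bits 95:64
        static_cast<std::uint32_t>(high >> 32)           // bits 127:96
    }};
}
```

Applied to 0x300000004000000 and 0x100000002000000, this gives the lanes 0x04000000, 0x03000000, 0x02000000 and 0x01000000. So each lane holds 4, 3, 2 or 1 shifted left by 24 bits, not a whole array element.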
How do I use vindex and scale to select array indices 1,5,9,13?