How to use vindex and scale with _mm_i32gather_epi32 to gather elements?

Question

__m128i _mm_i32gather_epi32 (int const* base_addr, __m128i vindex, const int scale)

And:

Description

Gather 32-bit integers from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation
FOR j := 0 to 3
  i := j*32
  dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale]
ENDFOR
dst[MAX:128] := 0

If I am parsing things correctly then vindex (with scale) are the indexes into base_addr used to create the __m128i result.

Below I am trying to create val = arr[1] << 96 | arr[5] << 64 | arr[9] << 32 | arr[13] << 0. That is, starting at 1 take every 4th element.

$ cat -n gather.cxx
 1  #include <immintrin.h>
 2  typedef unsigned int u32;
 3  int main(int argc, char* argv[])
 4  {
 5          u32 arr[16] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
 6          __m128i idx = _mm_set_epi32(1,5,9,13);
 7          __m128i val = _mm_i32gather_epi32(arr, idx, 1);
 8          return 0;
 9   }

But when I examine val:

(gdb) n
6               __m128i idx = _mm_set_epi32(1,5,9,13);
(gdb) n
7               __m128i val = _mm_i32gather_epi32(arr, idx, 1);
(gdb) n
8               return 0;
(gdb) p val
$1 = {0x300000004000000, 0x100000002000000}

It appears I am using vindex incorrectly. It appears I am selecting indices 1,2,3,4.

How do I use vindex and scale to select array indices 1,5,9,13?

off topic: `uint32_t` would be one less non-standard thing for people to check that you're not doing wrong when reading your question. Also, I can't trivially copy/paste this code to try it because you only have a version with line numbers. I guess I can use `cut -b 4- > foo.c`. — Peter Cordes, Jun 15 '18 at 23:36
@PeterCordes - If you are having that many problems then maybe you should skip this question. — jww, Jun 15 '18 at 23:44
If I had, you wouldn't already have a working answer. When asking questions, it's a good idea to make it as easy as possible for people to try your code if they want to. (The answer didn't click for me until I single-stepped the asm and saw the same result you did, then printed the index register.) — Peter Cordes, Jun 15 '18 at 23:45

Peter Cordes · Answer 1 · 2018-06-16T03:22:14.273

2

Your array elements are 4 bytes wide. Therefore use a scale factor of 4 in the VSIB addressing mode when using element indices instead of byte offsets.

The int const* base_addr argument has type int, but at no point is any C pointer math done with it. It's fed directly to the asm instruction, so you need to take care of byte offsets. (And hopefully also taking care of strict aliasing in case you want to grab dwords out of a uint64_t[] or char[].) It could just as well be a const void*.

If the intrinsic multiplied your scale factor by 4, you wouldn't be able to use it with byte offsets, only with int indices. The asm instruction can scale by 1,2,4, or 8, using the usual x86 addressing mode encoding: a 2 bit shift count.

A strided index with a stride of 4, starting at 1, gets zeros everywhere except the high byte of each element. i.e. it's offset by 1 byte from the the start of the array, and x86 is little endian.

Notice that you didn't get 1,2,3,4, you got 1<<24, 2<<24, etc. Printing as one big 64-bit integer makes that harder to spot.

With that source change of scale = 1 -> 4, your gather is an identity mapping:

(gdb) p  $xmm7.v4_int32
$2 = {13, 9, 5, 1}

I'm not sure if GDB has a convenient way to print the elements of a __m128i variable without knowing what register it's in.

edited Jun 16 '18 at 03:22

answered Jun 15 '18 at 23:42

Peter Cordes

328,167
45
605
847

*"I'm not sure if GDB has a convenient way to print the elements of a __m128i..."* - I think the underlying representation is all gdb knows. For example, `ptype __m128i` returns `type = long long __attribute__ ((vector_size(2)))`. – jww Jun 16 '18 at 00:27
Things are not making sense. It seems like the documentation is wrong. There is no way `base_addr[1]`, `base_addr[5]`, `base_addr[9]`, `base_addr[13]` arrives at `{1,2,3,4}` given `int const* base_addr`. It seems like the function is calculating a byte offset into `base_addr`. That is, it seems more like a PowerPC API call treating `base_addr` as `void const*` or `uint8_t const*`. – jww Jun 16 '18 at 01:25
@jww: Yeah, the intrinsic maps to the asm instruction directly, so the pointer is just a base address and doesn't imply any scaling. If it were otherwise, you'd need a separate intrinsic to gather dwords from byte offsets (i.e. if you had indices that were already scaled in the vector). `int*` is not a meaningful choice for the pointer type (I assume / hope that `_mm_i32gather_epi32` is strict-aliasing safe when loading from any object(s)), but it may avoid some casting in the common case of using it on an `int[]`. The Operation section you quoted in the question documents this behaviour. – Peter Cordes Jun 16 '18 at 01:48

How to use vindex and scale with _mm_i32gather_epi32 to gather elements?

1 Answers1

Linked