How do the AVX(2) gather instructions actually compute the fetch address?

Question

The current Intel intrinsics guide for _mm_i32gather_epi32() describes the computed address for each subword as:

addr := base_addr + SignExtend64(vindex[m+31:m]) * ZeroExtend64(scale) * 8

That last 8 puzzles me. Assuming addr and base_addr are in bytes and scale takes a value of 1, 2, 4 or 8, then you can only ever index strides of 8 bytes from the base address. Is this an error in the docs, or am I missing something? It's described the same way for all the gather instructions I checked.

A previous question quotes the docs without that 8, which suggests something has changed.

Seems like a typo in the Intel Intrinsics Guide. The SDM does not mention any fixed 8 multiplier for VPGATHERDD. — Andrey Semashev, Mar 26 '21 at 17:41

score 5 · Accepted Answer · answered Mar 26 '21 at 18:26

Note the next line in the pseudo-code:

dst[i+31:i] := MEM[addr+31:addr]

Apparently someone decided it would be a good idea to describe the memory address as a bit address rather than a byte address. /facepalm. That doesn't really make sense, is not what anyone would expect, and isn't even done consistently: they failed to scale base_addr by 8, so the pseudo-code adds a bit offset to a byte address.

This is just poor documentation, and a worse way to describe the operation than the previous version quoted in the linked question. It's only a documentation change, not a change to what the code means; you could have confirmed that by compiling it and looking at the asm to see the actual instruction generated. (My answer on the question you linked is still correct: the asm instruction allows a scale factor of 1, 2, 4, or 8, encoded as a 2-bit shift count the same way scalar instructions encode scaled-index addressing modes. So with scale = 1 you can use a vector of byte offsets.)
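One way to check this empirically, as a rough sketch: the helper below passes byte offsets with scale = 1 to _mm_i32gather_epi32 and can be compared against a plain memcpy loop. The names gather4_byte_offsets, gather4_avx2 and gather4_scalar are made up for illustration, and the AVX2 path assumes GCC or Clang on x86 (for the target attribute and __builtin_cpu_supports); elsewhere it falls back to the scalar loop.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define HAVE_AVX2_PATH 1

/* Hypothetical helper: built for AVX2 even without -mavx2 on the command line. */
__attribute__((target("avx2")))
static void gather4_avx2(const void *base, const int32_t byte_off[4],
                         int32_t out[4])
{
    __m128i vindex = _mm_loadu_si128((const __m128i *)byte_off);
    /* scale = 1, so each 32-bit index is a signed byte offset from base. */
    __m128i v = _mm_i32gather_epi32((const int *)base, vindex, 1);
    _mm_storeu_si128((__m128i *)out, v);
}
#endif

/* Portable reference: load 4 bytes at base + byte_off[j] for each lane. */
static void gather4_scalar(const void *base, const int32_t byte_off[4],
                           int32_t out[4])
{
    for (int j = 0; j < 4; j++)
        memcpy(&out[j], (const unsigned char *)base + byte_off[j],
               sizeof out[j]);
}

/* Uses the gather instruction when the CPU has AVX2, else the scalar loop. */
void gather4_byte_offsets(const void *base, const int32_t byte_off[4],
                          int32_t out[4])
{
    assert(base != NULL);
#ifdef HAVE_AVX2_PATH
    if (__builtin_cpu_supports("avx2")) {
        gather4_avx2(base, byte_off, out);
        return;
    }
#endif
    gather4_scalar(base, byte_off, out);
}
```

If the extra * 8 in the guide's pseudo-code were a real byte multiplier, the two paths would disagree for any nonzero offset; with offsets such as {0, 1, 5, 28} into a 32-byte buffer they should return the same four dwords.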

The previous better pseudo-code was:

dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale]

So MEM[] (the virtual address space) is being indexed with the calculated byte address, and the 32-bit access width is implied by the dst[i+31:i] bit width.
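To make that reading concrete, here is a minimal scalar model of the older pseudo-code in C. The function name gather_epi32_model and the lanes parameter are assumptions for illustration, not anything from Intel's docs; the point is only that the sign-extended index times scale is added to the base as a byte count.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Models: dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i]) * scale]
 * for each of `lanes` 32-bit elements. All arithmetic is in bytes. */
void gather_epi32_model(int32_t *dst, const void *base_addr,
                        const int32_t *vindex, int lanes, int scale)
{
    /* The instruction encoding only allows these four scale factors. */
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);

    for (int j = 0; j < lanes; j++) {
        /* Sign-extend the 32-bit index to 64 bits, then scale it. */
        int64_t byte_offset = (int64_t)vindex[j] * scale;
        /* 32-bit load from the computed byte address (may be unaligned). */
        memcpy(&dst[j], (const unsigned char *)base_addr + byte_offset,
               sizeof dst[j]);
    }
}
```

With scale = 4 the indices behave like int32_t array subscripts (negative ones included); with scale = 1 they are plain byte offsets.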

As a rule of thumb, intrinsics generally map as directly as possible to the asm instructions. They wouldn't choose to define it in a way that requires the compiler to emit a vpslld ymm0, ymm1, 3 to scale the index register before running vpgatherdd.

So you can consult the asm instruction's documentation (which sometimes has different pseudo-code, like in this case): https://www.felixcloutier.com/x86/vpgatherdd:vpgatherqd

...
    DATA_ADDR←BASE_ADDR + SignExtend(VINDEX1[i+31:i])*SCALE + DISP;
    IF MASK[31+i] THEN
        DEST[i +31:i]←FETCH_32BITS(DATA_ADDR); // a fault exits the instruction
    FI;
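The same loop body can be sketched in C with the mask and displacement included. The name vpgatherdd_model and its parameters are hypothetical. Clearing each mask element after its lane is handled is not part of the excerpt above; it reflects the SDM's description of the full instruction, and fault handling is not modeled at all.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of the per-element step: only lanes whose mask element has
 * its top bit set are loaded; other lanes of dest keep their old value. */
void vpgatherdd_model(int32_t *dest, int32_t *mask, const void *base_addr,
                      const int32_t *vindex, int lanes, int scale,
                      int32_t disp)
{
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);

    for (int j = 0; j < lanes; j++) {
        if (mask[j] < 0) { /* MASK[31+i]: the sign bit of the element */
            /* DATA_ADDR = BASE_ADDR + SignExtend(VINDEX)*SCALE + DISP, in bytes */
            int64_t off = (int64_t)vindex[j] * scale + disp;
            memcpy(&dest[j], (const unsigned char *)base_addr + off,
                   sizeof dest[j]);
        }
        mask[j] = 0; /* assumption beyond the excerpt: mask is cleared per lane */
    }
}
```

Nothing here multiplies by 8 beyond what scale itself provides, which matches the asm documentation rather than the intrinsics guide's current wording.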

Yeah, I had the same thought on bit vs byte address, but as you pointed out it makes no sense with respect to the base address. It's a bit of a cock up on such a high profile document. Thanks! — Henry Gomersall, Mar 26 '21 at 19:32
@HenryGomersall: You can submit a bug report about the docs on Intel's forums, I think there's a section of it for reports about stuff like that (and about their compilers). IDK if there's any more direct feedback link for the intrinsics guide. — Peter Cordes, Mar 26 '21 at 19:35
Clicking on the "?" next to the search bar provides this link: http://software.intel.com/en-us/forums/topic/363747 — chtz, Mar 26 '21 at 21:52
