How to do an indirect load (gather-scatter) in AVX or SSE instructions?

Question

I've been searching for a while now, but can't seem to find anything useful in the documentation or on SO. This question didn't really help me out, since it makes references to modifying the assembly and I am writing in C.

I have some code making indirect accesses that I want to vectorize.

for (i = 0; i < LENGTH; ++i) {
   foo[bar[i]] *= 2;
}

Since I have the indices I want to double inside bar, I was wondering if there was a way to load those indices of foo into a vector register and then I could apply my math and store it back to the same indices.

Something like the following. The load and store instructions I just made up because I couldn't find anything like them in the AVX or SSE documentation. I think I read somewhere that AVX2 has similar functions, but the processor I'm working with doesn't support AVX2.

for (i = 0; i < LENGTH; i += 8) {
   // For simplicity, I'm leaving out any pointer type casting
   __m256 ymm0 = _mm256_load_indirect(bar+i);
   __m256 ymm1 = _mm256_set1_ps(2.0f); // Set up vector of just 2's
   __m256 ymm2 = _mm256_mul_ps(ymm0, ymm1);
   _mm256_store_indirect(ymm2, bar+i);
}

Are there any instructions in AVX or SSE that will allow me to load a vector register with an array of indices from a different array? Or any "hacky" ways around it if there isn't an explicit function?

You can't get blood out of a stone - there is some support for gathered loads in AVX2 and AVX-512 but if you only have SSE/AVX then you're simply out of luck. — Paul R, May 01 '16 at 21:02
It's only worth gathering and scattering if you have multiple vector operations once you have your vector gathered. Look in Intel's intrinsics guide for "gather". (Links in the [x86 tag wiki](http://stackoverflow.com/tags/x86/info).) Other than `vgatherdps`, there are insns like `insertps` to "manually" gather/scatter. You said something about "modifying the asm", but every asm vector instruction has an intrinsic you can use in C. — Peter Cordes, May 02 '16 at 02:07
@PaulR, you looked into gather with Haswell. You have done this with Skylake? Agner said that Intel improved gather on Broadwell and again on Skylake? Perhaps it's finally useful on Skylake? — Z boson, May 02 '16 at 06:54
@Zboson: no, I don't have anything with a Skylake CPU to test this on (I'm holding out for the Skylake Xeons next year). — Paul R, May 02 '16 at 07:02
Yeah, I guess you guys are right. Is it possible to close a question for having no answer? — The Unknown Dev, May 05 '16 at 23:41
For the record, it's covered in https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf chapter 11.5 — Trass3r, Nov 16 '18 at 13:57
Worth considering turning the list of indices in `bar` into a mask, and looping over `foo` using the mask to leave some unmodified, some `x+=x`. If the selected elements are dense enough, anyway. Like an average of at least 1 per vector. — Peter Cordes, Oct 11 '19 at 15:43

Pibben · Answer 1 · 2020-03-09T11:27:28.407

(I'm writing an answer to this old question as I think it may help others.)

Short answer

No. There are no scatter/gather instructions in the SSE and AVX instruction sets.

Longer answer

Scatter/gather instructions are expensive to implement (in terms of complexity and silicon real estate) because the scatter/gather mechanism needs to be deeply intertwined with the cache memory controller. I believe this is the reason that this functionality was missing from SSE/AVX.
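Without hardware support, the gather and the scatter have to be done one element at a time around the vector arithmetic. The following is only a sketch of that workaround using plain SSE, assuming `foo` holds `float` values and that the four indices in each group are distinct. The function name `double_indirect_sse` is made up for illustration.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Doubles foo[bar[i]] for i in [0, length), four elements at a time.
   Assumption: the four indices in each group are distinct. A repeated
   index within a group would be doubled only once, unlike in the
   scalar loop from the question. */
static void double_indirect_sse(float *foo, const int *bar, size_t length)
{
    const __m128 two = _mm_set1_ps(2.0f);
    size_t i = 0;

    for (; i + 4 <= length; i += 4) {
        float tmp[4];

        /* "Gather": four scalar loads packed into one register.
           _mm_set_ps takes its arguments from the highest lane down. */
        __m128 v = _mm_set_ps(foo[bar[i + 3]], foo[bar[i + 2]],
                              foo[bar[i + 1]], foo[bar[i]]);

        v = _mm_mul_ps(v, two);

        /* "Scatter": spill to a small buffer, then four scalar stores. */
        _mm_storeu_ps(tmp, v);
        foo[bar[i]]     = tmp[0];
        foo[bar[i + 1]] = tmp[1];
        foo[bar[i + 2]] = tmp[2];
        foo[bar[i + 3]] = tmp[3];
    }

    /* Scalar tail for the remaining 0..3 elements. */
    for (; i < length; ++i)
        foo[bar[i]] *= 2.0f;
}
```

With a single multiply between the loads and the stores, this is unlikely to beat the scalar loop. It only starts to make sense when several vector operations are applied to the gathered data.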

For newer instruction sets the situation is different. In AVX2 you have

VGATHERDPD, VGATHERDPS, VGATHERQPD, VGATHERQPS for floating point gather (intrinsics such as `_mm256_i32gather_ps`)
VPGATHERDD, VPGATHERQD, VPGATHERDQ, VPGATHERQQ for integer gather (intrinsics such as `_mm256_i32gather_epi32`)
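As a rough illustration of the AVX2 gather, here is a sketch of the loop from the question using `_mm256_i32gather_ps`. Since AVX2 has gather but no scatter, the store back is still done with scalar writes. It assumes `foo` holds `float` values, that the eight indices in each group are distinct, and that the compiler is GCC or Clang (the `target` attribute is a compiler extension used here in place of `-mavx2`). The function name is made up.

```c
#include <stddef.h>
#include <immintrin.h>  /* AVX2 intrinsics */

/* Doubles foo[bar[i]] for i in [0, length), eight elements at a time.
   Assumption: the eight indices in each group are distinct. */
__attribute__((target("avx2")))
static void double_indirect_avx2(float *foo, const int *bar, size_t length)
{
    const __m256 two = _mm256_set1_ps(2.0f);
    size_t i = 0;

    for (; i + 8 <= length; i += 8) {
        float tmp[8];

        /* Load eight 32-bit indices, then gather foo[idx[k]].
           The scale of 4 is sizeof(float): indices count elements. */
        __m256i idx = _mm256_loadu_si256((const __m256i *)(bar + i));
        __m256 v = _mm256_i32gather_ps(foo, idx, 4);

        v = _mm256_mul_ps(v, two);

        /* No scatter in AVX2: write the results back one by one. */
        _mm256_storeu_ps(tmp, v);
        for (size_t k = 0; k < 8; ++k)
            foo[bar[i + k]] = tmp[k];
    }

    /* Scalar tail for the remaining 0..7 elements. */
    for (; i < length; ++i)
        foo[bar[i]] *= 2.0f;
}
```

The caller is responsible for checking that the CPU supports AVX2 before calling this, for example with `__builtin_cpu_supports("avx2")` on GCC or Clang.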

In AVX-512 we got

VSCATTERDPD, VSCATTERDPS, VSCATTERQPD, VSCATTERQPS for floating point scatter (intrinsics such as `_mm512_i32scatter_ps`)
VPSCATTERDD, VPSCATTERQD, VPSCATTERDQ, VPSCATTERQQ for integer scatter (intrinsics such as `_mm512_i32scatter_epi32`)
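With AVX-512 both halves of the loop from the question can stay in vector registers. The following is a sketch only, assuming `foo` holds `float` values, that the sixteen indices in each group are distinct, and that the compiler is GCC or Clang (the `target` attribute stands in for `-mavx512f`). The function name is made up.

```c
#include <stddef.h>
#include <immintrin.h>  /* AVX-512F intrinsics */

/* Doubles foo[bar[i]] for i in [0, length), sixteen elements at a time.
   Assumption: the sixteen indices in each group are distinct. */
__attribute__((target("avx512f")))
static void double_indirect_avx512(float *foo, const int *bar, size_t length)
{
    const __m512 two = _mm512_set1_ps(2.0f);
    size_t i = 0;

    for (; i + 16 <= length; i += 16) {
        __m512i idx = _mm512_loadu_si512(bar + i);

        /* Gather: the AVX-512 intrinsic takes the index vector first.
           The scale of 4 is sizeof(float). */
        __m512 v = _mm512_i32gather_ps(idx, foo, 4);

        v = _mm512_mul_ps(v, two);

        /* Scatter the results back to the same indices. */
        _mm512_i32scatter_ps(foo, idx, v, 4);
    }

    /* Scalar tail for the remaining 0..15 elements. */
    for (; i < length; ++i)
        foo[bar[i]] *= 2.0f;
}
```

As with any code built this way, the caller has to confirm AVX-512F support at run time before calling it, for example with `__builtin_cpu_supports("avx512f")`.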

However, it is still a question whether using scatter/gather for such a simple operation would actually pay off.
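One way to sidestep gather/scatter entirely, along the lines of the mask idea mentioned in the comments on the question, is to precompute what happens to each element of `foo` and then make one contiguous pass over it. The sketch below is a variation on that idea rather than the exact suggestion: instead of a bit mask it builds a per-element factor array (1 for untouched elements, 2 for selected ones), which also keeps the scalar loop's behaviour for repeated indices. It assumes `float` data and needs only SSE. The function name and signature are made up.

```c
#include <stddef.h>
#include <stdlib.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Applies foo[bar[i]] *= 2 for all i by building a factor per element
   of foo and multiplying contiguously. Returns 0 on success, -1 if the
   temporary array cannot be allocated. */
static int double_by_factor_sse(float *foo, size_t foo_len,
                                const int *bar, size_t bar_len)
{
    float *factor = malloc((foo_len ? foo_len : 1) * sizeof *factor);
    size_t j;

    if (factor == NULL)
        return -1;

    /* Every element starts unmodified. */
    for (j = 0; j < foo_len; ++j)
        factor[j] = 1.0f;

    /* A repeated index doubles the factor again, as the scalar loop would. */
    for (j = 0; j < bar_len; ++j)
        factor[bar[j]] *= 2.0f;

    /* Contiguous vector pass over all of foo. */
    for (j = 0; j + 4 <= foo_len; j += 4) {
        __m128 x = _mm_loadu_ps(foo + j);
        __m128 f = _mm_loadu_ps(factor + j);
        _mm_storeu_ps(foo + j, _mm_mul_ps(x, f));
    }

    /* Scalar tail for the remaining 0..3 elements. */
    for (; j < foo_len; ++j)
        foo[j] *= factor[j];

    free(factor);
    return 0;
}
```

This touches every element of `foo`, so it is only attractive when the selected indices are reasonably dense and `foo` is not much larger than `bar`.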

For pure gather (read-only, without the scatter at the end), `vpgatherXX` is often worth it on Skylake and newer. See [In what situation would the AVX2 gather instructions be faster than individually loading the data?](https://stackoverflow.com/q/24756534) — Peter Cordes, Mar 20 '23 at 09:53
