26

Suppose I'm using AVX2's VGATHERDPS - this should load 8 single-precision floats using 8 DWORD indices.

What happens when the data to be loaded exists in different cache-lines? Is the instruction implemented as a hardware loop which fetches cache-lines one by one? Or, can it issue a load to multiple cache-lines at once?

I read a couple of papers which state the former (and that's the one which makes more sense to me), but I would like to know a bit more about this.

Link to one paper: http://arxiv.org/pdf/1401.7494.pdf
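
For reference, the intrinsic that maps to VGATHERDPS is _mm256_i32gather_ps. A minimal sketch (the function and variable names here are mine, just for illustration):

#include <immintrin.h>

// Gather 8 floats from src at 8 DWORD indices; the scale of 4 turns the
// indices into byte offsets. Compiles to VGATHERDPS under -mavx2.
__m256 gather8(const float *src, const int *indices) {
    __m256i vindex = _mm256_loadu_si256((const __m256i *)indices); // 8 DWORD indices
    return _mm256_i32gather_ps(src, vindex, 4);                    // 8 element loads
}
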

– Anuj Kalia (edited by Paul R)

2 Answers

19

I did some benchmarking of the AVX gather instructions (on a Haswell CPU) and it seems to be a fairly simple brute-force implementation: even when the elements to be loaded are contiguous, there still seems to be one read cycle per element, so performance is really no better than just doing scalar loads.
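
To put that in context, the comparison here is essentially between the hardware gather and its scalar equivalent. A rough sketch of the two forms, with names of my own (not from the benchmark):

#include <immintrin.h>

// Gather form: one vgatherdpd, but on Haswell the hardware still
// appears to perform one element access per lane.
static __m256d load4_gather(const double *base, const int *idx) {
    __m128i vidx = _mm_loadu_si128((const __m128i *)idx);  // 4 DWORD indices
    return _mm256_i32gather_pd(base, vidx, 8);
}

// Scalar form: four explicit element loads combined into a vector.
static __m256d load4_scalar(const double *base, const int *idx) {
    return _mm256_set_pd(base[idx[3]], base[idx[2]], base[idx[1]], base[idx[0]]);
}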

NB: this answer is now obsolete as things have changed considerably since Haswell. See the accepted answer for full details (unless you happen to be targeting Haswell CPUs).

– Paul R
  • Looking at Agner's tables, it's like 20+ uops. So yes, I wouldn't call that native support. It'd be interesting to see what Skylake does. Probably closer to what GPUs do? (# of cycles = # of bank conflicts) – Mysticial Feb 14 '14 at 17:03
  • In Agner Fog's Vectorclass he has a function called `lookup` which can be used to gather non-contiguous data from an array or for a table lookup, and he claims the efficiency is good in AVX2 and only medium otherwise. Based on that, gathering is better with Haswell (or scalar loads are better). But I have not done any benchmarking, I'm just going by his claim. – Z boson Feb 14 '14 at 19:57
  • Well I guess you're issuing fewer instructions to do the same amount of work when you use a gathered load, but the number of read cycles appears to be the same - perhaps it helps the instruction mix though, depending on what you're actually doing with the data once you've loaded it. – Paul R Feb 14 '14 at 22:36
  • @PaulR, maybe the gather is useful when the data is in the same cache line? Perhaps this is useful to convert SoA to AoS without having to do a transpose (assuming the struct fits in a cache line). – Z boson Feb 15 '14 at 20:30
  • I've tested it with contiguous data in the same cache line and haven't seen any benefit - the only gain seems to be that you don't need to do scalar loads and then combine these into a vector. – Paul R Feb 16 '14 at 07:10
  • @PaulR, why did Intel implement these gather instructions then? – Z boson Feb 18 '14 at 09:39
  • Well the gathered loads are still *useful* for the general case where you have a large or pseudo-random stride. They just don't have a particularly efficient interface to the cache/memory hierarchy. – Paul R Feb 18 '14 at 16:49
  • @PaulR: maybe for future optimization in upcoming CPUs. See for instance how much unaligned load/store has been optimized since its introduction with SSE1, where it had virtually no benefit. – galinette Apr 20 '15 at 12:33
  • I recently had to do something that required a true gather-load (i.e. `data[index[i]]`). On Haswell, 4 index loads + 2x `movsd` + 2x `movhpd` + `vinsertf128` (sketched in code after these comments) is *still* significantly faster than a ymm load + `vgatherqpd`. So even in the best case scenario, a 4-way gather still loses. I haven't tried 8-way gather though. – Mysticial Dec 09 '15 at 22:22
  • On the other hand, I have a new laptop that has a Skylake chip in it. I found a list of Skylake instruction latencies/throughputs, but they lack the gather instructions. When I get the time, I'll try to test it. It might serve as a precursor to what the AVX512 gather/scatter performance will be like. There's some pretty strong evidence that the SIMD unit on the desktop Skylake really is just half the width of the AVX512 versions (everything else being the same). So whatever we see on the current Skylakes will probably be very similar, if not the same, as the future ones with AVX512. – Mysticial Dec 09 '15 at 22:27
  • As of Knights Landing AVX512, gather/scatters are still broken up into uops. Gathers run at 2 lanes/cycle and scatters at 1 lane/cycle, precisely matching the 2-load/1-store port architecture. It looks like Skylake is the same. So the improvement over the previous generation is the elimination of all the overhead ops, leaving behind just the raw memory accesses. – Mysticial Sep 14 '16 at 15:16
  • I took a quick look at how GPUs handle gather/scatters and it appears they can't do them as efficiently as normal load/stores either. TBH, from a hardware design POV, I don't see how it's possible to make gather/scatters truly "fast" without adding a load/store port for every lane. The area cost for a "port" to an L1 cache of N bytes is likely to be `O(N)` since you need to multiplex on every byte. Multiplying that by the # of SIMD lanes is prohibitive. And that's probably why there are only 2 read ports and 1 write port. And this is putting aside the complexity of resolving scatter conflicts. – Mysticial Sep 14 '16 at 15:20
  • Yes, it would probably take a lot of extra silicon to implement efficient scatter/gather (especially if it's going to do intelligent things like coalescing loads/stores for adjacent addresses - although nVidia GPUs do this to some extent, so it's not beyond the realm of possibility). – Paul R Sep 14 '16 at 15:36
  • Knights Corner did in-cache-line coalescing. So multiple accesses that fall on the same cacheline would come in as one access via the 64-byte wide port for AVX512. Then within the load/store unit, it would do the necessary shuffling/merging to make it happen. Apparently they took it out of KNL. It was probably too complicated or something. If we forbid lane-conflicts to execute on the same port on the same cycle, then I *believe* it is possible to reuse some of the per-port muxing logic for multiple lanes on the same cycle. But that also requires conflict-detection logic. (which AVX512 has...) – Mysticial Sep 14 '16 at 15:46
  • Yes, the conflict-detection logic in KNL looks like it might be good for things like histogramming - I haven't had a chance to play with it yet though. – Paul R Sep 14 '16 at 15:56
  • Gather instructions in AVX2 allow masking out some indices. Would you mind adding some information on that (do they still take time/ops)? Btw, from my experience gather has a net benefit in practical code even on Haswell (at least my code was ~10% faster when switching). – Christoph Diegelmann May 24 '17 at 08:58
  • @Christoph: I haven't really played with the masked gather capability on AVX2. You may well find that masked loads are useful in some contexts, since they may save messy transitions to/from scalar code. – Paul R May 24 '17 at 09:11
  • @Mysticial, "gather/scatters are still broken up into uops. Gathers run at 2 lanes/cycle and scatters at 1 lane/cycle.". Is this based on empirical observation or documented by Intel? – Fabio Jun 01 '17 at 10:04
  • @Mysticial, thank you. One more question: does the term "lanes" refer to "cache lines"? – Fabio Jun 02 '17 at 08:08
  • @Fabio SIMD lanes. – Mysticial Jun 02 '17 at 15:48
  • @Mysticial AMD/Nvidia GPUs spend one cycle for each retrieved cache line when loading from L1. The retrieved data is then just shuffled around to fit into the register. (One load cycle is usually 2-4 basic cycles, e.g. Maxwell GPUs are documented as having 8 load units per 32 ALU units). Loads from shared memory are even more efficient - they spend multiple cycles only when loading multiple different values from the same bank. It's well documented in various tutorials and videos. Moreover, both vendors' GPUs support shuffle commands reusing LD/ST hardware to just shuffle register contents. – Bulat Aug 11 '18 at 02:19
  • @Mysticial: KNC's version of proto-AVX512 had some shuffles built-in to register (and memory?) source operands. (Kind of like how ARM / AArch64 has a barrel-shifter option for register source operands). Maybe this is the same shuffle hardware that gathers could use to grab multiple elements from one cache line on KNC? Interesting (but unfortunately no longer of much relevance). – Peter Cordes Jun 19 '19 at 16:53
  • In Broadwell and later, masking out some gathered elements (i.e., gathering fewer than the full number of elements by using a mask register with some zero elements) doesn't speed things up for L1 hits - the same amount of work is done. For L1 misses it could obviously be faster, since the requests won't be made to the outer levels of the memory hierarchy. – BeeOnRope Aug 24 '19 at 06:15
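
To make the manual-gather sequence from Mysticial's comment above concrete, here is a sketch in intrinsics (my own naming, not code from the thread): four 64-bit index loads, two movsd, two movhpd, and a vinsertf128, versus a ymm index load plus a single vgatherqpd.

#include <immintrin.h>

// Manual 4-element gather: result = { data[index[0]], ..., data[index[3]] }
static __m256d gather4_manual(const double *data, const long long *index) {
    __m128d lo = _mm_load_sd(&data[index[0]]);             // movsd  (lane 0)
    lo = _mm_loadh_pd(lo, &data[index[1]]);                // movhpd (lane 1)
    __m128d hi = _mm_load_sd(&data[index[2]]);             // movsd  (lane 2)
    hi = _mm_loadh_pd(hi, &data[index[3]]);                // movhpd (lane 3)
    return _mm256_insertf128_pd(_mm256_castpd128_pd256(lo), hi, 1);  // vinsertf128
}

// Hardware gather equivalent (vgatherqpd with 64-bit indices).
static __m256d gather4_hw(const double *data, const long long *index) {
    __m256i vidx = _mm256_loadu_si256((const __m256i *)index);  // ymm index load
    return _mm256_i64gather_pd(data, vidx, 8);
}
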
12

Gather was first implemented with Haswell but was not optimized until Broadwell (the first generation after Haswell).

I wrote my own code to test gather (see below). Here is a summary of the results on Skylake, Skylake-X (with a dedicated AVX512 port), and KNL (Knights Landing) systems; times are in seconds, lower is better.

                 scalar    auto   AVX2   AVX512
Skylake GCC        0.47    0.38   0.38       NA
SkylakeX GCC       0.56    0.23   0.35     0.24
KNL GCC            3.95    1.37   2.11     1.16
KNL ICC            3.92    1.17   2.31     1.17

From the table it's clear that in all cases gather loads are faster than scalar loads (for the benchmark I used).

I'm not sure how Intel implements gather internally. The masks don't seem to have an effect on performance for gather. That's one thing Intel could optimize: if the mask means only one scalar value needs to be read, it should be faster than gathering all the values and then applying the mask.
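
For reference, this is what the masked form looks like in AVX2 intrinsics (a sketch with my own names): lanes whose mask element has the sign bit clear are not loaded and keep the corresponding value from src.

#include <immintrin.h>

// Masked 4-element gather (vgatherdpd with a mask register operand).
static __m256d gather4_masked(const double *base, const int *idx,
                              __m256d src, __m256d mask) {
    __m128i vidx = _mm_loadu_si128((const __m128i *)idx);  // 4 DWORD indices
    return _mm256_mask_i32gather_pd(src, base, vidx, mask, 8);
}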

The Intel manual shows some nice figures on gather

https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
DCU = L1 data cache unit. MCU = mid-level cache = L2. LLC = last-level cache = L3. L3 is shared; L2 and L1d are per-core private.
Intel is just benchmarking gathers, not using the result for anything.

[Two figures from the Intel optimization manual showing gather performance with the data in the DCU (L1d), MCU (L2), and LLC (L3).]

//gather.c
#include <stdio.h>
#include <omp.h>
#include <stdlib.h>

#define N 1024
#define R 1000000

void foo_auto(double * restrict a, double * restrict b, int *idx, int n);
void foo_AVX2(double * restrict a, double * restrict b, int *idx, int n);
void foo_AVX512(double * restrict a, double * restrict b, int *idx, int n);
void foo1(double * restrict a, double * restrict b, int *idx, int n);
void foo2(double * restrict a, double * restrict b, int *idx, int n);
void foo3(double * restrict a, double * restrict b, int *idx, int n);


// Times R calls of fp after one warm-up call; returns elapsed wall time in seconds.
double test(int *idx, void (*fp)(double * restrict a, double * restrict b, int *idx, int n)) {
  double a[N];
  double b[N];
  double dtime;

  for(int i=0; i<N; i++) a[i] = 1.0*N;
  for(int i=0; i<N; i++) b[i] = 1.0;
  fp(a, b, idx, N);
  dtime = -omp_get_wtime();
  for(int i=0; i<R; i++) fp(a, b, idx, N);
  dtime += omp_get_wtime();
  return dtime;
}

int main(void) {

  //for(int i=0; i<N; i++) idx[i] = N - i - 1;
  //for(int i=0; i<N; i++) idx[i] = i;
  //for(int i=0; i<N; i++) idx[i] = rand()%N;

  //for(int i=0; i<R; i++) foo2(a, b, idx, N);
  int idx[N];
  double dtime;
  int ntests=2;
  void (*fp[4])(double * restrict a, double * restrict b, int *idx, int n);
  fp[0] = foo_auto;
  fp[1] = foo_AVX2;
#if defined ( __AVX512F__ ) || defined ( __AVX512__ )
  fp[2] = foo_AVX512;
  ntests=3;
#endif     

  // For each function, time four index patterns: all zeros, contiguous, reversed, random.
  for(int i=0; i<ntests; i++) { 
    for(int i=0; i<N; i++) idx[i] = 0;
    test(idx, fp[i]);
    dtime = test(idx, fp[i]);
    printf("%.2f      ", dtime);

    for(int i=0; i<N; i++) idx[i] = i;
    test(idx, fp[i]);
    dtime = test(idx, fp[i]);
    printf("%.2f      ", dtime);

    for(int i=0; i<N; i++) idx[i] = N-i-1;
    test(idx, fp[i]);
    dtime = test(idx, fp[i]);
    printf("%.2f      ", dtime);

    for(int i=0; i<N; i++) idx[i] = rand()%N;
    test(idx, fp[i]);
    dtime = test(idx, fp[i]);
    printf("%.2f\n", dtime);
  }

  for(int i=0; i<N; i++) idx[i] = 0;
  test(idx, foo1);
  dtime = test(idx, foo1);
  printf("%.2f      ", dtime);

  for(int i=0; i<N; i++) idx[i] = i;
  test(idx, foo2);
  dtime = test(idx, foo2);
  printf("%.2f      ", dtime);

  for(int i=0; i<N; i++) idx[i] = N-i-1;
  test(idx, foo3);
  dtime = test(idx, foo3);
  printf("%.2f      ", dtime);
  printf("NA\n");
}

//foo2.c
#include <x86intrin.h>
void foo_auto(double * restrict a, double * restrict b, int *idx, int n) {
  for(int i=0; i<n; i++) b[i] = a[idx[i]];
}

void foo_AVX2(double * restrict a, double * restrict b, int *idx, int n) {
  for(int i=0; i<n; i+=4) {
    __m128i vidx = _mm_loadu_si128((__m128i*)&idx[i]);
    __m256d av = _mm256_i32gather_pd(a, vidx, 8);  // b[i..i+3] = a[idx[i..i+3]]
    _mm256_storeu_pd(&b[i],av);
  }
}

#if defined ( __AVX512F__ ) || defined ( __AVX512__ )
void foo_AVX512(double * restrict a, double * restrict b, int *idx, int n) {
  for(int i=0; i<n; i+=8) {
    __m256i vidx = _mm256_loadu_si256((__m256i*)&idx[i]);
    __m512d av = _mm512_i32gather_pd(vidx, a, 8);  // b[i..i+7] = a[idx[i..i+7]]
    _mm512_storeu_pd(&b[i],av);
  }
}
#endif

// Scalar reference loops (no gather): broadcast, contiguous copy, reversed copy.
void foo1(double * restrict a, double * restrict b, int *idx, int n) {
  for(int i=0; i<n; i++) b[i] = a[0];
}

void foo2(double * restrict a, double * restrict b, int *idx, int n) {
  for(int i=0; i<n; i++) b[i] = a[i];
}

void foo3(double * restrict a, double * restrict b, int *idx, int n) {
  for(int i=0; i<n; i++) b[i] = a[n-i-1];
}
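
(The answer doesn't state the exact build flags. Something along the lines of `gcc -O3 -fopenmp -march=native gather.c foo2.c -o gather` - my assumption, not from the original - should reproduce the GCC runs, with `-march` adjusted for the target CPU; `-fopenmp` is needed for `omp_get_wtime`.)
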
– Z boson (edited by Peter Cordes)
  • wow [clang goes completely nuts with that code](https://godbolt.org/z/nd358W). What did you compile with for these tests? – Noah Jan 12 '21 at 08:11