
I have been investigating the use of the new gather instructions of the AVX2 instruction set. Specifically, I decided to benchmark a simple problem where one floating-point array is permuted and added to another. In C, this can be implemented as:

void vectortest(double * a, double * b, unsigned int * ind, unsigned int N)
{
    unsigned int i;
    for (i = 0; i < N; ++i)
    {
        a[i] += b[ind[i]];
    }
}

I compile this function with g++ -O3 -march=native. Now, I implement this in assembly in three ways. For simplicity, I assume that the array length N is divisible by four. The simple, non-vectorized implementation:

align 4
global vectortest_asm
vectortest_asm:
        ;; double * a = rdi
        ;; double * b = rsi
        ;; unsigned int * ind = rdx
        ;; unsigned int N = rcx

        push rax
        xor rax,rax

loop:   sub rcx, 1
        mov eax, [rdx+rcx*4]            ; eax = ind[rcx]
        vmovq xmm0, [rdi+rcx*8]         ; xmm0 = a[rcx]
        vaddsd xmm0, xmm0, [rsi+rax*8]  ; xmm0 += b[rax] (and b[rax] = b[eax] = b[ind[rcx]])
        vmovq [rdi+rcx*8], xmm0
        cmp rcx, 0
        jne loop

        pop rax

        ret

The loop vectorized without the gather instruction:

loop:   sub rcx, 4

        mov eax,[rdx+rcx*4]    ;first load the values from array b to xmm1-xmm4
        vmovq xmm1,[rsi+rax*8]
        mov eax,[rdx+rcx*4+4]
        vmovq xmm2,[rsi+rax*8]

        mov eax,[rdx+rcx*4+8]
        vmovq xmm3,[rsi+rax*8]
        mov eax,[rdx+rcx*4+12]
        vmovq xmm4,[rsi+rax*8]

        vmovlhps xmm1, xmm1, xmm2      ;now collect them all into ymm1
        vmovlhps xmm3, xmm3, xmm4
        vinsertf128 ymm1, ymm1, xmm3, 1

        vaddpd ymm1, ymm1, [rdi+rcx*8]
        vmovupd [rdi+rcx*8], ymm1

        cmp rcx, 0
        jne loop
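
For reference, the same manual-gather loop can also be expressed with AVX intrinsics; the following is only a sketch under the same assumption that N is divisible by four (vectortest_nogather is an illustrative name, not part of the benchmark):

#include <immintrin.h>

/* Illustrative intrinsics version of the loop above: four scalar loads
   combined into one ymm register, as with vmovq + vmovlhps + vinsertf128.
   Assumes N % 4 == 0. */
void vectortest_nogather(double * a, double * b, unsigned int * ind, unsigned int N)
{
    unsigned int i;
    for (i = 0; i < N; i += 4)
    {
        __m128d lo = _mm_set_pd(b[ind[i+1]], b[ind[i]]);   /* low half of ymm */
        __m128d hi = _mm_set_pd(b[ind[i+3]], b[ind[i+2]]); /* high half of ymm */
        __m256d v  = _mm256_insertf128_pd(_mm256_castpd128_pd256(lo), hi, 1);
        _mm256_storeu_pd(a + i, _mm256_add_pd(_mm256_loadu_pd(a + i), v));
    }
}

Compiled with the same g++ -O3 -march=native flags, this should produce an instruction sequence close to the hand-written loop.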

And finally, an implementation using vgatherdpd:

loop:   sub rcx, 4
        vmovdqu xmm2, [rdx+4*rcx]            ;load the offsets from array ind to xmm2
        vpcmpeqw ymm3, ymm3, ymm3            ;set ymm3 to all ones, since it acts as the mask in vgatherdpd
        vgatherdpd ymm1, [rsi+8*xmm2], ymm3  ;now gather the elements from array b to ymm1

        vaddpd ymm1, ymm1, [rdi+rcx*8]
        vmovupd [rdi+rcx*8], ymm1

        cmp rcx, 0
        jne loop
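
This gather loop maps directly onto the _mm256_i32gather_pd intrinsic; here is a minimal sketch, again assuming N is divisible by four (vectortest_gather is an illustrative name):

#include <immintrin.h>

/* Illustrative intrinsics version of the gather loop; assumes N % 4 == 0. */
void vectortest_gather(double * a, double * b, unsigned int * ind, unsigned int N)
{
    unsigned int i;
    for (i = 0; i < N; i += 4)
    {
        /* load four 32-bit offsets, then gather four doubles in one instruction */
        __m128i idx = _mm_loadu_si128((__m128i const *)(ind + i));
        __m256d v   = _mm256_i32gather_pd(b, idx, 8); /* scale 8 = sizeof(double) */
        _mm256_storeu_pd(a + i, _mm256_add_pd(_mm256_loadu_pd(a + i), v));
    }
}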

I benchmark these functions on a machine with a Haswell CPU (Xeon E3-1245 v3). Some typical results are (times in seconds):

Array length 100, function called 100000000 times.
Gcc version: 6.67439
Nonvectorized assembly implementation: 6.64713
Vectorized without gather: 4.88616
Vectorized with gather: 9.32949

Array length 1000, function called 10000000 times.
Gcc version: 5.48479
Nonvectorized assembly implementation: 5.56681
Vectorized without gather: 4.70103
Vectorized with gather: 8.94149

Array length 10000, function called 1000000 times.
Gcc version: 7.35433
Nonvectorized assembly implementation: 7.66528
Vectorized without gather: 7.92428
Vectorized with gather: 8.873

The GCC and non-vectorized assembly versions are very close to each other. (I also checked the assembly output of GCC, and it is quite similar to my hand-coded version.) The vectorization gives some benefit for small arrays but is slower for large arrays. The big surprise (to me at least) is that the version using vgatherdpd is so slow. So, my question is: why? Am I doing something stupid here? Can someone provide an example where the gather instruction would actually give a performance benefit over just doing multiple load operations? If not, what is the point of having such an instruction?

The test code, complete with a makefile for g++ and nasm, is available at https://github.com/vanhala/vectortest.git in case somebody wants to try this out.

infinitesimal
  • Well, it's not too surprising your hand-coded functions are faster, the C compiler has to produce *correct code*, after all. Your loops have no provision for array lengths that are not a multiple of the vectorization size, and don't even check if the count was zero... – EOF Jul 15 '14 at 19:30
  • @EOF Yes, but this is beside the point. The main point of this benchmark was to compare the efficiency of the gathered load instruction versus implementing the same thing using scalar loads. The compiler generated version was there just to make sure that the times are in the right ballpark, i.e. to check that I'm not doing anything completely stupid in the hand coded versions. – infinitesimal Jul 16 '14 at 07:22

2 Answers


Newer microarchitectures have shifted the odds towards gather instructions. On an Intel Xeon Gold 6138 CPU @ 2.00 GHz (Skylake microarchitecture), your benchmark gives:

9.383e+09 8.86e+08 2.777e+09 6.915e+09 7.793e+09 8.335e+09 5.386e+09 4.92e+08 6.649e+09 1.421e+09 2.362e+09 2.7e+07 8.69e+09 5.9e+07 7.763e+09 3.926e+09 5.4e+08 3.426e+09 9.172e+09 5.736e+09 
9.383e+09 8.86e+08 2.777e+09 6.915e+09 7.793e+09 8.335e+09 5.386e+09 4.92e+08 6.649e+09 1.421e+09 2.362e+09 2.7e+07 8.69e+09 5.9e+07 7.763e+09 3.926e+09 5.4e+08 3.426e+09 9.172e+09 5.736e+09 
9.383e+09 8.86e+08 2.777e+09 6.915e+09 7.793e+09 8.335e+09 5.386e+09 4.92e+08 6.649e+09 1.421e+09 2.362e+09 2.7e+07 8.69e+09 5.9e+07 7.763e+09 3.926e+09 5.4e+08 3.426e+09 9.172e+09 5.736e+09 
9.383e+09 8.86e+08 2.777e+09 6.915e+09 7.793e+09 8.335e+09 5.386e+09 4.92e+08 6.649e+09 1.421e+09 2.362e+09 2.7e+07 8.69e+09 5.9e+07 7.763e+09 3.926e+09 5.4e+08 3.426e+09 9.172e+09 5.736e+09 
Array length 10000, function called 1000000 times.
Gcc version: 6.32353
Nonvectorized assembly implementation: 6.36922
Vectorized without gather: 5.53553
Vectorized with gather: 4.50673

showing that gathers may now be well worth the effort.

fuz

Unfortunately the gathered load instructions are not particularly "smart" - they seem to generate one bus cycle per element, regardless of the load addresses, so even if you happen to have contiguous elements there is apparently no internal logic for coalescing the loads. So in terms of efficiency a gathered load is no better than N scalar loads, except that it uses only one instruction.

The only real benefit of the gather instructions is when you are implementing SIMD code anyway, and you need to load non-contiguous data to which you are then going to apply further SIMD operations. In that case a SIMD gathered load instruction will be a lot more efficient than a bunch of scalar code that would typically be generated by e.g. _mm256_set_xxx() (or a bunch of contiguous loads and permutes, etc, depending on the actual access pattern).
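
To make the contrast concrete, here is a sketch of the two approaches (load4_set and load4_gather are illustrative names): the first typically compiles to four separate scalar loads plus shuffles/inserts, the second to a single vgatherdpd plus mask setup.

#include <immintrin.h>

/* Four scalar loads stitched together; compilers usually expand this
   into separate loads plus inserts/shuffles. */
__m256d load4_set(double const * b, unsigned int const * ind)
{
    return _mm256_set_pd(b[ind[3]], b[ind[2]], b[ind[1]], b[ind[0]]);
}

/* The same four elements loaded with a single gather instruction. */
__m256d load4_gather(double const * b, unsigned int const * ind)
{
    __m128i idx = _mm_loadu_si128((__m128i const *)ind);
    return _mm256_i32gather_pd(b, idx, 8);
}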

Paul R
  • 3
    I'm not sure I understand your latter point. In the example above, I load non-contiguous data from array b, and then apply some SIMD-instructions to that data. In this case replacing the gather instruction by a bunch of scalar movs yields faster code. Can you provide a pointer to an actual benchmark or example where the gathered load would not be slower than multiple scalar loads? Or do you mean that it is just easier for the compiler to generate code using the gather instruction even if it is slower? – infinitesimal Jul 15 '14 at 12:41
  • And by "slower" in the last sentence of the comment above I mean "slower than a hand crafted loading code using scalar instructions". – infinitesimal Jul 15 '14 at 12:51
  • I was talking about the general case where you might use e.g. `_mm256_set_xxx` to load N scattered values, versus using a gathered load - typically the compiler will generate a lot of scalar code to do the former. Your example is somewhat different in that you've hand-coded some assembler for a specific use case. It also depends on the number of elements - I work mainly with 8 and 16 bit data (in AVX2 of course) where the problem of gathered loads is somewhat greater than with 32/64 bit elements. – Paul R Jul 15 '14 at 12:58
  • 3
    @infinitesimal Also, the gather can be conditional. Incidentally, you might want to try moving the `vpcmpeqw` out of the loop, storing the all 1 bits in a spare register that you just copy into the mask each time. – Jester Jul 15 '14 at 16:01
  • 2
    @Jester Yes, that additional logic might be one reason why it seems to be quite slow. Moving the vpcmpeqw makes no measurable difference in the execution times. – infinitesimal Jul 25 '14 at 20:32
  • 4
    How has this changed with newer microarchitectures? – fuz Jan 10 '19 at 22:23