13

I have a number of tight loops I'm trying to optimize with GCC and intrinsics. Consider for example the following function.

#include <x86intrin.h>

void triad(float *x, float *y, float *z, const int n) {
    float k = 3.14159f;
    int i;
    __m256 k4 = _mm256_set1_ps(k);
    for(i=0; i<n; i+=8) {
        _mm256_store_ps(&z[i], _mm256_add_ps(_mm256_load_ps(&x[i]), _mm256_mul_ps(k4, _mm256_load_ps(&y[i]))));
    }
}
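
(For reference, here is a minimal harness, with hypothetical initialization rather than my actual benchmark, that satisfies the function's implicit assumptions: the arrays are 32-byte aligned and n is a multiple of 8.)

#include <stdio.h>
#include <x86intrin.h>        /* _mm_malloc, _mm_free */

void triad(float *x, float *y, float *z, const int n);   /* the function above */

int main(void) {
    const int n = 2048;                               /* multiple of 8; fits in L1 */
    float *x = _mm_malloc(n * sizeof(float), 32);     /* 32-byte aligned for vmovaps */
    float *y = _mm_malloc(n * sizeof(float), 32);
    float *z = _mm_malloc(n * sizeof(float), 32);
    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }
    triad(x, y, z, n);
    printf("z[0] = %f\n", z[0]);                      /* expect 1 + 3.14159*2 */
    _mm_free(x); _mm_free(y); _mm_free(z);
    return 0;
}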

This produces a main loop like this

20: vmulps ymm0,ymm1,[rsi+rax*1]
25: vaddps ymm0,ymm0,[rdi+rax*1]
2a: vmovaps [rdx+rax*1],ymm0
2f: add    rax,0x20
33: cmp    rax,rcx
36: jne    20 

But the cmp instruction is unnecessary. Instead of having rax start at zero and finish at sizeof(float)*n, we can set the base pointers (rsi, rdi, and rdx) to the ends of the arrays, set rax to -sizeof(float)*n, and then test for zero. I am able to do this with my own assembly code like this

.L2  vmulps          ymm1, ymm2, [rdi+rax]
     vaddps          ymm0, ymm1, [rsi+rax]
     vmovaps         [rdx+rax], ymm0
     add             rax, 32
     jne             .L2

but I can't manage to get GCC to do this. I have several tests now where this makes a significant difference. Until recently GCC and intrinsics have served me well, so I'm wondering whether there is a compiler switch, or a way to reorder/change my code, so that the cmp instruction is not produced by GCC.

I tried the following, but it still produces cmp, as has every variation I have tried.

void triad2(float *x, float *y, float *z, const int n) {
    float k = 3.14159f;
    float *x2 = x+n;
    float *y2 = y+n;
    float *z2 = z+n;    
    int i;
    __m256 k4 = _mm256_set1_ps(k);
    for(i=-n; i<0; i+=8) {
        _mm256_store_ps(&z2[i], _mm256_add_ps(_mm256_load_ps(&x2[i]), _mm256_mul_ps(k4, _mm256_load_ps(&y2[i]))));
    }
}

Edit: I'm interested in maximizing instruction level parallelism (ILP) for these functions for arrays that fit in the L1 cache (actually for n=2048). Although unrolling can be used to improve the bandwidth, it can decrease the ILP (assuming the full bandwidth can be attained without unrolling).

Edit: Here is a table of results for a Core2 (pre-Nehalem), an Ivy Bridge, and a Haswell system. intrinsic is the result of using intrinsics, unroll1 is my assembly code without the cmp, and unroll16 is my assembly code unrolled 16 times. The percentages are percentages of peak performance (frequency*num_bytes_cycle, where num_bytes_cycle is 24 for SSE, 48 for AVX, and 96 for FMA).

                 SSE         AVX         FMA
intrinsic      71.3%       90.9%       53.6%      
unroll1        97.0%       96.1%       63.5%
unroll16       98.6%       90.4%       93.6%
ScottD         96.5%
32B code align             95.5%

For SSE I get almost as good a result without unrolling as with unrolling, but only if I don't use cmp. On AVX I get the best result without unrolling and without using cmp. It's interesting that on Ivy Bridge unrolling is actually worse. On Haswell I get by far the best result by unrolling, which is why I asked this question. The source code to test this can be found in that question.
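
To spell out the arithmetic behind those percentages, here is a sketch (not my actual test code; the function and its parameters are just for illustration):

/* Sketch only: converts a measured run time into percent of peak.
   num_bytes_cycle is 24 (SSE), 48 (AVX) or 96 (FMA) as above;
   each element of the triad moves 2 loads + 1 store = 12 bytes. */
double percent_of_peak(double time_s, int n, int repeat,
                       double freq_ghz, double num_bytes_cycle)
{
    double bytes            = 12.0 * n * repeat;                  /* total bytes moved */
    double peak_bytes_per_s = freq_ghz * 1e9 * num_bytes_cycle;   /* frequency*num_bytes_cycle */
    return 100.0 * (bytes / time_s) / peak_bytes_per_s;
}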

Edit:

Based on ScottD's answer I now get almost 97% with intrinsics for my Core2 system (pre-Nehalem, 64-bit mode). I'm not sure why the cmp actually matters, since the loop should take 2 clock cycles per iteration anyway. For Sandy Bridge it turns out the efficiency loss is due to code alignment, not to the extra cmp. On Haswell only unrolling works anyway.
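
For what it's worth, GCC has a switch for loop alignment; I have not verified that it recovers the Sandy Bridge loss, but something like this would be the obvious thing to try:

gcc -O3 -mavx -falign-loops=32 triad.c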

Z boson
  • Something tells me that you should probably be unrolling the loop more than you are now. – Mysticial Sep 18 '14 at 20:24
  • @Mysticial, I unroll 1, 4, 8, and 256 times (2048 elements) for many tests (copy, write, add, mult, triad, and more). The point is that in some cases the unrolling is unnecessary, but I have to remove the `cmp` to achieve that. – Z boson Sep 18 '14 at 20:55
  • If I were you, I'd take a closer look at what exactly these are being used for. Trying to optimize a single multiply-add pass is probably not going to get anywhere near optimal performance. If you're using this for something like matrix multiply, you should be combining multiple passes together so that you're doing a lot more work per memory access. – Mysticial Sep 18 '14 at 21:21
  • @Zboson: ah, yes, I see it now. I have no idea how to tell `gcc` to avoid the `cmp`. Clang replaced the `cmp` in your second one with a `tst`, but that's not much help. (Shouldn't the termination condition be `i < 0`?) – rici Sep 18 '14 at 21:32
  • Have you checked performance? I doubt you will be able to detect the difference between the two versions since the number of data accesses is the same. Accessing memory is almost always the performance bottleneck unless you have a very specialized use case. – Dwayne Towell Sep 18 '14 at 22:04
  • @Mysticial, these tests are not for matrix mult. These are tests to maximize ILP. They are somewhat related to your "How do I achieve the theoretical maximum of 4 FLOPs per cycle?" tests. Except I don't want to achieve only FLOPS; I also want bandwidth. But not only FLOPS and bandwidth. I want maximum ILP for these functions. Unrolling gets closer to maximum FLOPS and bandwidth but it cannot maximize ILP. I don't know if this will be useful later. It's mostly to help me understand what the hardware is capable of. – Z boson Sep 19 '14 at 07:21
  • @DwayneTowell, I already stated "I have several tests now where this makes a significant difference." These are for small arrays that fit in the L1 cache. The number of data accesses is the same but the GCC version needs more cycles to complete this. – Z boson Sep 19 '14 at 07:24
  • @rici, you're correct, it should be `i < 0`. I did not actually test the second version. I only compiled it and used objdump to look at the assembly. – Z boson Sep 19 '14 at 08:17
  • To be clear, I compared the performance of the first GCC version to the version I wrote in assembly (with NASM). – Z boson Sep 19 '14 at 08:34
  • @Mysticial, I edited my question with a table of results for SSE, AVX, and FMA with intrinsics and with and without unrolling. – Z boson Sep 19 '14 at 19:12
  • @Zboson What do the %'s mean? % of peak flops? (IOW, what are the cycles/iteration counts?) – Mysticial Sep 19 '14 at 19:22
  • @Mysticial, good point. I edited the question again. It's the percentage of the peak performance which is frequency*num_bytes_cycle where num_bytes_cycle is 24 for SSE, 48 for AVX and 96 for FMA. – Z boson Sep 19 '14 at 19:26
  • If you don't want a cmp instruction, loop backwards. That way, after the dec, you just jne. – cup Sep 19 '14 at 19:32
  • @cup, that's sorta what I did in version two. But I start at -n and loop up to zero. It still produced `cmp`. But if you come up with a solution that does not produce `cmp`, please post an answer! – Z boson Sep 19 '14 at 19:34
  • If you can write faster assembly than the compiler/intrinsics, then why don't you just use your inlined assembly? – Degustaf Sep 19 '14 at 19:53
  • @Degustaf, because GCC normally does a very good job of optimization and I'm not convinced yet that assembly is necessary. – Z boson Sep 19 '14 at 20:22
  • Just a heads-up, I've found a way to do it optimally in gcc without intrinsics (just builtins, which is *obviously* better, right?). – EOF Sep 23 '14 at 14:12
  • Not sure these answers address the issue. I've been wondering about this with simple zero/copy loops (counting down) with gcc-5.1. I still get the `sub / cmp / jcc` sequence, when `sub / jcc` would do. – Brett Hale Jun 12 '15 at 00:24
  • @BrettHale, you mean ScottD's answer below did not work for you? We should post this as a bug report for GCC (somebody else suggested I do that). I'm not sure it will have high priority. – Z boson Jun 12 '15 at 08:04

3 Answers

6

How about this? The compiler is gcc 4.9.0, mingw x64:

#include <stdint.h>      /* intptr_t */
#include <x86intrin.h>

void triad(float *x, float *y, float *z, const int n) {
    float k = 3.14159f;
    intptr_t i;
    __m256 k4 = _mm256_set1_ps(k);

    for(i = -n; i < 0; i += 8) {
        _mm256_store_ps(&z[i+n], _mm256_add_ps(_mm256_load_ps(&x[i+n]), _mm256_mul_ps(k4, _mm256_load_ps(&y[i+n]))));
    }
}

gcc -c -O3 -march=corei7 -mavx2 triad.c

0000000000000000 <triad>:
   0:   44 89 c8                mov    eax,r9d
   3:   f7 d8                   neg    eax
   5:   48 98                   cdqe
   7:   48 85 c0                test   rax,rax
   a:   79 31                   jns    3d <triad+0x3d>
   c:   c5 fc 28 0d 00 00 00 00 vmovaps ymm1,YMMWORD PTR [rip+0x0]
  14:   4d 63 c9                movsxd r9,r9d
  17:   49 c1 e1 02             shl    r9,0x2
  1b:   4c 01 ca                add    rdx,r9
  1e:   4c 01 c9                add    rcx,r9
  21:   4d 01 c8                add    r8,r9

  24:   c5 f4 59 04 82          vmulps ymm0,ymm1,YMMWORD PTR [rdx+rax*4]
  29:   c5 fc 58 04 81          vaddps ymm0,ymm0,YMMWORD PTR [rcx+rax*4]
  2e:   c4 c1 7c 29 04 80       vmovaps YMMWORD PTR [r8+rax*4],ymm0
  34:   48 83 c0 08             add    rax,0x8
  38:   78 ea                   js     24 <triad+0x24>

  3a:   c5 f8 77                vzeroupper
  3d:   c3                      ret

Like your hand-written code, gcc uses 5 instructions for the loop. The gcc code uses scale=4 where yours uses scale=1. I was able to get gcc to use scale=1 with a 5-instruction loop, but the C code is awkward and 2 of the AVX instructions in the loop grow from 5 bytes to 6 bytes.
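
For illustration only, here is a sketch of one way to index in bytes so that scale=1 addressing becomes natural (not necessarily the awkward variant mentioned above, and gcc may still choose different addressing):

#include <stdint.h>
#include <x86intrin.h>

/* Sketch: index in bytes from the ends of the arrays; hypothetical variant. */
void triad_bytes(float *x, float *y, float *z, const int n) {
    __m256 k4 = _mm256_set1_ps(3.14159f);
    intptr_t nb = (intptr_t)n * sizeof(float);   /* array length in bytes */
    char *xe = (char *)x + nb;                   /* one past the end */
    char *ye = (char *)y + nb;
    char *ze = (char *)z + nb;
    for (intptr_t i = -nb; i < 0; i += 32) {     /* 32 bytes = 8 floats */
        __m256 a = _mm256_load_ps((float *)(xe + i));
        __m256 b = _mm256_load_ps((float *)(ye + i));
        _mm256_store_ps((float *)(ze + i), _mm256_add_ps(a, _mm256_mul_ps(k4, b)));
    }
}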

3

Final code:

#include <stddef.h>          /* size_t */

#define SF sizeof(float)
#ifndef NO                   //floats per vector, compile with -DNO = 1,2,4,8,...
#define NO 8                 //MUST be power of two
#endif

void triadfinaler(float const *restrict x, float const *restrict y,   \
                  float *restrict z, size_t n)
{
  float *restrict d = __builtin_assume_aligned(z, NO*SF);       //gcc builtin,
  float const *restrict m = __builtin_assume_aligned(y, NO*SF); //optional but produces
  float const *restrict a = __builtin_assume_aligned(x, NO*SF); //better code
  float const k = 3.14159f;
  n*=SF;
  while (n &= ~((size_t)(NO*SF)-1))    //this is why NO*SF must be power of two
    {
      size_t nl = n/SF;
      for (size_t i = 0; i<NO; i++)
        {
          d[nl-NO+i] = k * m[nl-NO+i] + a[nl-NO+i];
        }
      n -= (NO*SF);
    }
}
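
For example, one way to build it (the file name and the AVX2/FMA flags are just an illustration):

gcc -O3 -mavx2 -mfma -DNO=8 -c triadfinaler.c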

I prefer to let the compiler choose the instructions, rather than using intrinsics (not least because you used Intel intrinsics, which gcc doesn't really like). Anyway, the following code produces nice assembly for me on gcc 4.8:

void triad(float *restrict x, float *restrict y, float *restrict z, size_t n)
//I hope you weren't aliasing any function arguments... Oh, and it's void, not float
{
  float *restrict d = __builtin_assume_aligned(z, 32);  // Uh, make sure your arrays
  float *restrict m = __builtin_assume_aligned(y, 32);  // are aligned? Faster that way
  float *restrict a = __builtin_assume_aligned(x, 32);  //
  float const k = 3.14159f;
  while (n &= ~((size_t)0x7))       //black magic, causes gcc to omit code for non-multiples of 8 floats
    {
      n -= 8;                       //You were always computing on 8 floats at a time, right?
      d[n+0] = k * m[n+0] + a[n+0]; //manual unrolling
      d[n+1] = k * m[n+1] + a[n+1];
      d[n+2] = k * m[n+2] + a[n+2];
      d[n+3] = k * m[n+3] + a[n+3];
      d[n+4] = k * m[n+4] + a[n+4];
      d[n+5] = k * m[n+5] + a[n+5];
      d[n+6] = k * m[n+6] + a[n+6];
      d[n+7] = k * m[n+7] + a[n+7];
    }
}

This produces nice code for my corei7avx2, with -O3:

triad:
    andq    $-8, %rcx
    je  .L8
    vmovaps .LC0(%rip), %ymm1

.L4:
    subq    $8, %rcx
    vmovaps (%rsi,%rcx,4), %ymm0
    vfmadd213ps (%rdi,%rcx,4), %ymm1, %ymm0
    vmovaps %ymm0, (%rdx,%rcx,4)
    andq    $-8, %rcx
    jne .L4
    vzeroupper
.L8:
    rep ret
    .cfi_endproc

.LC0:
    .long   1078530000
    .long   1078530000
    .long   1078530000
    .long   1078530000
    .long   1078530000
    .long   1078530000
    .long   1078530000
    .long   1078530000

Edit: I was a bit disappointed that the compiler didn't optimize this code down to the last instruction, so I messed around with it a bit more. Just changing the order of things in the loop got rid of the AND emitted by the compiler, which got me on the right track. I then only had to get it to stop doing unnecessary address calculation in the loop. Sigh.

void triadtwo(float *restrict x, float *restrict y, float *restrict z, size_t n)
{
  float *restrict d = __builtin_assume_aligned(z, 32);
  float *restrict m = __builtin_assume_aligned(y, 32);
  float *restrict a = __builtin_assume_aligned(x, 32);
  float const k = 3.14159f;
  n<<=2;
  while (n &= -32)
    {
      d[(n>>2)-8] = k * m[(n>>2)-8] + a[(n>>2)-8];
      d[(n>>2)-7] = k * m[(n>>2)-7] + a[(n>>2)-7];
      d[(n>>2)-6] = k * m[(n>>2)-6] + a[(n>>2)-6];
      d[(n>>2)-5] = k * m[(n>>2)-5] + a[(n>>2)-5];
      d[(n>>2)-4] = k * m[(n>>2)-4] + a[(n>>2)-4];
      d[(n>>2)-3] = k * m[(n>>2)-3] + a[(n>>2)-3];
      d[(n>>2)-2] = k * m[(n>>2)-2] + a[(n>>2)-2];
      d[(n>>2)-1] = k * m[(n>>2)-1] + a[(n>>2)-1];
      n -= 32;
    }
}

Ugly code? Yes. But the assembly:

triadtwo:
    salq    $2, %rcx
    andq    $-32, %rcx
    je  .L54
    vmovaps .LC0(%rip), %ymm1

.L50:
    vmovaps -32(%rsi,%rcx), %ymm0
    vfmadd213ps -32(%rdi,%rcx), %ymm1, %ymm0
    vmovaps %ymm0, -32(%rdx,%rcx)
    subq    $32, %rcx
    jne .L50
    vzeroupper
.L54:
    rep ret
    .cfi_endproc
.LC0:
    .long   1078530000
    .long   1078530000
    .long   1078530000
    .long   1078530000
    .long   1078530000
    .long   1078530000
    .long   1078530000
    .long   1078530000

Mmmmhhh, glorious five instructions in the loop, macro-op fusable subtract-and-branch...

EOF
  • That's an interesting approach (+1). You succeeded in getting rid of `cmp` by replacing it with `and`. But I don't think that's any better. – Z boson Sep 20 '14 at 14:57
  • @Zboson: Yeah, the `AND` *shouldn't* be necessary, but gcc doesn't understand that when (n%8 == 0) it also follows that ((n-8)%8 == 0). Don't ask me why. n &= -8 works, and macro-op fusion should make it practically free. – EOF Sep 20 '14 at 19:45
  • Good call on the return. Yes, it should be void instead of float. It's a holdover from some reduction tests which were returning float. That's what I get for not using `-Wall`. I tried `n &= -8` and it still produces the `and`. Macro-op fusion would work on the `and` and the jump (but not on pre-SB processors), but not on the `sub`. It reduces the μops from 3 to 2. But without the `cmp` or `and` it's only 2 anyway. – Z boson Sep 21 '14 at 07:48
  • As to `restrict`, it's not necessary when using intrinsics. Neither is specifying the alignment. That's why I used neither. If you look at my intrinsic code you can see that it already assumes the arrays don't overlap and that the arrays are aligned. However, it is necessary to specify `restrict` and alignment when not using intrinsics. – Z boson Sep 21 '14 at 07:50
  • Awesome that you got this working without intrinsics. If it was possible to have two accepted answers this would be a perfect candidate for it. – Z boson Sep 23 '14 at 14:23
  • @Zboson: I just tested it without the alignment restriction. The only difference in the assembly is that gcc uses unaligned loads (and one more instruction/load to ymm register). So this could well be done in pure, portable C11. – EOF Sep 23 '14 at 14:25
  • I guess this is a better solution if you assume the SIMD width of your CPU is eight and want to be agnostic to the rest. I wonder what the assembly is with SSE which is four wide. – Z boson Sep 23 '14 at 14:27
  • Without the alignment restriction GCC won't fuse the load and the multiply, which means it needs more μops, so the alignment restriction is necessary for performance. – Z boson Sep 23 '14 at 14:30
  • I just put a preprocessor check for the GCC compiler around the alignment restriction. Then it works in C11 with or without GCC (but only produces the most efficient code with GCC). – Z boson Sep 23 '14 at 14:32
  • @Zboson: Yeah, it's obviously optimized for gcc /w AVX(2) in particular. SSE and the like produce reasonable code, but nothing close to AVX2. FMA is actually not all that important for this code, same number of instructions when using the alignment restrictions without it. Anyway, my point is that this provides a *reasonable* fallback when the vector unit is smaller, and can be used without alignment restrictions on other compilers, not that it will always be optimal. – EOF Sep 23 '14 at 14:39
2

The instruction decoder on Intel Ivy Bridge or later can fuse the cmp and jne into a single operation in the pipeline (called macro-op fusion), so on these recent processors the cmp should disappear anyway.

Jeremy
  • Yes, but they can't fuse the `add`, `cmp`, and `jne` instructions into "a single operation". That's the whole point! Before SB it was not possible to fuse `add` and `jne`, but since SB it is. Using `cmp` requires one more μop. – Z boson Sep 19 '14 at 10:51
  • And to be more precise, all Core2 processors can fuse `cmp` and `jne` in 32-bit mode, all processors since Nehalem can fuse those in 64-bit mode, and all of them since Sandy Bridge can fuse `add` and `jne`. However, there are several cases that can cause the fusion to fail. – Z boson Sep 19 '14 at 11:00