
I have this array:

alignas(16) double c[voiceSize][blockSize];

This is the function I'm trying to optimize:

inline void Process(int voiceIndex, int blockSize) {    
    double *pC = c[voiceIndex];
    double value = start + step * delta;
    double deltaValue = rate * delta;

    for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
        pC[sampleIndex] = value + deltaValue * sampleIndex;
    }
}

And this is my intrinsics (SSE2) attempt:

inline void Process(int voiceIndex, int blockSize) {    
    double *pC = c[voiceIndex];
    double value = start + step * delta;
    double deltaValue = rate * delta;

    __m128d value_add = _mm_set1_pd(value);
    __m128d deltaValue_mul = _mm_set1_pd(deltaValue);

    for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex += 2) {
        __m128d result_mul = _mm_setr_pd(sampleIndex, sampleIndex + 1);
        result_mul = _mm_mul_pd(result_mul, deltaValue_mul);
        result_mul = _mm_add_pd(result_mul, value_add);

        _mm_store_pd(pC + sampleIndex, result_mul);
    }   
}

This is slower than the original "scalar" code (even though that is merely auto-optimized by the compiler), unfortunately :)

Where's the bottleneck, in your opinion? Where am I going wrong?

I'm using MSVC, Release/x86, with the /O2 optimization flag (Favor fast code).

EDIT: with this change (suggested by @wim), performance seems to be better than the C version:

inline void Process(int voiceIndex, int blockSize) {    
    double *pC = c[voiceIndex];
    double value = start + step * delta;
    double deltaValue = rate * delta;

    __m128d value_add = _mm_set1_pd(value);
    __m128d deltaValue_mul = _mm_set1_pd(deltaValue);

    __m128d sampleIndex_acc = _mm_set_pd(-1.0, -2.0);
    __m128d sampleIndex_add = _mm_set1_pd(2.0);

    for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex += 2) {
        sampleIndex_acc = _mm_add_pd(sampleIndex_acc, sampleIndex_add);
        __m128d result_mul = _mm_mul_pd(sampleIndex_acc, deltaValue_mul);
        result_mul = _mm_add_pd(result_mul, value_add);

        _mm_store_pd(pC + sampleIndex, result_mul);
    }
}

Why? Is _mm_setr_pd expensive?

markzzz
  • what's the typical value for blockSize? – spectras Dec 18 '18 at 13:32
  • Did you align your array? [mcve] – Matthieu Brucher Dec 18 '18 at 13:32
  • @spectras: from 80 to 300 – markzzz Dec 18 '18 at 13:33
  • @Matthieu Brucher: yes, it's written at the top – markzzz Dec 18 '18 at 13:34
  • Does your hardware support AVX? Then the optimized code may use AVX intrinsics. – Max Dec 18 '18 at 13:34
  • Can you post something that we can compile? The compiler may generate better intrinsics for your platform! – Matthieu Brucher Dec 18 '18 at 13:36
  • I would start with something like `__m128d sampleIndex_vec = _mm_set_pd(-1.0,-2.0);`, and `__m128d sampleIndex_add = _mm_set1_pd(2.0);` outside the loop. Inside the loop you can replace `__m128d result_mul = _mm_setr_pd(sampleIndex, sampleIndex + 1);` by `sampleIndex_vec = _mm_add_pd(sampleIndex_vec, sampleIndex_add);` and `result_mul = sampleIndex_vec`. In this way you get rid of the nasty `_mm_setr_pd(sampleIndex, sampleIndex + 1);`. (Not tested.) – wim Dec 18 '18 at 15:02
  • @wim: nice, it really improves the performance! Now it's better than the C version... – markzzz Dec 18 '18 at 15:02
  • @markzzz My first comment had an error, so I deleted it and wrote a new one. (Check the results to see if your SIMD code is correct.) The idea is that you update the `sampleIndex` counter entirely in the SIMD domain, which is much more efficient than two integer to double conversions per iteration. Use gcc -S to see the difference between the two versions. – wim Dec 18 '18 at 15:10
  • @markzzz I don't use MSVC, but maybe this [question](https://stackoverflow.com/q/1020498) is useful for you. I would highly recommend inspecting the generated assembly code while writing SIMD intrinsics code. – wim Dec 18 '18 at 15:34
  • Do you need to multiply inside the loop? Is rounding error too much of a problem if you strength-reduce from `c + i*scale` to `tmp += step`? (Or `tmp0 += step*4; tmp1 += step*4; tmp2 += step*4; tmp3 += step*4;` to hide some of the FP add latency, and unroll by another factor of 4.) `_mm_set` inside the loop avoids a dependency bottleneck, but creates a throughput bottleneck. – Peter Cordes Dec 18 '18 at 20:12
  • @PeterCordes: nope! Rounding error is not much of a problem, so I could increment with an add at each iteration. But I'm not sure what you are suggesting :( – markzzz Dec 18 '18 at 20:25
  • Summary: don't try to be smarter than your compiler – pm100 Dec 18 '18 at 21:21
  • @pm100: not at all! As Peter Cordes proved, a human can beat the compiler, if you know what you are doing. – markzzz Dec 19 '18 at 13:11

2 Answers


On my system, g++ test.cpp -march=native -O2 -c -o test

This outputs the following for the normal version (loop body extract):

  30:   c5 f9 57 c0             vxorpd %xmm0,%xmm0,%xmm0
  34:   c5 fb 2a c0             vcvtsi2sd %eax,%xmm0,%xmm0
  38:   c4 e2 f1 99 c2          vfmadd132sd %xmm2,%xmm1,%xmm0
  3d:   c5 fb 11 04 c2          vmovsd %xmm0,(%rdx,%rax,8)
  42:   48 83 c0 01             add    $0x1,%rax
  46:   48 39 c8                cmp    %rcx,%rax
  49:   75 e5                   jne    30 <_Z11ProcessAutoii+0x30>

And for the intrinsics version:

  88:   c5 f9 57 c0             vxorpd %xmm0,%xmm0,%xmm0
  8c:   8d 50 01                lea    0x1(%rax),%edx
  8f:   c5 f1 57 c9             vxorpd %xmm1,%xmm1,%xmm1
  93:   c5 fb 2a c0             vcvtsi2sd %eax,%xmm0,%xmm0
  97:   c5 f3 2a ca             vcvtsi2sd %edx,%xmm1,%xmm1
  9b:   c5 f9 14 c1             vunpcklpd %xmm1,%xmm0,%xmm0
  9f:   c4 e2 e9 98 c3          vfmadd132pd %xmm3,%xmm2,%xmm0
  a4:   c5 f8 29 04 c1          vmovaps %xmm0,(%rcx,%rax,8)
  a9:   48 83 c0 02             add    $0x2,%rax
  ad:   48 39 f0                cmp    %rsi,%rax
  b0:   75 d6                   jne    88 <_Z11ProcessSSE2ii+0x38>

So in short: the compiler automatically generates AVX code from the C version.

Edit, after playing a bit more with flags to restrict both versions to SSE2:

g++ test.cpp -msse2 -O2 -c -o test

The compiler still does something different from what you generate with intrinsics. Compiler version:

  30:   66 0f ef c0             pxor   %xmm0,%xmm0
  34:   f2 0f 2a c0             cvtsi2sd %eax,%xmm0
  38:   f2 0f 59 c2             mulsd  %xmm2,%xmm0
  3c:   f2 0f 58 c1             addsd  %xmm1,%xmm0
  40:   f2 0f 11 04 c2          movsd  %xmm0,(%rdx,%rax,8)
  45:   48 83 c0 01             add    $0x1,%rax
  49:   48 39 c8                cmp    %rcx,%rax
  4c:   75 e2                   jne    30 <_Z11ProcessAutoii+0x30>

Intrinsics version:

  88:   66 0f ef c0             pxor   %xmm0,%xmm0
  8c:   8d 50 01                lea    0x1(%rax),%edx
  8f:   66 0f ef c9             pxor   %xmm1,%xmm1
  93:   f2 0f 2a c0             cvtsi2sd %eax,%xmm0
  97:   f2 0f 2a ca             cvtsi2sd %edx,%xmm1
  9b:   66 0f 14 c1             unpcklpd %xmm1,%xmm0
  9f:   66 0f 59 c3             mulpd  %xmm3,%xmm0
  a3:   66 0f 58 c2             addpd  %xmm2,%xmm0
  a7:   0f 29 04 c1             movaps %xmm0,(%rcx,%rax,8)
  ab:   48 83 c0 02             add    $0x2,%rax
  af:   48 39 f0                cmp    %rsi,%rax
  b2:   75 d4                   jne    88 <_Z11ProcessSSE2ii+0x38>

The compiler does not unroll the loop here. That might be better or worse depending on many things. You might want to benchmark both versions.

spectras
  • Isn't this "dangerous"? I mean: if I run that binary (generated automatically from the C version) on a CPU without AVX512, it will crash, no? – markzzz Dec 18 '18 at 13:43
  • Yes it will. Specify target arch with `-march` option. You can even compile several versions and pick one at runtime depending on detected CPU features. – spectras Dec 18 '18 at 13:46
  • @markzzz It won't behave as expected, not necessarily "crash". The compiler flag `-march=native` is saying "compile this to be run on the same processor as the compiler". If you have a specific target in mind, it's worth compiling for exactly that target. – Caleth Dec 18 '18 at 13:47
  • Might be worth noting that due to `-march=native`, even the intrinsics version features AVX instructions around the manual intrinsics. If you want to have several versions, you must put each in a different unit and compile with appropriate flags. – spectras Dec 18 '18 at 13:51
  • @spectras: check the Edit. It seems that now it's better than the C version :) – markzzz Dec 18 '18 at 15:06
  • @markzzz that depends a lot on compiler choices, which in turn depend on specific version and flags. To compare the results, we'd need the disassembled loop body, as I put in my post. – spectras Dec 18 '18 at 15:22
  • It seems gcc is way smarter than MSVC anyway hehe – markzzz Dec 18 '18 at 15:27
  • None of your asm is using AVX512. It's using AVX1 VEX encoding with 2 or 3 byte prefixes (first byte is `c4` or `c5`), not 4-byte EVEX prefixes. But yes, auto-vectorization does a better job than the manually-vectorized code that leads the compiler down the wrong path into int->fp conversion of scalars inside the loop. The problem here is really bad manual vectorization. – Peter Cordes Dec 18 '18 at 17:49
  • @markzzz: Yes, it's pretty well known that gcc optimizes better than MSVC most of the time. MSVC is often the worst out of the main 4 x86 compilers (clang, gcc, ICC, MSVC), and it has very limited choice of options for making binaries optimized for one machine. – Peter Cordes Dec 18 '18 at 17:51
  • @PeterCordes aside from wim's great suggestion to avoid the int->fp conversion, can the code be optimized more? – markzzz Dec 18 '18 at 17:55
  • @markzzz: sure, use 2, 3 or 4 `sampleIndex_acc` variables to hide the FP-add latency and bottleneck on throughput instead. i.e. unroll with multiple accumulators. – Peter Cordes Dec 18 '18 at 17:58
  • @PeterCordes may I ask you to give an example, please? – markzzz Dec 18 '18 at 17:59
  • @markzzz: [Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables?](https://stackoverflow.com/q/45113527) shows an example of using multiple accumulators to gain throughput in a loop with a loop-carried dependency (a dot-product). Along with plenty of performance analysis. – Peter Cordes Dec 18 '18 at 18:34
  • @PeterCordes too complicated for the knowledge I have right now! Thanks anyway ;) – markzzz Dec 18 '18 at 18:56
  • Oh, I just noticed that `vfmadd132sd` is *scalar double*. It generated AVX instructions of course, but it didn't auto-vectorize. Presumably manually vectorized is slower because (on Intel CPUs) `vcvtsi2sd` requires a port 1 and port 5 uop, so adding in an unpack shuffle results in a bottleneck on port 5. Otherwise it looks somewhat reasonable. Of course the OP was using MSVC, not gcc, so this answer isn't very informative. (https://godbolt.org/ has MSVC installed so you could check.) – Peter Cordes Dec 18 '18 at 21:28

Why? Is _mm_setr_pd expensive?

Somewhat; it takes at least a shuffle. More importantly in this case, computing each scalar operand is expensive, and as @spectras' answer shows, gcc at least fails to auto-vectorize that into paddd / cvtdq2pd. Instead it re-computes each operand from a scalar integer, doing the int->double conversion separately, then shuffles those together.
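
To illustrate that pattern (an untested sketch, reusing pC, blockSize, value_add and deltaValue_mul from the question's function): keep the index in an integer vector, so each iteration does one paddd, one cvtdq2pd, one packed multiply, one packed add and one store, instead of two scalar int->double conversions plus an unpack.

// Untested sketch of the paddd / cvtdq2pd pattern (names from the question's code)
__m128i idx_v   = _mm_setr_epi32(0, 1, 0, 0);   // the two indices live in the low 64 bits
__m128i idx_inc = _mm_set1_epi32(2);

for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex += 2) {
    __m128d fidx   = _mm_cvtepi32_pd(idx_v);    // cvtdq2pd: low two int32 -> two doubles
    __m128d result = _mm_add_pd(_mm_mul_pd(fidx, deltaValue_mul), value_add);
    _mm_store_pd(pC + sampleIndex, result);
    idx_v = _mm_add_epi32(idx_v, idx_inc);      // paddd
}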

This is the function I'm trying to optimize:

You're simply filling an array with a linear function. You're re-multiplying every time inside the loop. That avoids a loop-carried dependency on anything except the integer loop counter, but you run into throughput bottlenecks from doing so much work inside the loop.

i.e. you're computing a[i] = c + i*scale separately for every step. But instead you can strength-reduce that to a[i+n] = a[i] + (n*scale). So you only have one addpd instruction per vector of results.
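
In scalar terms (just to show the strength reduction itself, using the names from the question), the loop would become:

// Scalar sketch: one FP add per element instead of an int->double
// conversion, a multiply and an add per element.
double v = value;
for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
    pC[sampleIndex] = v;
    v += deltaValue;
}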

This will introduce some rounding error that accumulates vs. redoing the computation from scratch, but double is probably overkill for what you're doing anyway.

It also comes at the cost of introducing a serial dependency on an FP add instead of integer. But you already have a loop-carried FP add dependency chain in your "optimized" version that uses sampleIndex_acc = _mm_add_pd(sampleIndex_acc, sampleIndex_add); inside the loop, using FP += 2.0 instead of re-converting from integer.

So you'll want to unroll with multiple vectors to hide that FP latency, and keep at least 3 or 4 FP additions in flight at once. (Haswell: 3 cycle latency, one per clock throughput. Skylake: 4 cycle latency, 2 per clock throughput.) See also Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? for more about unrolling with multiple accumulators for a similar problem with loop-carried dependencies (a dot product).

void Process(int voiceIndex, int blockSize) {    
    double *pC = c[voiceIndex];
    double val0 = start + step * delta;
    double deltaValue = rate * delta;

    __m128d vdelta2 = _mm_set1_pd(2 * deltaValue);
    __m128d vdelta4 = _mm_add_pd(vdelta2, vdelta2);

    __m128d v0 = _mm_setr_pd(val0, val0 + deltaValue);
    __m128d v1 = _mm_add_pd(v0, vdelta2);
    __m128d v2 = _mm_add_pd(v0, vdelta4);
    __m128d v3 = _mm_add_pd(v1, vdelta4);

    __m128d vdelta8 = _mm_mul_pd(vdelta2, _mm_set1_pd(4.0));

    double *endp = pC + blockSize - 7;  // stop if there's only room for 7 or fewer doubles
      // or use -8 and have your cleanup handle lengths of 1..8
      // since the inner loop always calculates results for next iteration
    for (; pC < endp ; pC += 8) {
        _mm_store_pd(pC, v0);
        v0 = _mm_add_pd(v0, vdelta8);

        _mm_store_pd(pC+2, v1);
        v1 = _mm_add_pd(v1, vdelta8);

        _mm_store_pd(pC+4, v2);
        v2 = _mm_add_pd(v2, vdelta8);

        _mm_store_pd(pC+6, v3);
        v3 = _mm_add_pd(v3, vdelta8);
    }
    // if (blockSize % 8 != 0) ... store final vectors
}

The choice of whether to add or multiply when building up vdelta4 / vdelta8 is not very significant; I tried to avoid too long a dependency chain before the first stores can happen. Since v0 through v3 need to be calculated as well, it seemed to make sense to create a vdelta4 instead of just making a chain of v2 = v1+vdelta2. Maybe it would have been better to create vdelta4 with a multiply from 4.0*deltaValue, and double it to get vdelta8. This could be relevant for very small block size, especially if you cache-block your code by only generating small chunks of this array as needed, right before it will be read.

Anyway, this compiles to a very efficient inner loop with gcc and MSVC (on the Godbolt compiler explorer).

;; MSVC -O2
$LL4@Process:                    ; do {
    movups  XMMWORD PTR [rax], xmm5
    movups  XMMWORD PTR [rax+16], xmm0
    movups  XMMWORD PTR [rax+32], xmm1
    movups  XMMWORD PTR [rax+48], xmm2
    add     rax, 64                             ; 00000040H
    addpd   xmm5, xmm3              ; v0 += vdelta8
    addpd   xmm0, xmm3              ; v1 += vdelta8
    addpd   xmm1, xmm3              ; v2 += vdelta8
    addpd   xmm2, xmm3              ; v3 += vdelta8
    cmp     rax, rcx
    jb      SHORT $LL4@Process   ; }while(pC < endp)

This has 4 separate dependency chains, through xmm0, 1, 2, and 5. So there's enough instruction-level parallelism to keep 4 addpd instructions in flight. This is more than enough for Haswell, but half of what Skylake can sustain.

Still, with a store throughput of 1 vector per clock, more than 1 addpd per clock isn't useful. In theory this can run at about 16 bytes per clock cycle, and saturate store throughput. i.e. 1 vector / 2 doubles per clock.

AVX with wider vectors (4 doubles) could still go at 1 vector per clock on Haswell and later, i.e. 32 bytes per clock. (Assuming the output array is hot in L1d cache or possibly even L2.)
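
A rough, untested sketch of that 256-bit version, for comparison (the question's array is only alignas(16), so this uses unaligned stores; with alignas(32) the aligned store would be fine):

void ProcessAVX(int voiceIndex, int blockSize) {
    double *pC = c[voiceIndex];
    double val0 = start + step * delta;
    double deltaValue = rate * delta;

    __m256d vdelta4 = _mm256_set1_pd(4 * deltaValue);
    __m256d v0 = _mm256_setr_pd(val0, val0 + deltaValue,
                                val0 + 2 * deltaValue, val0 + 3 * deltaValue);
    __m256d v1 = _mm256_add_pd(v0, vdelta4);
    __m256d v2 = _mm256_add_pd(v1, vdelta4);
    __m256d v3 = _mm256_add_pd(v2, vdelta4);
    __m256d vdelta16 = _mm256_set1_pd(16 * deltaValue);

    double *endp = pC + blockSize - 15;          // same bound logic as the SSE2 loop
    for (; pC < endp; pC += 16) {
        _mm256_storeu_pd(pC,      v0);  v0 = _mm256_add_pd(v0, vdelta16);
        _mm256_storeu_pd(pC + 4,  v1);  v1 = _mm256_add_pd(v1, vdelta16);
        _mm256_storeu_pd(pC + 8,  v2);  v2 = _mm256_add_pd(v2, vdelta16);
        _mm256_storeu_pd(pC + 12, v3);  v3 = _mm256_add_pd(v3, vdelta16);
    }
    // cleanup for blockSize % 16 omitted, as in the SSE2 version
}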


Even better: don't store this data in memory at all; re-generate on the fly.

Generate it on the fly when it's needed, if the code consuming it only reads it a few times, and is also manually vectorized.
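
For example, a hypothetical consumer that applies the ramp as a per-sample gain could step it in a register and never touch c[] at all (untested sketch; the buffer names are made up, and it could be unrolled with multiple ramp accumulators exactly like the store loop above):

// Hypothetical consumer: use the ramp on the fly instead of storing it to c[].
void ApplyRamp(int blockSize, const double *in, double *out) {
    double value = start + step * delta;
    double deltaValue = rate * delta;

    __m128d ramp  = _mm_setr_pd(value, value + deltaValue);
    __m128d rstep = _mm_set1_pd(2 * deltaValue);

    for (int i = 0; i < blockSize; i += 2) {          // assumes blockSize is even / padded
        __m128d x = _mm_loadu_pd(in + i);
        _mm_storeu_pd(out + i, _mm_mul_pd(x, ramp));  // ramp consumed here, no c[] round-trip
        ramp = _mm_add_pd(ramp, rstep);               // advance the ramp in the register
    }
}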

Peter Cordes
  • It's great seeing someone more knowledgeable picking up where I stopped and going much further. That's the kind of thing that makes one better, one step at a time. – spectras Dec 18 '18 at 21:48
  • Awesome! I'm learning SO much with your answers, mate; you are so knowledgeable and kind! Thanks, thanks and thanks!!! – markzzz Dec 19 '18 at 10:45
  • Does it make sense to use += 16 instead of 8 (i.e. vdelta10, vdelta12, vdelta14, vdelta16)? It is using only 4 of the 8 SSE2 registers (if I'm compiling for 32-bit; otherwise I can have 16 registers). Or is it better to leave "space" for something else? – markzzz Dec 19 '18 at 10:46
  • @markzzz: In this case you want to leave at least one register for `vdeltaX` itself. If you used all 8 for accumulators, the compiler would have to spill something, either `vdeltaX` and do `addps xmm0, [esp]` / `addps xmm1, [esp]` / ... (an extra load in every addps), or spill/reload one of the `v0..7` accumulators, lengthening the loop-carried dep chain by a store/reload, defeating the purpose of the exercise. It doesn't have to be a power of 2, though: you could unroll with 6 or 7 accumulators, v0..v5 and `pC += 12`. Worth trying, could help. But then you probably need to write cleanup code – Peter Cordes Dec 19 '18 at 14:45
  • @PeterCordes: I see, thanks! And what about the "store final vectors"? Let's keep your code: do I need to introduce a new loop, iterating from pC to blockSize minus the current position, and do each add singly? Or is there a smarter way? – markzzz Dec 19 '18 at 16:01
  • @markzzz: you could write a loop, but then you couldn't take advantage of the `v0..v3` that were already computed inside the main loop. (Unless you stored them to a tmp array or something.) Probably you want to fully unroll, like `if (blocksize % 8 > 2) _mm_store_pd(pC, v0); if (blocksize % 8 > 4) _mm_store_pd(pC+2, v1);` and so on. If blocksize can be odd, then you need to handle that, too. You might consider using the `c * i*scale` formula in the cleanup code if it's getting over-complicated, but as long as you can store whole vectors it's not too bad. – Peter Cordes Dec 19 '18 at 16:08
  • Yes, it can also be odd. What if I simply "place" the original code there? Will it really impact performance for a "single" iteration, in your opinion? – markzzz Dec 19 '18 at 16:20
  • @markzzz: if blocksize is typically large, then yeah the cleanup will only be a small fraction of the time. You can certainly just use the old scalar cleanup. It's worse than taking some advantage of the already-computed `v0..v3`, but the total cleanup time should still be a small fraction of the total time. – Peter Cordes Dec 19 '18 at 16:27
  • @PeterCordes blockSize will range from 80 to 256, more or less – markzzz Dec 19 '18 at 20:52
  • @markzzz: oh, that's not huge. A well-optimized cleanup block would be a good idea. If you can at least leave padding so you can always store whole vectors (in this can an even number of `double`s), that helps a lot. For `c[voiceIndex]` to be 16-byte aligned, you need that anyway if it's a 2D matrix (rather than an array of pointers). – Peter Cordes Dec 19 '18 at 21:00
  • @PeterCordes padding? What do you mean? Can you give a smart example as you did for this whole question? – markzzz Dec 19 '18 at 21:02
  • When you define `alignas(16) double c[13][blocksize]`, if you want to use a blocksize of 125, still define it like `c[13][126]`, so each row has one unused element at the end. You can define a macro like `#define roundup2(x) ((x + 1UL) & -2UL)`, so you can do `alignas (16) double c[13][roundup2(blocksize)]`. – Peter Cordes Dec 19 '18 at 21:06
  • @PeterCordes: oh I see. So better to use a bit "more" memory instead of introducing a branch and reworking the code. Seems very smart! I'll give it a try... – markzzz Dec 20 '18 at 10:38
  • @PeterCordes: are you sure about `double *endp = pC + blocksize - 8;`? Shouldn't it be `double *endp = pC + blocksize - 7;`? – markzzz Dec 20 '18 at 14:54
  • @markzzz: oh yeah, I think `-7` is right. `-8` would leave room for a full set of 4 vectors without going off the end. But depending on your cleanup code, you may actually want `-8`. The inner loop calculates a full set of results for next iteration, and you don't need that on the last iteration. – Peter Cordes Dec 20 '18 at 16:42
  • @PeterCordes: I see. I did a `#define roundintup8(x) ((x + 7) & -8)` and then `double *pCEnd = pC + roundintup8(blockSize);` as the end point. This way, yes: it does a "futile" calculation in the last iteration, but I save on cleanup code (i.e. I don't need it anymore). I hope it won't take too much CPU to calculate the "futile" slots (which is the same as if I pad the buffer to a multiple of 8, anyway). – markzzz Dec 21 '18 at 08:33
  • The loop-exit branch will typically mispredict on the last iteration (where it falls through instead of being taken), so there's plenty of time to run those extra ADDPD instructions that nothing is waiting for the result of anyway. Or you could round down and do a fixed block of cleanup stores, but yeah your current solution is pretty much fine, especially on modern CPUs where out-of-order execution overlaps execution of everything anyway. – Peter Cordes Dec 21 '18 at 08:44
  • I have no idea what "out-of-order execution" is hehe, but I trust you ;) – markzzz Dec 21 '18 at 09:33