
I am computing eight dot products at once with AVX. In my current code I do something like this (before unrolling):

Ivy-Bridge/Sandy-Bridge

__m256 areg0 = _mm256_set1_ps(a[m]);
for(int i=0; i<n; i++) {        
    __m256 breg0 = _mm256_load_ps(&b[8*i]);
    tmp0 = _mm256_add_ps(_mm256_mul_ps(areg0,breg0), tmp0); 
}

Haswell

__m256 areg0 = _mm256_set1_ps(a[m]);
for(int i=0; i<n; i++) {      
    __m256 breg0 = _mm256_load_ps(&b[8*i]);
    tmp0 = _mm256_fmadd_ps(areg0, breg0, tmp0);
}

How many times do I need to unroll the loop for each case to ensure maximum throughput?

For Haswell using FMA3 I think the answer is here: FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2. I need to unroll the loop 10 times.

For Ivy Bridge I think it's 8. Here is my logic. The AVX addition has a latency of 3 and the multiplication a latency of 5. Ivy Bridge can do one AVX multiplication and one AVX addition at the same time using different ports. Using the notation m for multiplication, a for addition, and x for no operation as well as a number to indicate the partial sum (e.g. m5 means multiplication with the 5th partial sum) I can write:

port0:  m1  m2  m3  m4  m5  m6  m7  m8  m1  m2  m3  m4  m5  ... 
port1:   x   x   x   x   x  a1  a2  a3  a4  a5  a6  a7  a8  ...

So by using 8 partial sums, after nine clock cycles (four for the load and five for the multiplication) I can submit one AVX load, one AVX multiplication, and one AVX addition every clock cycle.

I guess this means that it's impossible to achieve the maximum throughput for this task in 32-bit mode on Ivy Bridge and Haswell, since 32-bit mode only has eight AVX registers?

Edit: In regards to the bounty. My main questions still hold. I want to get the maximum throughput of either the Ivy Bridge or Haswell functions above, n can be any value greater than or equal to 64. I think this can only be done using unrolling (eight times for Ivy Bridge and 10 times for Haswell). If you think this can be done with another method then let's see it. In some sense this is a variation of How do I achieve the theoretical maximum of 4 FLOPs per cycle?. But instead of only multiplication and addition I'm looking for one 256-bit load (or two 128-bit loads), one AVX multiplication, and one AVX addition every clock cycle with Ivy Bridge or two 256-bit loads and two FMA3 instructions per clock cycle.

I would also like to know how many registers are necessary. For Ivy Bridge I think it's 10. One for the broadcast, one for the load (only one due to register renaming), and eight for the eight partial sums. So I don't think this can be done in 32-bit mode (and indeed when I run in 32-bit mode the performance drops significantly).

I should point out that the compiler can give misleading results: Difference in performance between MSVC and GCC for highly optimized matrix multiplication code

The current function I'm using for Ivy Bridge is below. This basically multiplies one row of a 64x64 matrix a with all of a 64x64 matrix b (I run this function 64 times on each row of a to get the full matrix multiply in matrix c).

#include <immintrin.h>
extern "C" void row_m64x64(const float *a, const float *b, float *c) {      
    const int vec_size = 8;
    const int n = 64;
    __m256 tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7;
    tmp0 = _mm256_loadu_ps(&c[0*vec_size]);
    tmp1 = _mm256_loadu_ps(&c[1*vec_size]);
    tmp2 = _mm256_loadu_ps(&c[2*vec_size]);
    tmp3 = _mm256_loadu_ps(&c[3*vec_size]);
    tmp4 = _mm256_loadu_ps(&c[4*vec_size]);
    tmp5 = _mm256_loadu_ps(&c[5*vec_size]);
    tmp6 = _mm256_loadu_ps(&c[6*vec_size]);
    tmp7 = _mm256_loadu_ps(&c[7*vec_size]);

    for(int i=0; i<n; i++) {
        __m256 areg0 = _mm256_set1_ps(a[i]);

        __m256 breg0 = _mm256_loadu_ps(&b[vec_size*(8*i + 0)]);
        tmp0 = _mm256_add_ps(_mm256_mul_ps(areg0,breg0), tmp0);    
        __m256 breg1 = _mm256_loadu_ps(&b[vec_size*(8*i + 1)]);
        tmp1 = _mm256_add_ps(_mm256_mul_ps(areg0,breg1), tmp1);
        __m256 breg2 = _mm256_loadu_ps(&b[vec_size*(8*i + 2)]);
        tmp2 = _mm256_add_ps(_mm256_mul_ps(areg0,breg2), tmp2);    
        __m256 breg3 = _mm256_loadu_ps(&b[vec_size*(8*i + 3)]);
        tmp3 = _mm256_add_ps(_mm256_mul_ps(areg0,breg3), tmp3);   
        __m256 breg4 = _mm256_loadu_ps(&b[vec_size*(8*i + 4)]);
        tmp4 = _mm256_add_ps(_mm256_mul_ps(areg0,breg4), tmp4);    
        __m256 breg5 = _mm256_loadu_ps(&b[vec_size*(8*i + 5)]);
        tmp5 = _mm256_add_ps(_mm256_mul_ps(areg0,breg5), tmp5);    
        __m256 breg6 = _mm256_loadu_ps(&b[vec_size*(8*i + 6)]);
        tmp6 = _mm256_add_ps(_mm256_mul_ps(areg0,breg6), tmp6);    
        __m256 breg7 = _mm256_loadu_ps(&b[vec_size*(8*i + 7)]);
        tmp7 = _mm256_add_ps(_mm256_mul_ps(areg0,breg7), tmp7);    
    }
    _mm256_storeu_ps(&c[0*vec_size], tmp0);
    _mm256_storeu_ps(&c[1*vec_size], tmp1);
    _mm256_storeu_ps(&c[2*vec_size], tmp2);
    _mm256_storeu_ps(&c[3*vec_size], tmp3);
    _mm256_storeu_ps(&c[4*vec_size], tmp4);
    _mm256_storeu_ps(&c[5*vec_size], tmp5);
    _mm256_storeu_ps(&c[6*vec_size], tmp6);
    _mm256_storeu_ps(&c[7*vec_size], tmp7);
}
Z boson
  • Be aware that unrolling small loops can have a negative effect on performance on Core i7 and later. The Loop Stream Detector can only cache 28 µops on Nehalem - I'm not sure if that size has increased in Ivy Bridge/Haswell. – Paul R Jan 13 '14 at 12:21
  • See also: http://stackoverflow.com/questions/3080848/how-many-times-should-a-loop-be-unwinded and http://stackoverflow.com/questions/2349211/when-if-ever-is-loop-unrolling-still-useful – Paul R Jan 13 '14 at 12:25
  • That's worth looking into. Let me read up on that. I unroll by eight currently because it gives the best result. I got there empirically (I tried unrolling 2, 4, 8, 16). Now I want to understand the theory. – Z boson Jan 13 '14 at 12:26
  • What exactly do you aim to achieve with unrolling, avoid easily predictable branches? The OOO core is able to execute across multiple iterations; you should still be bounded only on the `tmp0` dependency. – Leeor Jan 13 '14 at 13:09
  • @PaulR, I checked Fog's microarchitecture manual. The µop cache in Sandy Bridge/Ivy Bridge has a "maximum capacity of 1536 μops." That's quite large. When it comes to dependency chains, the only way to get the maximum throughput is to unroll the loop. You might have a case where too much unrolling has a negative effect, but there is no question that unrolling a dependency chain a few times is still helpful. I doubt unrolling 10 times stresses the µop cache. – Z boson Jan 13 '14 at 13:09
  • He wasn't talking about the uop cache, the loop stream detector is much smaller (and more effective in terms of bandwidth) – Leeor Jan 13 '14 at 13:11
  • @Leeor is correct - the LSD has a small µop cache of its own which enables small loops to be optimised. This is separate from the instruction decoder's µop cache. – Paul R Jan 13 '14 at 13:19
  • okay, let me look into the LSD. But I know for a fact that unrolling a few times in dependency chains is still quite helpful. I'm surprised that's even being debated here. Maybe I misunderstand what you both mean, but it sounds as if both of you think any unrolling on Sandy Bridge and later is unnecessary or even bad. – Z boson Jan 13 '14 at 13:19
  • Loop unrolling (either manually or letting the compiler do it for you) is fine if you know what you're doing and are aware of all the trade-offs. For very small loops as in your example you need to be aware of the impact of spilling out of the LSD µop cache - that's all. – Paul R Jan 13 '14 at 13:21
  • @PaulR, ignoring the LSD do you think my conclusion of unrolling 10 times with Ivy Bridge is correct? – Z boson Jan 13 '14 at 13:24
  • In my experience unrolling beyond 4x is rarely beneficial, but I suggest you try it - make sure you disable compiler loop unrolling and any other optimisations which might get in the way, then try benchmarking with various loop unrolling factors. – Paul R Jan 13 '14 at 13:26
  • I'm also questioning the necessity of unrolling a dependency chain of such long latency operations like FMAs. The benefit of having the index ready statically is negligible, your performance is governed by the vector unit bandwidth/latency - calculating the indices (and predicting some branches) can be done in parallel with no impact. Only profiling can tell who's right though. – Leeor Jan 13 '14 at 13:28
  • @PaulR, okay, I found your discussion of the LSD in Fog's manual. "It is advantageous to make loops smaller than 28 μops in order to take advantage of the loop buffer." under a section called the Loopback Buffer. He says it has not changed in Sandy Bridge/Ivy Bridge. – Z boson Jan 13 '14 at 13:29
  • @PaulR As far as I can tell, at least for Ivy Bridge, I can unroll at least eight times (maybe even nine) and still fit in the 28 µop loop buffer. The broadcast only has to be done once in the loop. So it's one load, one multiplication, and one add, which are each one µop (3 total). I have not checked FMA3 yet - the LSD changed in Haswell. – Z boson Jan 13 '14 at 13:59
  • I'd be interested to see actual measured benchmark data for unroll factor versus throughput once you have it. – Paul R Jan 13 '14 at 14:58
  • @Leeor, you wrote "The OOO is able to execute across multiple iterations, you should still be bounded only on the tmp0 dependency." Are you saying that I should only unroll by three then (the latency of the addition)? There is no way out of the dependency as far as I can see except to break the dependency by unrolling. The question is how much should I unroll. I first thought 3, then 5, then 8, now 10. Now that I know about the LSD I can see there is another limiting factor. – Z boson Jan 13 '14 at 15:02
  • @PaulR, all I can say right now is that when I wrote my GEMM code I unrolled by 2,4,8,16 (because that fit well with how I was doing tiling and caching) and eight worked the best. In any case, thank you for telling me about the LSD. That's useful information. But I'm not satisfied with the answers in the comments so far. I think the answer, at least for Ivy Bridge is around eight. I'll make this a bounty question if I need to. – Z boson Jan 13 '14 at 15:08
  • @Zboson If you try hard enough, it's possible to get max throughput with very small number of registers. But you have to "game" the register renaming. I had an example where I was able to saturate the multiplier using only 2 SIMD registers. I bet it's possible to saturate both FMA units using as few as 3 registers. That said, if you want to do anything particularly useful, you're gonna need more than that. – Mysticial Jan 13 '14 at 18:26
  • As your code is currently written, it's impossible to get more than half the FMA throughput regardless of how it's unrolled. Haswell can do 2 FMAs/cycle as well as 2 loads/cycle. But you have 2 loads for every FMA. So you will saturate your load units. – Mysticial Jan 13 '14 at 18:28
  • And even if you get the ratio to be 1:1. You're trying to push 4 instructions/cycle. Haswell is designed to sustain only about 4/cycle. (though it can go higher in small bursts) That leaves no room for overhead like loop counters and branching. – Mysticial Jan 13 '14 at 18:31
  • @Mysticial, thanks for your comment. Let me look into register renaming again... as to your comment about 2 loads per FMA, I think it's only one. The broadcast `_mm256_set1_ps(a[i])` only has to be done once in the loop, so if I unroll a few times I only have to load from the `b` array for each partial sum. But maybe I misunderstand what you mean. I will update my question tomorrow. I think I will make this a big bounty question (250-500 points - why not) in case you're interested. It's the critical part of my GEMM code. – Z boson Jan 13 '14 at 19:02
  • @Mysticial, I think I see what you mean. It can do two 128-bit loads/cycle but only one 256-bit load/cycle. I don't know how to fix that. I updated my question with my inner loop. – Z boson Jan 13 '14 at 19:33
  • That's only for Sandy/Ivy Bridge. Haswell can do two 256-bit loads/cycle. I was referring to your original loop where (I thought) you had a broadcast and a load for each FMA. – Mysticial Jan 13 '14 at 19:34
  • @PaulR, is the compiler even allowed to auto unroll this loop? I think unrolling breaks IEEE floating point rules (unless I allow a relaxed floating point model, e.g. with `-ffast-math`, which I don't) due to floating point not being associative. – Z boson Jan 14 '14 at 10:30
  • It depends on what you mean by unrolling, but yes, it can unroll the loop in a basic way, which reduces loop overhead and may give better instruction scheduling, but it can't do the kind of loop unrolling that you're probably talking about, where you have multiple intermediate terms. – Paul R Jan 14 '14 at 10:58
  • @PaulR, for Haswell I get the best speedup unrolling 10 times. See my answer. – Z boson Feb 06 '14 at 19:48
  • I'd like to see the graph of throughput versus unroll factor - I would guess that it's fairly flat around 10x, i.e. a fairly broad "sweet spot" with not a great deal of change beyond, say, 8x? – Paul R Feb 06 '14 at 21:47
  • In my experience the best way to unroll a loop of `for (data) { load; process; stores; }` is to rearrange to move the data dependencies as far apart as possible: `load 0; process 0; for (data 1..) { load n; process n; store n - 1; } store n`. The compiler will schedule the dependencies in the loop (and will probably move the stores up) but it will have an easier time because the stores are essentially "filler" in the loop. – Ben Jackson Feb 10 '14 at 07:35

2 Answers


For Sandy/Ivy Bridge you need to unroll by 3:

  • Only FP Add has dependency on the previous iteration of the loop
  • FP Add can issue every cycle
  • FP Add takes three cycles to complete
  • Thus unrolling by 3/1 = 3 completely hides the latency
  • FP Mul and FP Load do not have a dependency on the previous iteration and you can rely on the OoO core to issue them in the near-optimal order. These instructions could affect the unroll factor only if they lowered the throughput of FP Add (not the case here, FP Load + FP Add + FP Mul can issue every cycle).

For Haswell you need to unroll by 10:

  • Only FMA has dependency on the previous iteration of the loop
  • FMA can double-issue every cycle (i.e. on average independent instructions take 0.5 cycles)
  • FMA has latency of 5
  • Thus unrolling by 5/0.5 = 10 completely hides FMA latency
  • The two FP Load microoperations do not have a dependency on the previous iteration, and can co-issue with 2x FMA, so they don't affect the unroll factor.
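The reasoning in both bullet lists can be illustrated with a plain scalar sketch (my own illustration, not part of the original answer): all that matters is the number of independent partial sums, which is the latency of the loop-carried operation times its per-cycle throughput.

```cpp
#include <cstddef>

// Sketch: ACC independent accumulators split the single loop-carried
// dependency chain into ACC chains, so the out-of-order core can keep
// ACC adds (or FMAs) in flight at once. ACC = 3 matches Sandy/Ivy Bridge
// (add latency 3 x throughput 1/cycle); ACC = 10 matches Haswell FMA
// (latency 5 x throughput 2/cycle). Assumes n is a multiple of ACC.
template <int ACC>
float dot_unrolled(const float *a, const float *b, std::size_t n) {
    float s[ACC] = {0.0f};
    for (std::size_t i = 0; i < n; i += ACC)
        for (int k = 0; k < ACC; k++)
            s[k] += a[i + k] * b[i + k];  // each s[k] is an independent chain
    float total = 0.0f;
    for (int k = 0; k < ACC; k++)
        total += s[k];                    // combine the partial sums once at the end
    return total;
}
```

For a compile-time constant ACC a compiler will fully unroll the inner loop, giving the same shape as the intrinsic kernels in the other answer.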
Marat Dukhan
  • I get the logic. It's not clear to me how the OoO logic takes care of this but I know what to do in the code. I'll have to read the Intel manuals to understand more. The fact that I unroll 8 times probably gives the best result due to how I have arranged the memory and the 64x64 matrix. 32-bit is slower because I have required more than eight registers. But this still means with Haswell it's impossible to get the max throughput in 32-bit mode. – Z boson Feb 05 '14 at 09:58
  • I just realized that I'm not actually unrolling the loop in my question. I'm doing eight 8-wide dot products (64 dot products). I'm reading/writing one row of the 64x64 matrix using eight AVX accumulators. This does not really change the question though. It just becomes how many accumulators are necessary for maximum throughput, and the answer is the same. – Z boson Feb 06 '14 at 11:57
  • [Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)](//stackoverflow.com/q/45113527) shows that using more accumulators than the bare minimum helps. Apparently scheduling isn't always perfect, and/or absorbing memory latency variations works better with more accumulators. – Peter Cordes Oct 29 '19 at 23:31

I'm only answering my own question here to add information.

I went ahead and profiled the Ivy Bridge code. When I first tested this in MSVC2012, unrolling by more than two did not help much. However, I suspected that MSVC did not implement the intrinsics optimally based on my observation at Difference in performance between MSVC and GCC for highly optimized matrix multiplication code. So I compiled the kernel in GCC with g++ -c -mavx -O3 -mabi=ms, converted the object to COFF64, and dropped it into MSVC. I now get that unrolling by three gives the best result, confirming Marat Dukhan's answer.

Here are the times in seconds, Xeon E5 1620 @3.6GHz MSVC2012

unroll    time default            time with GCC kernel
     1    3.7                     3.2
     2    1.8 (2.0x faster)       1.6 (2.0x faster)
     3    1.6 (2.3x faster)       1.2 (2.7x faster)
     4    1.6 (2.3x faster)       1.2 (2.7x faster)

Here are the times on i5-4250U using fma with GCC in Linux (g++ -mavx -mfma -fopenmp -O3 main.cpp kernel_fma.cpp -o sum_fma)

unroll    time
     1    20.3
     2    10.2 (2.0x faster)
     3     6.7 (3.0x faster) 
     4     5.2 (4.0x faster)
     8     2.9 (7.0x faster)
    10     2.6 (7.8x faster)

The code below is for Sandy-Bridge/Ivy Bridge. For Haswell use e.g. tmp0 = _mm256_fmadd_ps(a8,b8_1,tmp0) instead.

kernel.cpp

#include <immintrin.h>

extern "C" void foo_unroll1(const int n, const float *b, float *c) {      
    __m256 tmp0 = _mm256_set1_ps(0.0f);
    __m256 a8 = _mm256_set1_ps(1.0f);
    for(int i=0; i<n; i+=8) {
        __m256 b8 = _mm256_loadu_ps(&b[i + 0]);
        tmp0 = _mm256_add_ps(_mm256_mul_ps(a8,b8), tmp0);
    }
    _mm256_storeu_ps(c, tmp0);
}

extern "C" void foo_unroll2(const int n, const float *b, float *c) {
    __m256 tmp0 = _mm256_set1_ps(0.0f);
    __m256 tmp1 = _mm256_set1_ps(0.0f);
    __m256 a8 = _mm256_set1_ps(1.0f);
    for(int i=0; i<n; i+=16) {
        __m256 b8_1 = _mm256_loadu_ps(&b[i + 0]);
        tmp0 = _mm256_add_ps(_mm256_mul_ps(a8,b8_1), tmp0);
        __m256 b8_2 = _mm256_loadu_ps(&b[i + 8]);
        tmp1 = _mm256_add_ps(_mm256_mul_ps(a8,b8_2), tmp1);
    }
    tmp0 = _mm256_add_ps(tmp0,tmp1);
    _mm256_storeu_ps(c, tmp0);
}

extern "C" void foo_unroll3(const int n, const float *b, float *c) { 
    __m256 tmp0 = _mm256_set1_ps(0.0f);
    __m256 tmp1 = _mm256_set1_ps(0.0f);
    __m256 tmp2 = _mm256_set1_ps(0.0f);
    __m256 a8 = _mm256_set1_ps(1.0f);
    for(int i=0; i<n; i+=24) {
        __m256 b8_1 = _mm256_loadu_ps(&b[i + 0]);
        tmp0 = _mm256_add_ps(_mm256_mul_ps(a8,b8_1), tmp0);
        __m256 b8_2 = _mm256_loadu_ps(&b[i + 8]);
        tmp1 = _mm256_add_ps(_mm256_mul_ps(a8,b8_2), tmp1);
        __m256 b8_3 = _mm256_loadu_ps(&b[i + 16]);
        tmp2 = _mm256_add_ps(_mm256_mul_ps(a8,b8_3), tmp2);
    }
    tmp0 = _mm256_add_ps(tmp0,_mm256_add_ps(tmp1,tmp2));
    _mm256_storeu_ps(c, tmp0);
}

extern "C" void foo_unroll4(const int n, const float *b, float *c) {      
    __m256 tmp0 = _mm256_set1_ps(0.0f);
    __m256 tmp1 = _mm256_set1_ps(0.0f);
    __m256 tmp2 = _mm256_set1_ps(0.0f);
    __m256 tmp3 = _mm256_set1_ps(0.0f);
    __m256 a8 = _mm256_set1_ps(1.0f);
    for(int i=0; i<n; i+=32) {
        __m256 b8_1 = _mm256_loadu_ps(&b[i + 0]);
        tmp0 = _mm256_add_ps(_mm256_mul_ps(a8,b8_1), tmp0);
        __m256 b8_2 = _mm256_loadu_ps(&b[i + 8]);
        tmp1 = _mm256_add_ps(_mm256_mul_ps(a8,b8_2), tmp1);
        __m256 b8_3 = _mm256_loadu_ps(&b[i + 16]);
        tmp2 = _mm256_add_ps(_mm256_mul_ps(a8,b8_3), tmp2);
        __m256 b8_4 = _mm256_loadu_ps(&b[i + 24]);
        tmp3 = _mm256_add_ps(_mm256_mul_ps(a8,b8_4), tmp3);
    }
    tmp0 = _mm256_add_ps(_mm256_add_ps(tmp0,tmp1),_mm256_add_ps(tmp2,tmp3));
    _mm256_storeu_ps(c, tmp0);
}

main.cpp

#include <stdio.h>
#include <omp.h>
#include <immintrin.h>

extern "C" void foo_unroll1(const int n, const float *b, float *c);
extern "C" void foo_unroll2(const int n, const float *b, float *c);
extern "C" void foo_unroll3(const int n, const float *b, float *c);
extern "C" void foo_unroll4(const int n, const float *b, float *c);

int main() {
    const int n = 3*1<<10;
    const int r = 10000000;
    double dtime;
    float *b = (float*)_mm_malloc(sizeof(float)*n, 64);
    float *c = (float*)_mm_malloc(sizeof(float)*8, 64);
    for(int i=0; i<n; i++) b[i] = 1.0f;

    dtime = omp_get_wtime();    
    for(int i=0; i<r; i++) foo_unroll1(n, b, c);
    dtime = omp_get_wtime() - dtime;
    printf("%f, ", dtime); for(int i=0; i<8; i++) printf("%f ", c[i]); printf("\n");

    dtime = omp_get_wtime();    
    for(int i=0; i<r; i++) foo_unroll2(n, b, c);
    dtime = omp_get_wtime() - dtime;
    printf("%f, ", dtime); for(int i=0; i<8; i++) printf("%f ", c[i]); printf("\n");

    dtime = omp_get_wtime();    
    for(int i=0; i<r; i++) foo_unroll3(n, b, c);
    dtime = omp_get_wtime() - dtime;
    printf("%f, ", dtime); for(int i=0; i<8; i++) printf("%f ", c[i]); printf("\n");

    dtime = omp_get_wtime();    
    for(int i=0; i<r; i++) foo_unroll4(n, b, c);
    dtime = omp_get_wtime() - dtime;
    printf("%f, ", dtime); for(int i=0; i<8; i++) printf("%f ", c[i]); printf("\n");
}
Z boson
  • I'm not certain if it is helpful, but you could add a __restrict to parameter b and c. –  Feb 07 '14 at 23:17
  • What is the multiplication with 1f for? –  Feb 07 '14 at 23:21
  • Why are there variables used for b8_1...b8_4 as these expressions are used only once? –  Feb 07 '14 at 23:23
  • When profiling, how do you account for the CPU's "turbo speed" feature, which would affect the execution time via a modulated clock frequency? – codechimp Apr 02 '15 at 00:58
  • @Z boson FYI, the code doesn't execute exactly as expected. Because a8 is set to all 1's, GCC optimizes away the `_mm256_mul_ps` operation. Changing a8 to some other value, like 5.0f, is required. For unroll1 the missing MULPS step doesn't impact total time, but it does have a noticeable impact at unroll4. – codechimp Apr 02 '15 at 20:16
  • @user4602856, interesting observation. I did most of this with MSVC, which may not have optimized the 1's away (I have since stopped using MSVC). I may look into this again... – Z boson Apr 12 '15 at 19:56