
I have vectorized the inner loop of matrix addition using AVX2 intrinsics, and I also have the latency table from here. I expected the speedup to be about a factor of 5, because roughly 4 cycles of latency per iteration over 1024 iterations, versus 6 cycles of latency per iteration over 128 iterations, works out to about 5x; but the measured speedup is only a factor of 3. So the question is: what else is going on that I don't see? I'm using gcc, coding in C with intrinsics, and the CPU is a Skylake i7-6700HQ.

Here is the C code and the assembly output of the inner loop.

global data:

int __attribute__(( aligned(32))) a[MAX1][MAX2] ;
int __attribute__(( aligned(32))) b[MAX2][MAX3] ;
int __attribute__(( aligned(32))) c_result[MAX1][MAX3] ;

sequential :

for( i = 0 ; i < MAX1 ; i++)
        for(j = 0 ; j < MAX2 ; j++)
            c_result[i][j] = a[i][j] + b[i][j];

.L16:
    movl    (%r9,%rax), %edx           // latency : 2  , throughput : 0.5   number of execution unit : 4 ALU 
    addl    (%r8,%rax), %edx           // latency : dont know , throughput :    0.5     number of execution unit : 4 ALU 
    movl    %edx, c_result(%rcx,%rax)  // latency : 2 , throughput : 1  number of execution unit : 4 ALU 
    addq    $4, %rax
    cmpq    $4096, %rax
    jne .L16

AVX2:

for( i = 0 ; i < MAX1 ; i++){
    for(j = 0 ; j < MAX2 ; j += 8){
        a0_i = _mm256_add_epi32(_mm256_load_si256((__m256i *)&a[i][j]),
                                _mm256_load_si256((__m256i *)&b[i][j]));
        _mm256_store_si256((__m256i *)&c_result[i][j], a0_i);
    }
}

.L22:
    vmovdqa (%rcx,%rax), %ymm0           // latency : 3 , throughput : 0.5      number of execution unit : 4 ALU
    vpaddd  (%r8,%rax), %ymm0, %ymm0     // latency : dont know , throughput : 0.5  number of execution unit : 3 VEC-ALU
    vmovdqa %ymm0, c_result(%rdx,%rax)   // latency : 3 , throughput : 1    number of execution unit : 4 ALU
    addq    $32, %rax
    cmpq    $4096, %rax
    jne .L22
  • Memory alignment is 32 bytes, the L1D cache line size is 64 bytes and it's 8-way; I'm still researching. But I need help from a professional. Yeah, I know it's Sunday. – ADMS Mar 27 '16 at 12:59
  • Have you tried IACA yet? It didn't do Skylake, last I looked, but its results on Haswell might help. Also, check out [Agner Fog's instruction tables.](http://www.agner.org/optimize/instruction_tables.pdf) – jbapple Mar 27 '16 at 13:48
  • Your code is probably not compute bound but memory bound. You can't get faster than your memory bus can provide data. – Jens Gustedt Mar 27 '16 at 14:07
  • @JensGustedt I think there is some cache problem, but I don't know what it is. – ADMS Mar 27 '16 at 14:11
  • Is there anything being issued simultaneously that I don't realize? – ADMS Mar 27 '16 at 14:13
  • "IACA" stands for "Intel Architecture Code Analyzer". – jbapple Mar 27 '16 at 14:15
  • @jbapple Thank you I will try it. – ADMS Mar 27 '16 at 14:15
  • @Amir: It's bundled with IACA, of course. IACA is closed source, IDK why you'd expect to find it on github specifically, not google. In asm, use `mov $111, %ebx` / `.byte 0x64, 0x67, 0x90` for IACA start, and the same with `$222` for IACA end. In 32bit mode, that's an illegal instruction (intentionally: clobbering ebx will break your code). In 64bit mode, it's not. (The macros expand to something else for 64bit, but `iaca` still recognizes those marks in 64bit code. So in hand-written ASM you can usually arrange things so you can leave the marks in while testing). – Peter Cordes Mar 27 '16 at 14:40
  • @PeterCordes Does IACA support Skylake? I don't think so... – ADMS Mar 27 '16 at 17:55
  • @Amir: nope, it's been abandoned since Haswell :( Fortunately it's still usable, since BDW/SKL didn't make major changes to the things that IACA takes into account. Some latency changes, like FMA being only 4c, and add happening in the FMA unit, and some other latency / port changes may matter for some code, but the general picture of analyzing the critical path to help you grok what's going on hasn't changed. – Peter Cordes Mar 27 '16 at 18:10
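
For reference, the marks described in the comment above can be emitted from C roughly like this. This is only a sketch based on that comment, not code from the question: the iacaMarks.h header that ships with IACA provides equivalent IACA_START / IACA_END macros, and add_row here is a hypothetical wrapper around the inner loop, just to show where the marks go.

#include <immintrin.h>

/* IACA looks for a mov of 111 (start) or 222 (end) into ebx followed by the
 * bytes 0x64, 0x67, 0x90, as described in the comment above. */
#define IACA_START __asm__ volatile("movl $111, %%ebx\n\t.byte 0x64, 0x67, 0x90" ::: "ebx", "memory")
#define IACA_END   __asm__ volatile("movl $222, %%ebx\n\t.byte 0x64, 0x67, 0x90" ::: "ebx", "memory")

void add_row(int *c, const int *a, const int *b, int n)
{
    IACA_START;
    for (int j = 0; j < n; j += 8) {
        __m256i v = _mm256_add_epi32(_mm256_load_si256((const __m256i *)&a[j]),
                                     _mm256_load_si256((const __m256i *)&b[j]));
        _mm256_store_si256((__m256i *)&c[j], v);
    }
    IACA_END;
}

Compile this to an object file (e.g. with -O2 -mavx2 -c) and run iaca on it; the analysis then covers only the code between the two marks.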

2 Answers


Other than the loop counter, there's no loop-carried dependency chain. So operations from different loop iterations can be in flight at once. This means latency isn't the bottleneck, just throughput (of execution units, and the frontend (up to 4 fused-domain uops per clock)).
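
To illustrate the difference, here is a quick sketch (not the OP's code; add_arrays and sum_array are made-up names). In the first function the iterations share nothing but the counter, so many of them can be in flight at once; in the second, every add has to wait for the previous one, so latency rather than throughput sets the speed.

#include <stddef.h>

/* Independent iterations: no value flows from one iteration into the next,
 * so out-of-order execution can overlap many iterations. */
void add_arrays(int *c, const int *a, const int *b, size_t n)
{
    for (size_t j = 0; j < n; j++)
        c[j] = a[j] + b[j];      /* only the loop counter is loop-carried */
}

/* Loop-carried dependency: each add consumes the previous add's result,
 * so the loop can't run faster than one add-latency per iteration. */
int sum_array(const int *a, size_t n)
{
    int sum = 0;
    for (size_t j = 0; j < n; j++)
        sum += a[j];             /* serialized through `sum` */
    return sum;
}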

Also, your numbers are totally insane. mov loads don't take 4 ALU execution units! And the load/store latency numbers are wrong / meaningless (see the last section).

# Scalar  (serial is the wrong word.  Both versions are serial, not parallel)
.L16:
    movl    (%r9,%rax), %edx           // fused-domain uops: 1.  Unfused domain: a load port
    addl    (%r8,%rax), %edx           // fused-domain uops: 2   Unfused domain: a load port and any ALU port
    movl    %edx, c_result(%rcx,%rax)  // fused-domain uops: 2   Unfused domain: store-address and store-data ports.  port7 can't handle 2-reg addresses
    addq    $4, %rax                   // fused-domain uops: 1   unfused: any ALU
    cmpq    $4096, %rax                // fused-domain uops: 0 (fused with jcc)
    jne .L16                           // fused-domain uops: 1   unfused: port6 (predicted-taken branch)

Total: 7 fused-domain uops means the loop can issue from the loop buffer at one iteration per 2c. (not per 1.75c). Since we're using a mix of loads, stores, and ALU uops, execution ports aren't a bottleneck, just the fused-domain 4-wide issue width. Two loads per 2c and one store per 2c is only half the throughput of the load and store execution units.

Note that 2-register addressing modes can't micro-fuse on Intel SnB-family. This isn't a problem for pure loads, because they're 1 uop even without micro-fusion.

The analysis is identical for the vector loop. (vpaddd has a latency of 1c on Skylake, and almost every other CPU. The table doesn't list anything in the latency column for padd with a memory operand because the latency of the load is separate from the latency of the add. It only adds one cycle to the dep chain involving the register src/dest, as long as the load address is known far enough ahead of time.)


Agner Fog's store and load latency numbers are kinda bogus, too. He arbitrarily divides the total load-store round trip latency (with store-forwarding) into a latency number for load and for store. IDK why he didn't list load latency as measured by a pointer-chasing test (e.g. repeated mov (%rsi), %rsi). That shows you that Intel SnB-family CPUs have 4 cycle load-use latency.

I meant to send him a note about that, but haven't gotten around to it.
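
A pointer-chasing test of that kind looks roughly like this (my sketch, not Agner Fog's code; chase is a made-up name):

/* Each load's address is the previous load's result, so the loop can't run
 * faster than one load per load-use latency (about 4 cycles on SnB-family
 * when the chain stays in L1 and the addressing mode is simple). */
void *chase(void *p, long iters)
{
    while (iters--)
        p = *(void **)p;         /* compiles to a repeated mov (%reg), %reg */
    return p;                    /* return p so the chain isn't optimized away */
}

Timing a self-pointing `void *node = &node;` chased a few hundred million times and converting to core clock cycles per iteration gives the load-use latency directly.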


You should be seeing an AVX2 speedup of 32/4, i.e. 8x. Your problem size is only 4096B, which is small enough for three arrays of that size to fit in L1 cache. (EDIT: the question was misleading: the loop shown is the inner loop of a nested loop. See the comments: apparently even with 4k arrays (not 4M), OP was still only seeing a 3x speedup (vs. 1.5x with 4M arrays), so there's some kind of bottleneck in the AVX version.)

All 3 arrays are aligned, so the problem isn't a cache-line split in the one memory operand that doesn't require alignment (the `vpaddd` load from `(%r8,%rax)`).

My other theory on that doesn't seem very likely either, but are your array addresses offset from each other by exactly 4096B? From Agner Fog's microarch PDF:

"It is not possible to read and write simultaneously from addresses that are spaced by a multiple of 4 Kbytes."

The example shows a store then load, though, so IDK if that truly explains it. Even if the memory-ordering hardware thinks the load and store might be to the same address, I'm not sure why that would stop the code from sustaining as many memory ops, or why it would affect the AVX2 code worse than the scalar code.

It's worth trying offsetting your arrays from each other by an extra 128B or 256B or something.
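
For example (a sketch, not from the question; whether the pad objects actually end up between the arrays depends on how the linker lays things out, so it's worth printing the addresses to check):

/* Dummy globals inserted so the base addresses of a, b and c_result are no
 * longer separated by an exact multiple of 4096 bytes.  128 and 256 are the
 * arbitrary extra offsets suggested above. */
int  __attribute__((aligned(32))) a[MAX1][MAX2];
char pad_ab[128];
int  __attribute__((aligned(32))) b[MAX2][MAX3];
char pad_bc[256];
int  __attribute__((aligned(32))) c_result[MAX1][MAX3];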

  • Thank you. I meant that there are 4 ALUs that can issue `mov`, not that it takes 4 ALUs. – ADMS Mar 27 '16 at 15:56
  • @Amir: ALU = Arithmetic and Logic Unit. Skylake has ALUs on ports 0,1,5, and 6. `movl (%r9,%rax), %edx` is a pure load, and doesn't need an ALU. It only needs a load port, of which SnB-family CPUs have two. That's why its throughput is one per 0.5c. – Peter Cordes Mar 27 '16 at 16:13
  • See [64-ia-32-architectures-optimization-manual](http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html), page 34, table 2-2, row 1: Execution Unit: ALU, # of Units: 4, Instructions: add, and, cmp, or, test, xor, movzx, movsx, mov, (v)movdqu, (v)movdqa, (v)movap*, (v)movup*. So what does that mean? – ADMS Mar 27 '16 at 16:46
  • @Amir: It means that the reg-reg form of those instructions can run on all 4 ALU ports. – Peter Cordes Mar 27 '16 at 16:51
  • Well, so `movl (%r9,%rax), %edx` does not need an ALU? – ADMS Mar 27 '16 at 16:56
  • @Amir: Correct. That's why Agner Fog's tables don't list any ALU port, only load ports, for the `mov r,m` form. – Peter Cordes Mar 27 '16 at 16:57
  • I think the throughput of the frontend in Skylake is up to 6 fused-domain uops per clock instead of 4. So is this the same when using AVX/AVX2? Does AVX affect cache load and store latency (when memory is aligned to 32 bytes)? – ADMS Mar 28 '16 at 00:04
  • @Amir: Then you're wrong: it can only issue 4 fused-domain uops per clock. Maybe you're thinking of 6 insns per clock, which is possible when the 4 uops include two macro-fused compare-and-branch pairs? Read the skylake section in Agner Fog's microarch pdf. **re: load/store throughput**: It's really hard to measure store latency on its own. I guess you'd define it as how long a store keeps taking a ROB entry? But store->load forwarding latency is 4c for 32b or 64b operands, 5c for all other sizes (including 256b AVX operands). – Peter Cordes Mar 28 '16 at 04:02
  • Since you have a skylake, you could write a loop that bottlenecks on the latency of pointer-chasing via `vmovdqa (%rax), %ymm0`, `vmovq %ymm0, %rax`. (And subtract the 2c latency of the `vmovq`, up from 1c in broadwell :/) – Peter Cordes Mar 28 '16 at 04:06
  • You're right, but in 64-ia-32-architectures-optimization-manual -> page 32 -> figure 2-1, the legacy decode pipeline (LDP) is 5 uops/cycle, the DSB is 6 uops/cycle, and the MSROM is 4 uops/cycle. So what does that mean? And does AVX reduce the LDP or DSB uops/cycle? – ADMS Mar 28 '16 at 04:53
  • Actually I'm comparing `a0_i= _mm256_add_epi32( _mm256_load_si256((__m256i *)&a[i][j]) , _mm256_load_si256((__m256i *)&b[i][j])); _mm256_store_si256((__m256i *)&c_result[i][j], a0_i);` and `c_result[i][j]= a[i][j] + b[i][j];`. All arrays are aligned to 32 bytes. The assembly output is in the main question. – ADMS Mar 28 '16 at 04:58
  • @Amir: It means decode width > issue width, to help keep the decoded uop buffer from emptying, so more cycles can issue the full 4 uops into the out-of-order pipeline. Did you even read Agner Fog's Skylake writeup like I keep telling you to if you want to understand this stuff? – Peter Cordes Mar 28 '16 at 05:06
  • Yeah, I read it at page 217. I know what you're saying, but the Intel manual confused me. Here it is: 64-ia-32-architectures-optimization-manual -> page 33: "The front end in the Skylake microarchitecture provides the following improvements over previous generation microarchitectures: Legacy Decode Pipeline delivery of 5 uops per cycle to the IDQ compared to 4 uops in previous generations. The DSB delivers 6 uops per cycle to the IDQ compared to 4 uops in previous generations." – ADMS Mar 28 '16 at 05:11
  • @Amir: ok. When you talk about the width of an out-of-order CPU, it usually means the width of the narrowest part, typically the issue/retirement width of the out-of-order core. The highest *sustainable* throughput for the whole pipeline. Actually sustaining it might require a special mix of instructions to not bottleneck on some other part (e.g. Core2 to IvB were 4 wide but with only 3 ALU ports, so loads and stores had to be in the mix.) The numbers you're looking at just feed the IDQ. It's weird that diagram doesn't label the arrow from IDQ into allocate/rename/... as 4uops/c. – Peter Cordes Mar 28 '16 at 05:59
  • Being able to decode more than 4 uops per clock, or fetch a full uop cache line (6 uops) per clock, helps balance out cycles where fewer than 4 uops are decoded / fetched because of bubbles in the frontend caused by uop cache-line boundaries, or a multi-uop instruction hitting the decoders not as the first instruction of a decode block. So it's normal that some of the fetch and decode widths are greater than the pipeline width. – Peter Cordes Mar 28 '16 at 06:02
  • I've added the C code to the question, and I just measured the speedup: it's 1.5. What is happening? (gcc -O2) – ADMS Mar 28 '16 at 06:07
  • @Amir: What are `MAX1` and `MAX2`? And what's `MAX3`? Are you bottlenecking on memory bandwidth? How many repeats of the timing loop do you do? [Is it enough for the AVX2 unit to fully warm up](http://www.agner.org/optimize/blog/read.php?i=415)? Because Skylake runs 256b instructions slowly for hundreds of thousands of cycles, or something like that. – Peter Cordes Mar 28 '16 at 06:10
  • `#define MAX1 1024` `#define MAX2 1024` `#define MAX3 1024` – ADMS Mar 28 '16 at 06:11
  • 500000 repeats to measure the best time – ADMS Mar 28 '16 at 06:16
  • So that asm you showed is just the inner loop of a pair of nested loops? *Each* array is actually `4MiB = 4B*1024*1024`, not `4kiB = 4B*1024`. That's too big for L3 cache, so you're bottlenecking on main memory, duh. Even `MAX=256` is 256kiB for each array, so they still don't fit in L2. L3 cache is faster than DRAM, but nowhere *near* as fast as L1. Of course you're not seeing anything like an 8x speedup with giant arrays and that low a ratio of computation to data transfer. uop counting and all that is irrelevant when you're probably getting ~1 uop per clock. Use `perf`. – Peter Cordes Mar 28 '16 at 06:22
  • Thank you, but I changed the MAX values to 32, 64, 128, 256 and the best speedup is still less than 5. I'm evaluating AVX; I can not use `pref`. – ADMS Mar 28 '16 at 06:34
  • @Amir: For AVX2, make sure it has plenty of time to warm up and decide to power on the full 256b execution units. (See the Agner Fog blog post I edited into a comment after you'd already replied to it.) 500k repeats might be enough, but IDK. It's probably good if the whole timing run takes ~1 or 2 seconds just to make sure it's not affected by any warmup issues. Also, if gcc isn't flattening the loops out into one big loop, see if it makes a difference if you use one-dimensional arrays to get a single loop instead of nested loops. – Peter Cordes Mar 28 '16 at 06:35
  • The Linux `perf` command for reading performance counters has no problem with AVX. Use it to see how many uops per clock you're getting, to see if it's what we expect. If not, there's another bottleneck. – Peter Cordes Mar 28 '16 at 06:37
  • @Amir: It's `perf`, not `pref`. But yes, performance counter measurements combined with microarchitecture docs are by far the best way to get to the bottom of any performance mysteries. – Peter Cordes Mar 28 '16 at 06:51
  • Yes, but it will kill me; I have many other programs, from matrix operations to image processing algorithms, that I have implemented with AVX and AVX2 and that I need to analyze. – ADMS Mar 28 '16 at 06:58
  • @Amir: just analyze this one for starters, and you can probably apply what you learned to other loops written similar ways. Or if you have any kind of test harness, you can quickly compile and test-run any of your loops to make sure it runs at a reasonable number of uops per clock. – Peter Cordes Mar 28 '16 at 07:02

The following limitations restrict the performance of the two implementations.

First, other than the loop counter, there is no loop-carried dependency chain, so operations from different loop iterations can be performed at once. This means latency isn't the main bottleneck, although latency is still an important factor in HPC. Since the latencies are roughly equal, the throughput of the execution units matters more for both implementations. IACA reports the throughput bottleneck for the scalar implementation as "Inter-Iteration", which means there is a dependency between consecutive iterations of the loop, and vectorization helps make the code run faster. Furthermore, vpaddd in the vectorized version can be issued on ports 5 and 1, while add can use execution ports 1, 5 and 6 when port 0 is busy in the first cycle.

Second, the fused-domain throughput of the front-end may affect the performance, but in this algorithm, according to the IACA results, both implementations need 7 uops per iteration, and the HSW/SKL microarchitecture can issue at most 4 fused-domain uops per clock. So each iteration of the inner loop needs 2 cycles, and this limitation hurts the AVX2 implementation more than the scalar one.

Third, the data footprint of the algorithm causes many cache misses. After reducing the size of the matrices so that they fit into the L1D (first-level data cache), the speedup becomes a factor of 5 (I tested many times and got 5, but after testing again the speedup was 7.3).
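
As a quick sanity check on those sizes (my own arithmetic, not part of the original answer, and N is just an illustrative dimension): Skylake's L1D is 32 KiB per core, so all three int matrices fit only while 3 * N * N * sizeof(int) <= 32 KiB, i.e. N <= 52.

#define N 32   /* 32x32: 3 * 32 * 32 * 4 B = 12 KiB, well inside the 32 KiB L1D */
_Static_assert(3u * N * N * sizeof(int) <= 32u * 1024u,
               "working set does not fit in Skylake's 32 KiB L1D");
/* 64x64 would already be 48 KiB and spill into L2. */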

  • Interesting that you got a factor of 5 speedup, not 8, since the latencies and uops are the same for scalar vs. AVX2. Also note that IACA's `total` is unfused-domain uops, which isn't a useful thing to total. (e.g. xor-zeroing and eliminated moves are counted as zero). In your case, the answer is the same because none of your uops can micro-fuse, only macro-fuse. – Peter Cordes Mar 29 '16 at 08:54
  • Anyway, Intel's optimization manual, in Section 2.1.3, gives a table of peak vs. sustained throughput for L1, L2, etc on Skylake. Skylake can only sustain ~81B/cycle total to/from L1D cache. (The Haswell table doesn't have that column. IDK if that means sustained=peak or not). However, I just realized **this doesn't explain anything about scalar vs. vector for your loop, because the frontend limits your code to 96B per 2 cycles**. I thought for a minute I'd found an explanation, but I guess not. – Peter Cordes Mar 29 '16 at 08:58
  • vpaddd in the vectorized version can be issued on ports 5 and 1, but add uses execution ports 1, 5, 6 when port 0 is busy in the first cycle; I just added this to the answer. – ADMS Mar 29 '16 at 09:04
  • So what? Your loop executes one `add` or `vpaddd` per two clocks, plus another `add` as part of the loop overhead. There's never any contention for execution units, except maybe for ports 2 and 3, since two-register store addresses need their AGU for store-address uops. port 7 can only handle simple effective addresses. – Peter Cordes Mar 29 '16 at 09:07
  • AVX needs to load/store 32B from L1D; does 81B per cycle make sense? Because I read somewhere that previous CPUs can only load/store 128 bits per cycle, and AVX instructions need 2 cycles to load their data from L1D, but I don't know where 81B comes from. – ADMS Mar 29 '16 at 09:15
  • I don't know what limits sustained throughput to 81B either. Presumably it's an experimental measurement. The peak throughput is still listed as the expected 96B per cycle (2x32B load, 32B store). Intel CPUs since Haswell have had 256b data paths in the cache subsystem. – Peter Cordes Mar 29 '16 at 09:18
  • Yes, I mean that if operations from different iterations are issued together, then the scalar version has more ports available than AVX2. Because when I use the float data type, almost all the execution ports are the same: port 6 can not issue float, and the other ports issue it as an AVX scalar instruction instead of x87 floating point. So my speedup on float is a factor of 8, and when the data fits in L1D the speedup is up to 8, sometimes 10 or 11. I think port 6, which contains an integer ALU, is doing something to execute `add`s faster than `vadd`s. – ADMS Mar 29 '16 at 09:23
  • I think 81B is real, because when all of ports 2, 3, 4 are busy, only port 7 can be used for store addresses, although ports 2 and 3 can generate store addresses when possible. – ADMS Mar 29 '16 at 09:26
  • Execution units are fully pipelined, so they aren't "occupied" until an insn retires or anything like that. Neither of your loops put any significant pressure on any ALU execution ports. In dense code with a higher ratio of computation to memory access, then yes, it matters that `add` has a throughput of one per 0.25c vs. `vpaddd` only being able to execute to ports 0/1/5. Also: re: 81B/clock, good point: store-address uops stealing port 2,3 cycles does happen, and that's probably what limits the sustained throughput. IDK if that's what you were trying to say. – Peter Cordes Mar 29 '16 at 09:31
  • My float program's speedup is 11, using AVX for the same algorithm on a 128x128 matrix. I mean the differences between these could be explained if AVX uops were slower than integer ones (load/store/add etc.), but nowhere can I see this, except that the Intel intrinsics guide's latencies are different from other documentation. – ADMS Mar 29 '16 at 09:39
  • Over x87 floating point? For this algorithm? I don't think there's anything particularly bad with x87 for load/add/store. Check it out yourself with IACA and/or perf counters. – Peter Cordes Mar 29 '16 at 09:43
  • No, I meant that for the floating point data type, both the scalar and the AVX versions use AVX instructions, because that's better than using x87. But for integers, one thing that I think can explain this speedup of 5 is that vector instructions are much slower than integer instructions issued on the INT ALU. I mean the VEC ALU is slower than the INT ALU; in total, vector instructions, even for scalar operations, are slower than integer instructions that can only be used on scalar 32- or 64-bit data. – ADMS Mar 29 '16 at 09:49
  • As I tested, with `gcc -O2` x87 instructions are used only for `long double`; all `float` and `double` data use the VEC ALUs. – ADMS Mar 29 '16 at 10:07
  • I got a 7.3x speedup for a 32x32 matrix, so the speedup of 5 was degraded by cache misses. I think load and store instructions restrict this loop so the speedup is not a factor of 8: AVX loads 256 bits and stores 256 bits, so the data path is fully used in an iteration, even though it takes 2 cycles; and maybe store-address generation slightly reduces the performance, whereas with 32-bit loads it does not cause any problem. – ADMS Mar 29 '16 at 11:42