
I've done quite a bit of thread-level and process-level parallelism, and now I'm trying to get into instruction-level parallelism with the Intel C++ Compiler, which is proving quite a challenge.

While auto-vectorizing some loops and analysing the compiler logs, I found an "Estimate of max trip count of loop" remark that I can't quite figure out.

Example:

double a[100], x[100], y[100];
...
for (i = 0; i < 100; i++) {
    a[i] = x[i] + y[i];
}

For this loop the compiler reports an estimated max trip count of 12 trips. I read somewhere that the vectorisation process can handle a total of 8 elements per trip, as long as the cost of each cycle is less than 6 u-operations; from what I can tell, this example loop has a cost of 1 store, 2 reads and 1 arithmetic operation.

So in theory, my trip count should be 100/8 = 12.5 trips, and therefore 13 trips.

Is this a round up made by the compiler? Or is there any other optimization going on in the background that allows the process to take less than 13 trips?

One more question, is my 6 u-operations per cycle assumption correct? Are there any cases when this does not apply?

Thanks in advance

  • Would you mind adding a. code I can compile, b. the command line you use to compile, c. OS/compiler version information, d. the diagnostics that you are referring to? NB: Both vectorization and vectorisation are correct (the first being a bit more common imho), just try to stick to one :) – marc Oct 23 '15 at 14:14
  • What hardware is this for? – Z boson Oct 26 '15 at 08:45

2 Answers


Rather than get caught up in how Intel implements each loop, let's try to answer your question about instruction-level parallelism.

Your operation is limited by the reads and writes, so you can ignore the arithmetic when determining the number of cycles. Here is what Core2 through Broadwell can do:

Core2:   two 16 byte reads and one 16 byte write per 2 clock cycles -> 24 bytes/clock cycle
SB/IB:   two 32 byte reads and one 32 byte write per 2 clock cycles -> 48 bytes/clock cycle
HSW/BDW: two 32 byte reads and one 32 byte write per clock cycle    -> 96 bytes/clock cycle

The total number of bytes being read and written is sizeof(double)*100*3=2400. So a quick estimate of the time it will take is

Core2:   2400/24 = 100 clock cycles
SB/IB:   2400/48 =  50 clock cycles
HSW/BDW: 2400/96 =  25 clock cycles

Now the question is how to implement this for full bandwidth.

For Core2 through Ivy Bridge one of the loads can be fused with the addition to cost one micro-fused micro-op. The other load costs one micro-op and the store one micro-op. If you want to do this every iteration you also need to decrement a pointer and do a conditional jump. Since Nehalem, these two can macro-fuse, so the total number of micro-fused/macro-fused operations per iteration is:

                  Core2   Nehalem through Broadwell
vector add + load     1           1
vector load           1           1
vector store          1           1
scalar add            1           ½
conditional jump      1           ½
-------------------------------------------------
total                 5           4

For Core2 through Ivy Bridge either both loads need the same port, or the load and store need the same port. This requires two clock cycles. For Haswell/Broadwell it's possible to do this every clock cycle. However, due to limitations on port 7, only statically allocated arrays can achieve this, using absolute 32-bit address + offset addressing (which, incidentally, is not possible on OSX). So for Haswell/Broadwell, if the array is not statically allocated, you either have to unroll the loop to do your operation every clock cycle, or it takes 1.5 clock cycles per iteration. Here is a summary of the clock cycles per iteration for each processor:

Core2:   5 fused micro-ops/every two clock cycles
SB/IB:   4 fused micro-ops/every two clock cycles
HSW/BDW: 4 fused micro-ops/every clock cycle for statically allocated arrays
HSW/BDW: 4 fused micro-ops/every 1.5 clock cycles for non-statically allocated arrays

If you use stack-allocated arrays you can probably safely read past the end of the buffer. Otherwise you should pad your arrays to the SIMD width. The number of iterations of the loop is then:

SSE2: (100+1)/2 = 51
AVX:  (100+3)/4 = 26

The Intel compiler in my experience unrolls twice, which would halve the number of iterations. The number of iterations when unrolling twice is:

SSE2: (100+3)/4 = 26
AVX:  (100+7)/8 = 13

Finally, in terms of clock cycles it's

Core2:     51*2   = 102 clock cycles
SB/IB:     26*2   =  52 clock cycles
HSW/BDW:   26*1.5 =  39 clock cycles for non-statically allocated arrays no-unroll
HSW/BDW:   26*1   =  26 clock cycles for statically allocated arrays no-unroll
HSW/BDW:   26*1   =  26 clock cycles with full unrolling
Z boson
  • Port7 can handle single-register addressing modes. gcc can compile array references in a loop to incrementing pointers, instead of using [r + r*scale] addressing modes. So haswell and later CPUs are fine. Also note that SnB/IvB can only achieve 2*16B read + 16B write per clock cycle if some of the operations are 32B. Oh, the OP is using `double`, not an integer type, so AVX1 without AVX2 will be able to do the 32B ALU op. – Peter Cordes Oct 27 '15 at 14:44
  • @PeterCordes, what exactly do you mean by "gcc can compile array references in a loop to incrementing pointers, instead of using [r + r*scale] addressing modes"? I'm a bit out of practice with assembly and addressing modes. – Z boson Oct 27 '15 at 14:49
  • @PeterCordes, okay I think I remember this issue now. The problem is I have three arrays so I either use three different base pointers and increment an index register. Or I don't use an index register and instead I have to increment each separate base pointer. The first method requires one scalar add and the second solutions requires three scalar adds. The second method requires six micro-ops per iterations and the max is four. That's the problem. But maybe it's still more efficient than not being able to use port7. I think I tried this and found it was worse but it's been over a year now... – Z boson Oct 27 '15 at 15:00
  • @PeterCordes, I'm referring to the triad function I did [here](http://stackoverflow.com/questions/25899395/obtaining-peak-bandwidth-on-haswell-in-the-l1-cache-only-getting-62). In the OP's case he only has two base pointers, so I guess it only needs five micro-ops per iteration. That's still more than 4, but maybe it's more efficient than not using port7. I did not think of that. – Z boson Oct 27 '15 at 15:02
  • https://goo.gl/VXdA86. `float sumf(float *x) { ... for(int i=0; i<2048; i++) sum += x[i]; }` compiles to a scalar loop that does `addss xmm0, DWORD PTR [rdi] / add rdi, 4 / cmp / jne`, comparing against the end address. You'd need gcc to unroll a bit to amortize the pointer increments. You could just increment the store pointer but use indexed addressing for the load pointers. If the offsets of your arrays relative to each other can be compile-time constants, you could use `[rdi]`, `[rdi + 8192]`, `[rdi + 65536]` or something as your three addresses. But avoid cache-bank conflicts! – Peter Cordes Oct 27 '15 at 15:05
  • You're right, loop overhead makes it hard to saturate L1 cache bandwidth in practice without unrolling. – Peter Cordes Oct 27 '15 at 15:09
  • @PeterCordes `[rdi], [rdi + 8192], [rdi + 65536]` that's a great idea! I did not think of that. But I don't know what the code you posted above is referring to or what I should infer from it. Neither the triad function or the OPs function are doing a reduction. – Z boson Oct 27 '15 at 15:10
  • @PeterCordes, the whole point in my triad question was **not** to unroll on haswell. The point is that for core2 through ivy bridge you don't need to unroll to get almost 100% efficiency. – Z boson Oct 27 '15 at 15:12
  • @PeterCordes, the point of my triad question was do do the operation every iteration without unrolling. In order to do this it can only use 4 fused micro-ops total and it has to use port 7. – Z boson Oct 27 '15 at 15:13
  • @PeterCordes, oops, I said the OP's question only needed two base pointers. In fact it needs three, just like the triad function. The only difference is the OP's question only does an addition, whereas the triad function does addition and multiplication. – Z boson Oct 27 '15 at 15:14
  • My point in linking that code on godbolt was to show that gcc can turn a loop over array indices into a pointer-increment. A reduction is a simpler example because there's only one array. Also: re not unrolling: Last time I tested, my Sandybridge had a taken-branch throughput of one per 2 clocks in a loop, so any loop shorter than 8 uops might as well have been 8 uops. I *might* be misremembering this. I'm pretty sure Haswell can do a 4uop loop at one per clock, though. Agner Fog lists `jmp` as one per 2 clocks on SnB, one per 1-2 clocks on HSW, so that fits. – Peter Cordes Oct 27 '15 at 15:16
  • @PeterCordes, okay, I learned something from this. I could have used `lea r8, [rdx+rax]` and then `vmovaps [r8], ymm1`. That would have used port 7. This would require 5 micro-ops instead of 4, but it probably would have been more efficient that way. I'll have to redo my tests soon. – Z boson Oct 27 '15 at 15:23
  • That `[rdi]`, `[rdi + 8192]` hack is something I saw before in another form here on SO. It would be useful with a struct-of-arrays, if the arrays really were in a struct and had compile-time-const size. Maybe with a register holding the difference between the start of two arrays? Oh, I think it was a hack for using shorter instruction encodings with static arrays: `[rdi]`, `[rdi + B - A]`, or something like that. So it could use an 8-bit displacement instead of 32. Hmm, I can't quite reconstruct the original context where I saw this idea. – Peter Cordes Oct 27 '15 at 15:25
  • @PeterCordes, your `[rdi], [rdi + 8192], [rdi + 65536]` trick, sadly, won't help. The problem is that `rdi` will be an address and not an index, so I can't count down to zero with it. I would need a `cmp` instruction. This problem is much harder than it seems. I made a 500 point bounty for it and it got a lot of attention. The only solution I found was the one with static arrays using `[absolute 32-bit address + index]`. Perhaps the best compromise is to use an additional `lea` for the store. That's fine for ports, but it means 5 fused micro-ops and 4 is the max per clock cycle. – Z boson Oct 28 '15 at 07:52
  • Yeah, I'm aware that Intel's Core2 and onward designs have a 4-wide issue width, and that it's often the bottleneck in hand-tuned code. It's hard to do much in a 4uop loop, so I haven't been focusing on that in this discussion. It's not a lot of extra code size (or uop-cache usage) to unroll to an 8 or 12 uop loop. Then you aren't bottlenecked by `jmp` throughput on SnB/IvB. (although you're still bottlenecked by half-throughput for 32B loads/stores on those CPUs, not to mention probable cache-bank conflicts). Also, I've been talking about looping over arrays in general, not just triad. – Peter Cordes Oct 28 '15 at 08:54
  • One thought: the `vmovaps` load can use a 2-register addressing mode, because it doesn't have to micro-fuse. The ALU op with a memory operand, and the store, *do* have to micro-fuse to save fused-domain uops. I wonder if anyone's tested skylake to see if it can micro-fuse 2-register addressing modes, or if that's ever going to happen? As you say, it's highly inconvenient to use one-register addressing modes without unrolling. – Peter Cordes Oct 28 '15 at 09:00
  • @PeterCordes, so even if I use `lea` for the store, I still have the problem that the ALU op with the load uses a two-register addressing mode, so it won't fuse. And another `lea` won't help, obviously... so I think using a `lea` for the store is actually 6 fused micro-ops, since the `vfmadd231ps ymm1, ymm2, [rsi+rax]` can't fuse. – Z boson Oct 28 '15 at 09:38
  • @PeterCordes, I confirmed this in IACA. Nobody ever pointed out that `vfmadd231ps ymm1, ymm2, [rsi+rax]` was not fusing. That's another major problem. It was not just a port 7 problem. I only realized this at the end of my triad question, which is why I asked this question [micro-fusion-and-addressing-modes](http://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes), and it took a long time before somebody (you) confirmed it. I tried `lea` for the store but it only helps a little. The problem is the ALU + load does not fuse. – Z boson Oct 28 '15 at 11:28
  • @PeterCordes, I tried your `[rdi], [rdi + 8192], [rdi + 65536]`. It does help a lot more than `lea`. It's true I still need a `cmp` instruction, but now the `vfmadd231ps ymm1, ymm2, [rsi+offset]` fuses. The block throughput with your trick is 1.25. With `lea` it was 1.5. My original function was 1.55. So `lea` hardly helps but your trick helps a lot. However, the only version that gets a block throughput of 1.0 is with static arrays using `[32-bit address + index]`. – Z boson Oct 28 '15 at 11:29
  • @PeterCordes, I tried your `[rdi], [rdi + 8192], [rdi + 65536]`; it's maybe 1% faster. My original method got about 63.5% of the max, your method gets 64.3% of the max, and the lea method only gets about 54%. – Z boson Oct 28 '15 at 20:14
  • @PeterCordes, unrolling 16 times gets 94%. – Z boson Oct 28 '15 at 20:22

6 u-ops sounds like a correct estimate if one doesn't unroll. The Intel Compiler usually does a good job of unrolling auto-vectorized loops when necessary. In this particular case even a full unroll might make sense.

Not sure how you've got 8 elements at a time, because even with AVX you can fit just 4 double-precision values in a single 256-bit ymm register.

Regarding the trip count: even if you could do 8 elements at a time, it would be 12 trips and not 13, because the last few elements (which cannot be processed 8 at a time) are handled by scalar code.

So from the compiler point of view it will look like:

int i=0;
for(; i<(100 & ~7); i+=8) // 12 iterations
    // Do vector code

for(;i<100; ++i)
    // Process loop remainder using scalar code
Elalfer
  • In my experience Intel unrolls twice, so that could explain the factor of two (eight elements per iteration). – Z boson Oct 26 '15 at 09:59