I'm looking at an example from Agner Fog's Optimizing Assembly guide. He tests:

Example 12.6b. DAXPY algorithm, 32-bit mode

L1:
    movapd xmm1, [esi+eax]   ; X[i], X[i+1]
    mulpd  xmm1, xmm2        ; X[i] * DA, X[i+1] * DA
    movapd xmm0, [edi+eax]   ; Y[i], Y[i+1]
    subpd  xmm0, xmm1        ; Y[i]-X[i]*DA, Y[i+1]-X[i+1]*DA
    movapd [edi+eax], xmm0   ; Store result
    add eax, 16              ; Add size of two elements to index
    cmp eax, ecx             ; Compare with n*8
    jl L1                    ; Loop back

and he estimates ~6 cycles/iteration on a Pentium M. I tried to do the same thing on my CPU, an Ivy Bridge, and measured 3 cycles per iteration, but from my computations on paper it should be possible to get 2 cycles. I don't know whether I made a mistake in the theoretical computation or whether the loop can be improved.

What I did:

From http://www.realworldtech.com/sandy-bridge/5/ I know that my CPU can retire 4 micro-ops per cycle, so retirement is not the bottleneck.

cycles per iteration = #fused-domain uops / 4 = 8 / 4 = 2
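
Here's how I counted (my assumption: every one of the 8 instructions issues as a single fused-domain uop):

    movapd xmm1, [esi+eax]   ; 1 fused-domain uop (load)
    mulpd  xmm1, xmm2        ; 1 uop
    movapd xmm0, [edi+eax]   ; 1 uop (load)
    subpd  xmm0, xmm1        ; 1 uop
    movapd [edi+eax], xmm0   ; 1 uop (store), assumed micro-fused
    add    eax, 16           ; 1 uop
    cmp    eax, ecx          ; 1 uop
    jl     L1                ; 1 uop
                             ; total: 8 fused-domain uops / 4 per cycle = 2 cycles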

So 2 cycles per iteration is our current bottleneck. Let's check the other possibilities:

The micro-ops follow the pattern 1-1-1-1-1-1-1-1, and according to Agner Fog my CPU's decoders handle a 1-1-1-1 pattern (among others). From that we can see it is possible to decode the instructions in 2 cycles, so decoding is not a bottleneck. Moreover, SnB-family CPUs have a uop cache, so neither fetching nor decoding should be the bottleneck.

The loop's instructions total 32 bytes, so they fit into one uop-cache window (32 bytes).

From my experiments, adding a nop instruction increases the number of cycles per iteration (by approx. 0.5 cycles).

So, the question is:

Where is that ONE cycle? :D

Gilgamesz

1 Answer

Are you sure you're not limited by memory bandwidth? You need to test with arrays that fit in L1 cache, because you're depending on two loads and one store per 2 clocks. (That's more than half of IvB's theoretical max of two 128b memory ops per clock, with at most one of them being a store.)
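
For example, size the buffers so that both arrays together stay well under IvB's 32 KiB L1D (the names and sizes here are just an illustration):

    section .bss
    align 16
    X: resq 1024             ; 1024 doubles = 8 KiB
    Y: resq 1024             ; X + Y = 16 KiB total, comfortably inside 32 KiB L1D

Then loop over the same arrays many times inside the timed region, so the data stays hot in L1 and startup effects are amortized.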


Decoding can't be relevant because your loop fits in the loop buffer (it's fewer than 28 uops), so it just issues in groups of 4 already-decoded uops.


Your fused-domain uop counts are wrong. cmp/jl can macro-fuse into one compare-and-branch uop. However, that mistake is balanced by another mistake due to something that's not in Agner Fog's guide (yet).

movapd [edi+eax], xmm0 can't micro-fuse on IvB, so it's 2 fused-domain uops.

SnB-family CPUs can only micro-fuse memory operands that don't use an index register. I recently found official confirmation in Intel's optimization manual that explains the different results from Agner's testing vs. my testing: such addressing modes can micro-fuse in the decoders and uop cache, but not in the OOO core. (See my answer on that question. I should send Agner another message to let him know that Intel's docs sorted out our confusion...)
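
With those two corrections, the count for your original loop on IvB works out like this (a sketch applying the rules above):

    movapd xmm1, [esi+eax]   ; 1 fused-domain uop (loads don't need to micro-fuse)
    mulpd  xmm1, xmm2        ; 1 uop
    movapd xmm0, [edi+eax]   ; 1 uop
    subpd  xmm0, xmm1        ; 1 uop
    movapd [edi+eax], xmm0   ; 2 uops: the indexed store can't micro-fuse on IvB
    add    eax, 16           ; 1 uop
    cmp    eax, ecx          ; \ macro-fuse into
    jl     L1                ; / 1 compare-and-branch uop
                             ; total: 8 fused-domain uops, same as your count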

Try this:

    add    ecx, edi          ; end-of-Y pointer
    sub    esi, edi          ; esi = X-Y
L1:
    movapd xmm1, [esi+edi]   ; 1 uop
    mulpd  xmm1, xmm2        ; 1 uop
    movapd xmm0, [edi]       ; 1 uop
    subpd  xmm0, xmm1        ; 1 uop
    movapd [edi], xmm0       ; 1 micro-fused uop
    add edi, 16              ; 1 uop
    cmp edi, ecx             ; loop while (dst < end_Y)
    jb L1                    ; cmp+jb = 1 macro-fused uop

Loads don't need to micro-fuse, but stores are 2 fused-domain uops. (store-address and store-data).

IACA would have told you that the store was 2 uops and couldn't micro-fuse. It's worth having a look at. Sometimes its numbers are wrong (e.g. it thinks shrd is still slow on SnB), but often it's useful as long as you realize it's a simplistic approximation to the real hardware behaviour, and not a cycle-accurate simulator.
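
For example, you'd bracket the loop with IACA's start/end markers (the magic-byte sequences below are the ones from IACA's iacaMarks.h, hand-coded here for NASM):

        mov ebx, 111         ; IACA start marker
        db  0x64, 0x67, 0x90 ; fs addr32 nop
    L1:
        ; ... loop body from above ...
        jb  L1
        mov ebx, 222         ; IACA end marker
        db  0x64, 0x67, 0x90

Assemble that and run the iaca tool on the resulting object file with its architecture switch set to IVB to get the throughput analysis.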


My version is 7 total fused-domain uops, so it should run at one iteration per 2 clocks on SnB-family CPUs. Your original was 8 uops, so this change shouldn't make any difference. I wrote it before noticing that you didn't account for macro-fusion of the cmp/jcc, so I was thinking your loop was actually 9 uops. Since adding a single nop slows your code down, that's additional evidence that it's 8 fused-domain uops. If cache misses from testing with arrays that are too large don't explain it, then maybe IvB is doing badly at scheduling the load/store uops somehow? That seems unlikely, since load and store-address uops all must use port 2 or 3. (In the unfused domain, store-data uops go to port 4.)
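
For reference, here's roughly where the unfused-domain uops of the 7-uop version have to go, per Agner Fog's IvB instruction tables (the port assignments are my reading of those tables, not something I've measured for this loop):

    movapd xmm1, [esi+edi]   ; load:          p2 or p3
    mulpd  xmm1, xmm2        ; FP multiply:   p0
    movapd xmm0, [edi]       ; load:          p2 or p3
    subpd  xmm0, xmm1        ; FP add/sub:    p1
    movapd [edi], xmm0       ; store-address: p2 or p3, store-data: p4
    add    edi, 16           ; ALU:           p0, p1, or p5
    cmp    edi, ecx          ; \ macro-fused cmp+jb
    jb     L1                ; /  (taken branch): p5
    ; 3 load/store-address uops over 2 AGU ports fit easily in 2 cycles (4 slots)

So there's no obvious port conflict that would push it to 3 cycles.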

Are you sure you really got one iteration per 3 cycles with your loop? It doesn't make sense that adding a nop could slow it down beyond that, because a 9 uop loop should issue in 3 cycles.

Peter Cordes
  • 1. Thanks! :) 2. You're obviously right. I tested my code with 2048 and 4096 array elements (so it fits into L1) and I do in fact get ~2 cycles per iteration. 3. "If cache misses from testing with arrays too large doesn't explain it": they do explain it. When I make the array bigger, the number of cycles per iteration increases. 4. I still have a problem with what exactly it means that uops are fused. I know what it means literally, that some uops are joined. But how? – Gilgamesz Apr 25 '16 at 08:17
  • Agner Fog's microarch guide and Intel's optimization manual should have a decent explanation of uop micro and macro fusion. In Agner Fog's microarch pdf, he goes into more detail about fusion when describing it for the first CPU to implement it: Pentium M. See page 90/91: "Section 7.6 Micro-op fusion" is about 2 pages long. I'm not trying to reproduce complete explanations for everything in every answer; that would take forever. That's why I just linked to my other answer about micro-fusion of indexed addressing modes on SnB-family CPUs. Of course you need to read the link. – Peter Cordes Apr 25 '16 at 08:33
  • Thanks. I'm not asking you to explain everything to me; links and other references are fine. – Gilgamesz Apr 25 '16 at 08:36