
My CPU is IvyBridge. Let's consider an example from Agner Fog's Optimizing Assembly manual (chapter 12.6, example 12.10a):

    movsd xmm2, [x] 
    movsd xmm1, [one] 
    xorps xmm0, xmm0  
    mov eax, coeff    

    L1:
        movsd xmm3, [eax] ; c[i]
        mulsd xmm3, xmm1  ; c[i]*x^i
        mulsd xmm1, xmm2  ; x^(i+1)
        addsd xmm0, xmm3  ; sum += c[i]*x^i
        add   eax, 8
        cmp eax, coeff_end
        jb L1
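For reference, a C equivalent of that loop (my own sketch; the function and variable names are hypothetical, not from the manual):

```c
#include <stddef.h>

/* Evaluate the polynomial sum(c[i] * x^i) the same way the asm does:
 * one running power of x (xmm1) and one running sum (xmm0). */
double polyeval(const double *coeff, size_t n, double x)
{
    double sum = 0.0;               /* xmm0 */
    double xi  = 1.0;               /* xmm1, holds x^i */
    for (size_t i = 0; i < n; i++) {
        sum += coeff[i] * xi;       /* addsd chain, 3-cycle latency */
        xi  *= x;                   /* mulsd chain, 5-cycle latency */
    }
    return sum;
}
```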

The frontend is not a bottleneck (that's obvious given the latency of the multiplication).

We have two loop-carried dependency chains (I'm ignoring `add eax, 8`):

1. `mulsd xmm1, xmm2`, latency 5 cycles
2. `addsd xmm0, xmm3`, latency 3 cycles

Btw, I can't decide: should I sum the latencies (5 + 3 = 8 cycles) or take the greater one, i.e. 5 cycles?

I've tested it on a 10,000,000-element array. It takes 6.7 cycles per iteration according to perf, and 5.9 cycles per iteration according to Agner's tool.

Please explain why it takes 6 to 7 cycles instead of just 5.

Gilgamesz
    You can't ignore the cost of loading the data from RAM. – Hans Passant May 19 '16 at 22:00
    Can you try to write more useful question titles? "Loop too slow" doesn't tell anyone anything except that you should have used the [optimization] or [performance] tags. Some of your previous questions have had really terrible titles, too, like "Too much cycles". – Peter Cordes May 20 '16 at 05:20
    Looks like the longest loop-carried dep chain is the `mulsd xmm1, xmm2`. Next iteration's `mul` can start before this iteration's `addsd` finishes, because it doesn't depend on `xmm0`. The other instructions depend on that loop-carried mul chain, but they fork off a separate dep chain for each iteration (except for the `addsd`, which is also loop-carried, but lower latency). – Peter Cordes May 20 '16 at 05:46
  • Thanks :). I improved my post :) – Gilgamesz May 20 '16 at 18:47

1 Answer


Resource conflicts are probably sometimes delaying the loop-carried `mulsd`. The uop scheduler picks oldest-first (among uops whose inputs are ready), not critical-path-first, so `mulsd xmm3, xmm1` probably steals the execution port occasionally and delays the loop-carried `mulsd xmm1, xmm2` by a cycle.
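Shortening the critical chain also hides those conflicts. One common technique (sketched here in C rather than asm; the function is my own, not taken from Agner's manual) is to split the polynomial into even- and odd-index coefficients and stride the running power by x², so each loop-carried multiply covers two elements:

```c
#include <stddef.h>

/* Evaluate sum(c[i] * x^i) with two independent mul/add chain pairs.
 * Each chain's loop-carried multiply (p *= x2) now spans two source
 * iterations, effectively halving the 5-cycle latency per element. */
double polyeval2(const double *c, size_t n, double x)
{
    double x2 = x * x;
    double sum0 = 0.0, sum1 = 0.0;      /* two independent accumulators */
    double p0 = 1.0, p1 = x;            /* x^(2k) and x^(2k+1) */
    size_t i;
    for (i = 0; i + 1 < n; i += 2) {
        sum0 += c[i]     * p0;
        sum1 += c[i + 1] * p1;
        p0 *= x2;
        p1 *= x2;
    }
    if (i < n)                          /* odd element count: last term */
        sum0 += c[i] * p0;
    return sum0 + sum1;
}
```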

There are probably also occasional pipeline stalls from cache misses, since hardware prefetching doesn't work across page boundaries. Test for this by putting an outer repeat-loop around an inner loop over ~128 kiB of data (half the L2 cache size).

Prefetching should normally have no problem keeping up with your loop, though: one 64-bit load every 5 cycles is only about 1/5th of main-memory bandwidth.

You could also use `float` instead of `double` to halve the memory footprint.
Does your data have any denormals? Those are slow (but NaN isn't, with SSE).
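If you're not sure, a quick scan can tell you (hypothetical helper using C99's `fpclassify`):

```c
#include <math.h>
#include <stddef.h>

/* Return 1 if any element is subnormal ("denormal"): those trigger slow
 * microcode assists on many Intel CPUs, while NaN is full speed with SSE. */
int has_denormal(const double *a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (fpclassify(a[i]) == FP_SUBNORMAL)
            return 1;
    return 0;
}
```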

Peter Cordes