Unexpected slowdown from inserting a nop in a loop, and from reading near a movnti store

Question

I cannot understand why the first loop takes ~1 cycle per iteration while the second takes 2 cycles per iteration. I measured with Agner's tool and with perf. According to IACA it should take 1 cycle, and my own theoretical computation agrees.

This takes 1 cycle per iteration.

; array is an array defined in the data section
%define n 1000000
xor rcx, rcx   

.begin:
    movnti [array], eax
    add rcx, 1 
    cmp rcx, n
    jle .begin

And this takes 2 cycles per iteration. Why?

; array is an array defined in the data section
%define n 1000000
xor rcx, rcx   

.begin:
    movnti [array], eax
    nop
    add rcx, 1 
    cmp rcx, n
    jle .begin

This final version takes ~27 cycles per iteration. But why? After all, there is no dependency chain.

.begin:
    movnti [array], eax
    mov rbx, [array+16]
    add rcx, 1 
    cmp rcx, n
    jle .begin

My CPU is IvyBridge.

Avoid non-temporal stores if you are going to read the data soon after. The whole point of non-temporal stores is write-only bursts that don't cause lines to be "write allocated" (read) into the cache. Using the instruction tells the CPU "don't put this in the cache, just write it out to memory, I promise not to use it soon". – doug65536, May 08 '16 at 18:36
Related question from the same user about [why `movnti` in a loop isn't slower](http://stackoverflow.com/questions/37100450/too-fast-loop-why). – Peter Cordes, May 08 '16 at 18:41

Peter Cordes · Accepted Answer · 2016-05-08T18:28:05.470


movnti is 2 uops, and can't micro-fuse, according to Agner Fog's tables for IvyBridge.

So your first loop is 4 fused-domain uops (2 for movnti, 1 for add, and 1 for the macro-fused cmp/jle pair), and can issue at one iteration per clock.

The nop is a 5th fused-domain uop (even though it doesn't need an execution port, so it's 0 unfused-domain uops). This means the front end can only issue the loop at one iteration per 2 clocks.
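The issue-rate arithmetic above can be sketched in C. This is a rough model rather than a simulator. It assumes a front end that issues at most 4 fused-domain uops per clock and does not pack uops from two different iterations into one issue group. The function and constant names are hypothetical, chosen only for illustration; the uop counts are the ones quoted in this answer.

```c
/* Assumption: at most 4 fused-domain uops issue per clock, and the
   uops of one iteration are not grouped with those of the next. */
enum { ISSUE_WIDTH = 4 };

/* Fused-domain uop counts used in this answer (Ivy Bridge). */
enum {
    UOPS_MOVNTI  = 2,  /* cannot micro-fuse */
    UOPS_ADD     = 1,
    UOPS_CMP_JLE = 1,  /* cmp + jle macro-fuse into one uop */
    UOPS_NOP     = 1   /* issues, but needs no execution port */
};

/* Hypothetical helper: front-end clocks to issue one iteration,
   i.e. ceil(fused_uops / ISSUE_WIDTH). */
unsigned issue_clocks_per_iter(unsigned fused_uops)
{
    return (fused_uops + ISSUE_WIDTH - 1) / ISSUE_WIDTH;
}

/* Total fused-domain uops of the loop without and with the nop. */
unsigned loop_uops(int with_nop)
{
    unsigned n = UOPS_MOVNTI + UOPS_ADD + UOPS_CMP_JLE;
    return with_nop ? n + UOPS_NOP : n;
}
```

Under these assumptions, `issue_clocks_per_iter(loop_uops(0))` gives 1 and `issue_clocks_per_iter(loop_uops(1))` gives 2, matching the measured 1 and 2 cycles per iteration.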

See also the x86 tag wiki for more links to how CPUs work.

The 3rd loop is probably slow because mov rbx, [array+16] loads from the same cache line that movnti evicts. The eviction happens every time the fill-buffer that movnti is storing into gets flushed. (Apparently that is not on every movnti: it can rewrite some bytes in the same fill-buffer.)
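Whether a load touches the line that movnti writes comes down to simple address arithmetic. The sketch below assumes 64-byte cache lines and, as a further assumption not stated in the question, that `array` itself starts on a 64-byte boundary. The helper name is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumption: 64-byte cache lines. */
enum { CACHE_LINE_BYTES = 64 };

/* Hypothetical helper: do two byte offsets from a 64-byte-aligned base
   (assumed here for `array`) fall within the same cache line? */
bool same_cache_line(size_t offset_a, size_t offset_b)
{
    return (offset_a / CACHE_LINE_BYTES) == (offset_b / CACHE_LINE_BYTES);
}
```

With these assumptions, `same_cache_line(0, 16)` is true, so the load from `[array+16]` shares a line with the store to `[array]`, while `same_cache_line(0, 64)` is false, so a load from `[array+64]` would land in the next line.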

edited May 08 '16 at 18:28

answered May 08 '16 at 18:22

Peter Cordes



What is a fill-buffer? Yes, with `mov rbx, [array+64]` instead of `mov rbx, [array+16]` the loop is fast again. A cache line is just 64 bytes. – May 08 '16 at 22:30
@J.Doe. See my answer to your other `movnti` question; I included a link to an Intel article that talks about write-combining fill buffers. – Peter Cordes May 08 '16 at 22:35
Yes, I see it :) Thanks – May 08 '16 at 22:37
