global _start
section .text

%define n    100000

_start:
    xor rcx, rcx

    jmp .cond
.begin:
    movnti [array], eax     ; NT store to the same address every iteration

.cond:
    add rcx, 1
    cmp rcx, n
    jl .begin

    mov eax, 60             ; __NR_exit
    xor edi, edi
    syscall                 ; exit(0)


section .data
array times 81920 db "A"

According to perf, it runs at 1.82 instructions per cycle. I can't understand why it's so fast. After all, each store has to go to memory (RAM), so it should be slow.

P.S. Is there any loop-carried dependency?

EDIT

global _start
section .text

%define n    100000

_start:
    xor rcx, rcx

    jmp .cond
.begin:
    movnti [array+rcx], eax ; NT store; the address advances 1 byte per iteration

.cond:
    add rcx, 1
    cmp rcx, n
    jl .begin

    mov eax, 60             ; __NR_exit
    xor edi, edi
    syscall                 ; exit(0)


section .data
array times n dq 0

Now the loop takes 5 cycles per iteration. Why? After all, there is still no loop-carried dependency.

    Related Q from the same user: http://stackoverflow.com/questions/37101644/unexpected-slowdown-from-inserting-a-nop-in-a-loop-and-from-reading-near-a-movn – Peter Cordes May 08 '16 at 18:56

2 Answers


movnti can apparently sustain a throughput of one per clock when writing to the same address repeatedly.

I think movnti keeps writing into the same fill buffer, and it's not getting flushed very often because there are no other loads or stores happening. (Intel's write-combining documentation discusses this in the context of copying from WC video memory with SSE4.1 NT loads, as well as storing to normal memory with NT stores.)

So the NT write-combining fill-buffer acts like a cache for multiple overlapping NT stores to the same address, and writes are actually hitting in the fill buffer instead of going to DRAM each time.

DDR DRAM only supports burst-transfer commands. If every movnti produced a 4B write that actually was visible to the memory chips, there'd be no way it could run that fast. The memory controller either has to read/modify/write, or do an interrupted burst transfer, since there is no non-burst write command. See also Ulrich Drepper's What Every Programmer Should Know About Memory.
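
As a rough sanity check on the numbers (my own back-of-the-envelope arithmetic, not from the original answer): dual-channel DDR2-533 peaks at about 533 MT/s × 8 B × 2 channels ≈ 8.5 GB/s, while one 4 B store per cycle at 2.4 GHz would be ≈ 9.6 GB/s of payload, which is already over the theoretical ceiling even before counting the per-command burst overhead each isolated 4 B write would incur.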

We can further prove this is the case by running the test on multiple cores at once, as shown in the parallel runs below. Since they don't slow each other down at all, we can be sure that the writes are only infrequently making it out of the CPU cores and competing for memory cycles.


The reason your experiment doesn't show your loop running at 4 instructions per clock (one cycle per iteration) is that you used such a tiny repeat count. 100k iterations barely outlast the startup overhead (which perf's timing includes).

For example, on a Core2 E6600 (Merom/Conroe) with dual channel DDR2 533MHz, the total time including all process startup / exit stuff is 0.113846 ms. That's only 266,007 cycles.

A more reasonable microbenchmark shows one iteration (one movnti) per cycle:

global _start
_start:
    xor ecx,ecx
.begin:
    movnti  [array], eax
    dec     ecx
    jnz     .begin         ; 2^32 iterations

    mov eax, 60     ; __NR_exit
    xor edi,edi
    syscall         ; exit(0)

section .bss
array resb 81920

(asm-link is a script I wrote)

$ asm-link movnti-same-address.asm
+ yasm -felf64 -Worphan-labels -gdwarf2 movnti-same-address.asm
+ ld -o movnti-same-address movnti-same-address.o
$ perf stat -e task-clock,cycles,instructions ./movnti-same-address 

 Performance counter stats for './movnti-same-address':

       1835.056710      task-clock (msec)         #    0.995 CPUs utilized          
     4,398,731,563      cycles                    #    2.397 GHz                    
    12,891,491,495      instructions              #    2.93  insns per cycle        
       1.843642514 seconds time elapsed

Running in parallel:

$ time ./movnti-same-address; time ./movnti-same-address & time ./movnti-same-address &

real    0m1.844s / user    0m1.828s    # running alone
[1] 12523
[2] 12524
peter@tesla:~/src/SO$ 
real    0m1.855s / user    0m1.824s    # running together
real    0m1.984s / user    0m1.808s
# output compacted by hand to save space

I expect perfect SMP scaling (except with hyperthreading), up to any number of cores. e.g. on a 10-core Xeon, 10 copies of this test could run at the same time (on separate physical cores), and each one would finish in the same time as if it was running alone. (Single-core turbo vs. multi-core turbo will also be a factor, though, if you measure wall-clock time instead of cycle counts.)


zx485's uop count nicely explains why the loop isn't bottlenecked by the frontend or unfused-domain execution resources.

However, this disproves his theory about the ratio of CPU to memory clocks having anything to do with it. Interesting coincidence, though, that the OP chose a count that happened to make the final total IPC work out that way.


P.S. Is there any loop-carried dependency?

Yes, the loop counter (1 cycle). BTW, you could have saved an instruction by counting down towards zero with dec / jg instead of counting up and having to use a cmp.
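
For example, counting down lets dec set the flags for the branch directly (a sketch using the question's n and array; untested):

    mov     rcx, n          ; count down from n instead of up from 0
.begin:
    movnti  [array], eax
    dec     rcx             ; dec sets ZF/SF/OF, so no separate cmp is needed
    jg      .begin          ; loop while rcx > 0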


The write-after-write memory dependency isn't a "true" dependency in the normal sense, but it is something the CPU has to keep track of. The CPU doesn't "notice" that the same value is written repeatedly, so it has to make sure the last write is the one that "counts".

This is called a write-after-write (WAW) hazard, also known as an output dependency; the term applies to memory as well as registers.
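
A minimal illustration of the hazard (a hypothetical snippet, not from the question):

    mov     dword [array], 1    ; older store
    mov     dword [array], 2    ; younger store to the same address: a WAW hazard
    mov     eax, [array]        ; must load 2, so the stores have to commit in program order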

Peter Cordes

The result is plausible. Your loop consists of the following instructions. According to Agner Fog's instruction tables, they have these timings:

Instruction    regs    fused  unfused  ports    Latency  Reciprocal Throughput
------------------------------------------------------------------------------
MOVNTI         m,r     2      2        p23 p4   ~400     1
ADD            r,r/i   1      1        p0156    1        0.25
CMP            r,r/i   1      1        p0156    1        0.25
Jcc            short   1      1        p6       1        1-2 (if predicted taken)
Fused CMP+Jcc  short   1      1        p6       1        1-2 (if predicted taken)

So

  • MOVNTI consumes 2 uOps: one in port 2 or 3, and one in port 4
  • ADD consumes 1 uOp in port 0, 1, 5, or 6
  • CMP and Jcc macro-fuse into the last line of the table, consuming 1 uOp
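
Adding these up (my arithmetic, following the table above): one iteration is 2 + 1 + 1 = 4 fused-domain uops, which exactly matches the 4-uops-per-cycle issue width of recent Intel CPUs, so the best case is one iteration (4 instructions) per cycle.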

Because neither ADD nor CMP+Jcc depends on the result of MOVNTI, they can execute (nearly) in parallel on recent architectures, for example using ports 1, 2, 4, and 6. The worst case would be a latency of 1 cycle between ADD and CMP+Jcc.

This is most likely a design error in your code: you're essentially writing to the same address [array] 100,000 times, because you never adjust the address.
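
Here is an untested sketch of the fix also suggested in the comments below, in NASM syntax (assuming the question's %define n; the buffer is resized so the scaled index stays in bounds):

    xor     rcx, rcx
.begin:
    movnti  [array+rcx*4], eax  ; scale the index by 4: each iteration writes its own dword
    add     rcx, 1
    cmp     rcx, n
    jl      .begin

section .bss
array:  resd n                  ; n dwords, so the last store [array+(n-1)*4] stays in bounds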

The repeated writes can even go to the L1 cache under certain conditions; Intel's manual notes that

"The memory type of the region being written to can override the non-temporal hint, if the memory address specified for the non-temporal store is in an uncacheable (UC) or write protected (WP) memory region."

but that doesn't look to be the case here, and it wouldn't make a great difference anyway, because even when writing to memory, memory speed would be the limiting factor.

For example, if you have a 3 GHz CPU and 1600 MHz DDR3 RAM, this results in 3/1.6 = 1.875 CPU cycles per memory cycle. This seems plausible.

zx485
  • I still cannot understand. A single movnti (not in a loop) takes ~120 cycles. I understand that in a loop (without dependencies) many operations can execute concurrently, but how many? The problem is that it's a memory operation, so it must be limited. Reads/writes to memory are said to be slow. Please explain :) –  May 08 '16 at 17:39
  • Well, I don't have a good explanation for the latency value of ~400 either. It is the result of experimentation. But that is not the problem: the thing is you are writing the same 4 bytes over and over to the same address of `[array]` – the mentioned design error. You can replace the line `movnti [array], eax` with `movnti array[rcx*4], eax` (you may have to adjust that to the syntax of your assembler) and measure again to see if it yields a different result. But, as explained in my answer, the memory _is_ the bottleneck as calculated in the last line. But not so bad anymore as in times past :-) – zx485 May 08 '16 at 17:58
  • I don't think the last line makes sense, dividing CPU clock speed by RAM speed. That's more of a coincidence. The loop is 4 instructions, so one NT store per RAM clock would mean `4 * 1600 / 3000 = 2.133`. That's just totally implausible, and not how DRAM works in the first place. It only supports burst transfers that take multiple cycles to set up. – Peter Cordes May 08 '16 at 18:52
  • @Peter Cordes: I made a minor change to the last sentence (ins => cycles). I am unsure how many uOps can be processed per cycle, but I assume more than one. [In this answer about uOps](http://stackoverflow.com/a/31027695/1305969), you wrote "is bottlenecked on the 4-uops-per-cycle fused-domain limit of Intel CPUs." Practically, from my experience, the limit _is_ the memory bandwidth. That can mean many instructions per (uncached/NT) memory access. So when I reach memory speed, I stop optimizing. It won't get better with the average algorithm. – zx485 May 08 '16 at 19:15
  • @zx485: yup, most code bottlenecks on memory bandwidth, unless you can do a cache-blocking aka loop-tiling optimization to have it reuse data while it's in the cache. You're right that *if* this code was actually writing to memory with every `movnti`, it would be slower. This is why I conclude that it isn't. – Peter Cordes May 08 '16 at 19:21
  • The pipeline in Intel Core2/Nehalem and SnB-family CPUs has a max throughput of [4 fused-domain uops per clock, limited by issue/rename and retirement bandwidth.](http://stackoverflow.com/a/37062887/224132) It is possible to achieve this when you aren't bottlenecked by memory. – Peter Cordes May 08 '16 at 19:25
  • Your edit has turned your wrong idea about how memory works into the main point of the answer. You're assuming that DDR3 or DDR4 DRAM can accept a 4B write every memory clock cycle. That's not even close to how DRAM works. It can only transfer in 64B bursts, but there's also a command to interrupt a burst read/write part way through. I'm almost certain that overlapping writes are getting combined inside the core before reaching the memory controller. IDK if the memory controller does a read-modify-write of a partially-modified DRAM 64B chunk, or if it interrupts a burst. – Peter Cordes May 08 '16 at 20:44
  • See Ulrich Drepper's [What Every Programmer Should Know About Memory](http://www.akkadia.org/drepper/cpumemory.pdf). For the commands a memory controller can send to DRAM, see [Wikipedia's SDRAM article](https://en.wikipedia.org/wiki/Synchronous_dynamic_random-access_memory#Commands). That article says that all DDRx generations use essentially that command set, so yes, this does still apply to DDR3/DDR4. You just can't do a separate 4B write every DRAM clock cycle, **so the CPU must be merging the overlapping writes**. – Peter Cordes May 08 '16 at 20:46
  • **We could test this by running the same code on multiple cores**. If they're actually bottlenecked on DRAM, they'll slow each other down to about the same total throughput. If not, they'll each run at about the same speed, with maybe some slowdown. I predict the latter. @J.Doe: can you test this? – Peter Cordes May 08 '16 at 20:52
  • @Peter Cordes: Yes. I would appreciate if J.Doe would test this. As long as there is no specification of his test rig, the performance of my theory remains empirically indeterminate. However, I'll revert my edit and am probably out of here. – zx485 May 08 '16 at 21:57
  • @zx485: I tested myself and found that simultaneous runs don't compete with each other. Moreover, the 1.82 was only a result of using way too small an iteration count. IIRC, the OP's other question says he has an IvyBridge CPU. It's easy to compile and run this code on any x86 system with NASM or YASM installed. See my updated answer. – Peter Cordes May 08 '16 at 22:02
  • @PeterCordes, I am eager to test it, but please point me to how to write multithreaded code in assembly. I only have experience with multithreading in C++ :) I edited my post. –  May 08 '16 at 22:22
  • @J.Doe: just run the single-threaded program multiple times, like I did in my updated answer. You don't need or want any shared memory between the copies. I used the Unix shell's ability to run a process in the background to get two copies started with one keypress. The test runs for ~1.8 seconds, which is plenty longer than the startup overhead. Process startup overhead is low enough that they're both running at once for almost all of their running time. It's long enough that even return -> alt-tab -> return could be good enough manually. – Peter Cordes May 08 '16 at 22:26
  • I tested. I confirmed your hypothesis :) Thanks a lot :) –  May 08 '16 at 23:03