I have disassembled a small C++ program compiled with MSVC v140 and am trying to estimate the cycles per instruction in order to better understand how code design impacts performance. I've been following Mike Acton's CppCon 2014 talk on "Data-Oriented Design and C++", specifically the portion I've linked to.

In it, he points out these lines:

movss   8(%rbx), %xmm1
movss   12(%rbx), %xmm0

He then claims that these 2 x 32-bit reads are probably on the same cache line and therefore cost roughly ~200 cycles.

The Intel 64 and IA-32 Architectures Optimization Reference Manual has been a great resource, specifically "Appendix C - Instruction Latency and Throughput". However, on page C-15 in "Table C-16. Streaming SIMD Extension Single-precision Floating-point Instructions" it states that movss is only 1 cycle (unless I'm misunderstanding what latency means here... if so, how do I read this thing?)

I know that a theoretical prediction of execution time will never be exact, but nevertheless this is important to learn. How can these two instructions cost 200 cycles, and how can I learn to reason about execution time beyond this snippet?

I've started to read some things on CPU pipelining... maybe the majority of the cycles are being picked up there?

PS: I'm not interested in actually measuring hardware performance counters here. I'm just looking to learn how to reasonably sight-read ASM and estimate cycles.

  • Have you looked at Agner Fog's work? http://www.agner.org/optimize/instruction_tables.pdf – David Hoelzer Mar 30 '16 at 02:43
  • You cannot count cycles anymore, not for a long time now: pipelining, caches, branch prediction, etc. Pipelining is just an assembly line like in a factory. You might have 117 steps or stations to build one thing and each may take 30 seconds, but that means you can in theory get one item produced every 30 seconds, not one per hour, because of the production line. That is how fast they come out the back end. – old_timer Mar 30 '16 at 03:10
  • Most if not all instructions take one clock to EXECUTE; they may take 15 to 1000 clocks to fetch through cache misses to relatively slow DRAM, plus all the other steps in the pipeline, but once all the inputs are present at the execution unit, it then takes one clock for that one step in the pipeline. Then comes all the stuff that follows: saving the results in registers or slow memory, etc. – old_timer Mar 30 '16 at 03:11
  • Don't get me wrong, there is a lot of value in what you are doing: understanding what the compiler is producing, and how to manipulate the compiler to produce different things. Understand that the number of instructions is not equal to speed. Memory instructions take longer than register instructions; if you have cache misses, dozens of register instructions can outrun a single memory-based instruction, all other things held equal. With multiple execution units and pipes, you can rearrange instructions that don't have to be in a specific order and change the overall performance. – old_timer Mar 30 '16 at 03:15
  • And other fun things: add one or more nops, or remove them, to change the cache alignment of the whole program or portions of it, and watch the performance change. – old_timer Mar 30 '16 at 03:15
  • Folks bash me for this, but https://github.com/jagregory/abrash-zen-of-asm is still extremely relevant: not for the 8088, which was outdated even at the time of publication, but for how to think about the problems, attack them, and not jump to conclusions. Still relevant; I use what I learned there decades ago regularly in my day job and hobbies. – old_timer Mar 30 '16 at 03:18
  • @dwelch While jumping around in Mike Acton's talk, I noticed at ~30:00 he cites Jason Gregory, who apparently found that main RAM to CPU is ~200 cycles, L2 to CPU is ~20, L1 to CPU is ~3, and registers to CPU are free. Assuming a 64-byte cache line, I can see now that 2 x 32-bit reads will fit on ONE cache line and, read from memory, come out to roughly 200 cycles. It all makes sense now! – Stradigos Mar 30 '16 at 05:01
  • @dwelch And thank you for your response, it was definitely helpful. The world below the compiler is mostly a mystery to me. I know Mike Acton and other data-oriented designers probably catch a lot of hate, but they are truly onto something. And what's so strange to me is that none of this is new... just forgotten among the luxuries of modern CPU architectures and compilers. It's really about a return to first-order principles and how they can benefit performance-critical applications. It's really set my mind to something. Thanks for illuminating the subject even more! – Stradigos Mar 30 '16 at 05:04
  • If that cache line is already hot, then both loads can execute in parallel, and be ready about 4 or 5 cycles after `%rbx` was ready (or from when the out-of-order execution core saw the load instructions, if there's no need to wait on `%rbx`). (Intel SnB-family CPUs have an L1 load-use latency of 4 cycles for integer loads. I forget if it's another 1c for FP loads). If that line is cold, then both loads will be in flight, and both will finish once the cache line is received by the core. Read Agner Fog's microarch pdf. Also see [the x86 tag wiki](http://stackoverflow.com/tags/x86/info). – Peter Cordes Mar 30 '16 at 11:31
  • Counting instructions is a pretty poor measure on modern cores with speculative execution. Intel processors count the number of *retired* instructions, the ones whose result was actually used. But it is a pretty random number, it ultimately depends on the data you process. If you care about perf then you only care about *time*. That's easy to measure, don't use cycles. – Hans Passant Mar 30 '16 at 13:31

1 Answer


As you already pointed out, the theoretical throughput and latency of a MOVSS instruction is 1 cycle. You were looking at the right document (the Intel Optimization Manual). Agner Fog (mentioned in the comments) measured the same numbers in his Instruction Tables for Intel CPUs (AMD has a higher latency).

This leads us to the first problem: which specific microarchitecture are you investigating? This can make a big difference, even for the same vendor. Agner Fog reports that MOVSS has a 2-6 cycle latency on AMD Bulldozer, depending on the source and destination (register vs. memory). This is important to keep in mind when looking into the performance of computer architectures.

The 200 cycles are most likely cache misses, as already pointed out by dwelch in the comments. The numbers you get from the Optimization Manual for any memory-accessing instruction all assume that the data resides in the first-level cache (L1). Now, if you have never touched the data with previous instructions, the cache line (64 bytes with Intel and AMD x86) will need to be loaded from memory into the last-level cache, from there into the second-level cache, then into L1 and finally into the XMM register (within 1 cycle). Transfers between L3-L2 and L2-L1 have a throughput (not latency!) of two cycles per cache line on current Intel microarchitectures. And the memory bandwidth can be used to estimate the throughput between L3 and memory (e.g., a 2 GHz CPU with an achievable memory bandwidth of 40 GB/s will have a throughput of 3.2 cycles per cache line). Cache lines or memory blocks are typically the smallest unit caches and memory can operate on; they differ between microarchitectures and may even differ within one architecture, depending on the cache level (L1, L2 and so on).
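
To put numbers on that last estimate, here is a minimal back-of-the-envelope sketch in C++ (the 2 GHz clock and the 40 GB/s bandwidth are just the example figures from above; substitute the values for your own machine):

    #include <cstdio>

    int main() {
        // Example figures from the text above -- replace with your machine's values.
        const double clock_hz   = 2.0e9;   // 2 GHz core clock
        const double mem_bw     = 40.0e9;  // 40 GB/s achievable memory bandwidth
        const double line_bytes = 64.0;    // x86 cache line size

        // Cache lines the memory interface can deliver per second,
        // converted into core cycles per cache line.
        double lines_per_sec   = mem_bw / line_bytes;       // 625e6 lines/s
        double cycles_per_line = clock_hz / lines_per_sec;  // 3.2 cycles
        std::printf("%.1f cycles per cache line (throughput)\n", cycles_per_line);
    }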

Now this is all throughput, not latency, so it will not help you estimate what you have described above. To verify this, you would need to execute the instructions over and over (for at least 1/10 s) to get cycle-accurate measurements. By changing the instructions you can decide whether you measure latency (by introducing dependencies between instructions) or throughput (by keeping each instruction's input independent of the results of previous instructions). To account for caches and memory accesses you would need to predict whether an access hits a cache or not; this can be done using layer conditions.
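
A minimal sketch of that latency-vs-throughput idea in C++ (illustrative only: the float adds stand in for whatever instruction you want to measure, and a serious benchmark would also pin the thread and keep the compiler from rewriting the loops):

    #include <chrono>
    #include <cstdio>

    int main() {
        const long iters = 100000000;  // run long enough (>= 1/10 s)
        volatile float sink;

        // Latency: each add depends on the previous result, so the chain
        // runs at one add per *latency* of the instruction.
        float x = 1.0f;
        auto t0 = std::chrono::steady_clock::now();
        for (long i = 0; i < iters; ++i)
            x += 1.0f;                 // serial dependency chain
        auto t1 = std::chrono::steady_clock::now();
        sink = x;

        // Throughput: independent accumulators let the core overlap the adds,
        // so the loop runs at the instruction's *throughput*.
        float a = 1.0f, b = 1.0f, c = 1.0f, d = 1.0f;
        auto t2 = std::chrono::steady_clock::now();
        for (long i = 0; i < iters; ++i) {
            a += 1.0f; b += 1.0f; c += 1.0f; d += 1.0f;
        }
        auto t3 = std::chrono::steady_clock::now();
        sink = a + b + c + d;
        (void)sink;

        using ms = std::chrono::duration<double, std::milli>;
        std::printf("latency-bound:    %.1f ms\n", ms(t1 - t0).count());
        std::printf("throughput-bound: %.1f ms (4x the adds)\n", ms(t3 - t2).count());
    }

If the add latency is larger than the inverse throughput, the second loop finishes in much less than four times the time of the first, even though it performs four times as many additions.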

A tool to estimate instruction execution (both latency and throughput) for Intel CPUs is the Intel Architecture Code Analyzer (IACA), which supports multiple microarchitectures up to Haswell. The latency predictions are to be taken with a grain of salt, since it is much harder to estimate latency than throughput.
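
For reference, IACA analyzes a region of the compiled binary that you mark with macros from the iacaMarks.h header shipped with the tool. A typical setup looks roughly like this (the function is just an illustration; exact command-line flags differ between IACA versions):

    #include "iacaMarks.h"  // ships with IACA; defines IACA_START and IACA_END

    void scale(float* dst, const float* src, float s, int n) {
        for (int i = 0; i < n; ++i) {
            IACA_START          // marks the top of the analyzed loop body
            dst[i] = src[i] * s;
        }
        IACA_END                // after the loop, so the loop branch is included
    }

    // Compile to an object file and run something like:  iaca -arch HSW scale.o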

  • IACA in general has to be taken with a grain of salt. It assumes no cache misses, and has some limitations to how it models things. IDK why you say it's harder to estimate latency than throughput, though. Almost all instructions have fixed latency, because that makes the out-of-order scheduler's job much easier (in terms of avoiding write-back conflicts where an execution port tries to produce two results in the same cycle). Anyway, since IACA doesn't simulate cache misses, the only tricky thing for latency is resource conflicts that stop an insn from executing as soon as inputs are ready. – Peter Cordes Mar 30 '16 at 16:37
  • You're right. Latency within the core, if all data is assumed available in L1, is not harder to estimate than throughput. But once it depends on loads from other cache levels or memory, it becomes very hard to predict and is much further off than throughput. This is of course owing to the concept of latency: a single delay will increase the overall latency in full, whereas it affects the throughput, and thus the total runtime, only slightly. So yes, IACA predictions always have to be taken with a grain of salt, but the nature of throughput makes it more forgiving of errors. – como Mar 31 '16 at 11:08
  • So you're saying that IACA latency numbers are less likely to match reality than its throughput numbers. That makes some sense, as long as you don't have many L3 misses. Main memory latency is enough to stall the pipeline and affect throughput, too, but *if* memory latency isn't part of the critical path, even L2 misses that hit in L3 might not have much impact on throughput. – Peter Cordes Mar 31 '16 at 12:19
  • Of course, but the throughput from memory (as well as from L3 and L2) to L1 can be reasonably estimated and usually matches measurements. See Figure 4 in the following paper: [Automatic Loop Kernel Analysis and Performance Modeling With Kerncraft](http://arxiv.org/abs/1509.03778). The L1-to-register throughput is estimated using IACA, and all other transfer times are calculated based on documentation by Intel and memory bandwidth measurements. This would not be suitable for latency. But then latency usually interests real-time systems people, who worry about worst-case execution time only. – como Apr 01 '16 at 07:23