
I understand (vaguely) what Cycles Per Instruction (CPI) and Instructions Per Cycle (IPC) mean. CPI is the number of clock cycles required to execute a program divided by the number of instructions executed while running it. IPC, on the other hand, is the number of instructions executed while running a program divided by the number of clock cycles required to execute it.
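
For example, a program that executes 2 billion instructions in 1 billion clock cycles has an IPC of 2.0 and, equivalently, a CPI of 0.5; the two metrics are reciprocals of each other.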

However, I am having trouble understanding what Cycles Per Element (CPE) means when associated with loops.

For example, in the following code,

/* Combine all vector elements into *dest, accumulating in a local variable */
void combine4(vec_ptr v, data_t *dest) {
    long i;
    long length = vec_length(v);
    data_t *d = get_vec_start(v);   /* raw pointer to the element array */
    data_t t = IDENT;               /* identity element for OP          */
    for (i = 0; i < length; i++)
        t = t OP d[i];
    *dest = t;
}

we can make several optimizations by restructuring the for loop.

One method is known as loop unrolling:

long limit = length - 1;   /* stop one short so d[i + 1] never reads past the end */
data_t x = IDENT;

/* Combine 2 elements at a time */
for (i = 0; i < limit; i += 2) {
    x = (x OP d[i]) OP d[i + 1];
}
/* Finish any remaining elements */
for (; i < length; i++) {
    x = x OP d[i];
}

To improve it even more, we can move the parentheses so that the two array elements are combined with each other first, reassociating the computation:

x = x OP (d[i] OP d[i + 1]);

I believe the processor can start calculating the next iteration's `d[i] OP d[i + 1]` early, because that computation does not depend on the previous value of x. How does the term CPE apply to this optimization? Would the CPE decrease, because it takes fewer cycles to run through all the elements?

  • Are those parentheses in the right place? Surely you need to evaluate `x OP d[i]` before you can evaluate `OP d[i+1]`? – Tim Randall Jun 07 '21 at 18:30
  • @TimRandall, I don't think it's necessary? For example, if OP were the addition operator, either `x OP d[i]` or `d[i] OP d[i + 1]` could be evaluated first. –  Jun 07 '21 at 18:35
  • Side note: If your compiler is not unrolling loops when there is benefit to doing so, you may want to consider using a better compiler. – user4581301 Jun 07 '21 at 18:37
  • My understanding is that Cycles Per Instruction is the number of clock cycles that elapse while executing a single, specific instruction. Many processor data sheets and programming guides list Cycles Per Instruction for each instruction. It is not an average over different instructions (e.g. a program). – Thomas Matthews Jun 07 '21 at 18:37
  • @user4581301 aren't there a lot of cases where unrolling done by the compiler, such as at -O1 optimization, isn't as effective as reorganizing the code? –  Jun 07 '21 at 18:43
  • The term Instructions Per Cycle may be a profiling or benchmarking term, which is annoying. The problem is that different instructions require different numbers of clock cycles. For example, a register-to-register copy/move may take fewer clock cycles than an instruction that loads a register from memory using offset and indexing (or even a multiplication or division instruction). This is why it is annoying. The number may change for the same program, depending on conditional instructions (`if` statements). – Thomas Matthews Jun 07 '21 at 18:43
  • Don't forget that instruction cache and memory cache affect the Instructions per Clock Cycle. For example, a load-register-from-memory instruction can be very quick if the value is in the cache, and a lot slower if the cache has to be reloaded. Similarly with the instruction cache. – Thomas Matthews Jun 07 '21 at 18:46
  • @AkashiLi Yes, in the case where `a OP b == b OP a` you can evaluate the items in any order. Question for you: how safe do you think that assumption is? Can you think of a definition of `OP` for which it isn't true? – Tim Randall Jun 07 '21 at 18:52
  • Agreed, and you can fool the compiler outright, but I'd start by moving the functions that I need highly optimized to a file to be compiled with a higher optimization level and see if that's good enough. Usually code confusing enough to fool the compiler isn't going to benefit appreciably from removing the loop overhead. I'll go to great lengths to avoid complicating code unless I really, REALLY need that last percent. – user4581301 Jun 07 '21 at 18:57
  • @TimRandall: Note that the key property here is associativity (e.g. `(a+b) + c == a + (b+c)`), not commutativity (`a+b == b+a`). Many operations that have one property also have the other, but they're not the same thing. Notably, floating-point operations like `+` are commutative but not *quite* associative (because of rounding error), so compilers can only do optimizations like using multiple accumulators (or unrolling in this way to shorten the loop-carried dependency chain) with options like `-ffast-math`. – Peter Cordes Jun 08 '21 at 02:28
  • (Also note that different results doesn't mean worse: for operations like addition (OP=`+`) where adding a small number to a large number loses precision, multiple accumulators, or adding pairs first, is numerically better. [Simd matmul program gives different numerical results](https://stackoverflow.com/a/55478102) explains that it's the first step towards pairwise summation. Again, this is relevant for FP addition, because of the details of addition, not just because it's associative except for rounding error. FP multiplication wouldn't have the same precision benefit, just performance.) – Peter Cordes Jun 08 '21 at 02:31
  • @PeterCordes the problem is that we don't know if `OP` is defined to be associative. The question doesn't state that. It just assumes that "we can put parentheses around the array accesses" without changing the result of the calculation. This wouldn't be the first time an engineer has tried to manually optimize some code and introduced an error, rather than letting the compiler do it automatically. – Tim Randall Jun 08 '21 at 11:55
  • @TimRandall: Well that's the thing; FP `+` isn't *strictly* associative, so compilers can't do it for you. (Unless you use `-fassociative-math` or `-ffast-math`). This specific question seems to be from a book, apparently discussing associative OPs in general. Or just talking about the performance when you make that change, whether the result is (exactly) the same or not. (e.g. for FP math it probably won't be). But anyway, commutativity is a different property, and doesn't imply associativity. (And conversely, you can re-associate integer `a + b - c` as `a + (b - c)`, or as `(a-c) + b`) – Peter Cordes Jun 08 '21 at 12:24
  • @TimRandall: Related [Associativity gives us parallelizability. But what does commutativity give?](https://stackoverflow.com/q/35443424). (And BTW, that `(a-c) + b` transformation relies on `+` being commutative. Or I'm not sure how to describe exactly what's going on in `a - b - c` == `a - (b+c)` - it's sort of inverse associativity, or perhaps distributive). Anyway, I originally replied to your first comment because the expression you used was for commutativity, not associativity. Compiler vs. human optimization of operations that are approximately associative is an interesting tangent – Peter Cordes Jun 08 '21 at 12:30

2 Answers


Cycles Per Element is the term used for the average number of CPU cycles spent per element when you are iterating over an array, vector or other container of elements. For combine4(), you would measure the total number of CPU cycles it takes to run the whole for loop, and divide that by length to get the CPE.
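
For example, here is a minimal sketch of such a measurement, assuming an x86-64 machine where the compiler provides `__rdtsc()` in `<x86intrin.h>`, and reusing the question's `vec_ptr`/`data_t` types. A serious measurement would repeat the run many times, keep the fastest result, and account for frequency scaling, since the time-stamp counter ticks at a fixed reference rate rather than in core clock cycles:

#include <x86intrin.h>  /* __rdtsc(): read the CPU's time-stamp counter */

/* Rough estimate of the cycles per element of combine4() */
double measure_cpe(vec_ptr v) {
    data_t result;
    long length = vec_length(v);
    unsigned long long start = __rdtsc();   /* counter before the loop */
    combine4(v, &result);
    unsigned long long end = __rdtsc();     /* counter after the loop  */
    return (double)(end - start) / length;  /* total cycles / elements */
}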

Compilers nowadays are very good at optimizing code; for example, they might unroll or vectorize loops automatically. They can also change the order in which operations are executed, using knowledge of instruction timing and other micro-architectural details to produce the best sequence of instructions, as long as they can prove that the reordering does not change the final result. So manual changes might not do what you expect. Apart from benchmarking your changes, you should also look at the compiler's assembly output to see what kind of machine code it generates. Of course, don't forget to enable compiler optimizations. A nice web-based tool for this is https://godbolt.org/.
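
For example (assuming GCC), `gcc -O2 -S combine.c` writes the generated assembly to `combine.s`, where you can check whether the loop was unrolled or vectorized; Compiler Explorer at the link above does the same thing interactively.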

G. Sliepen

(Your code examples are taken from the textbook, Computer Systems: A Programmer's Perspective.)

Cycles per element is a higher-level metric here. Rather than measuring CPI or IPC, the unit that actually matters for the example loops is the element of the vector. So, by running the loop across hundreds or thousands of elements and measuring the cycles that the entire execution requires, we can plot the resulting measurements and compute the slope of the fitted line, which is the cycles per element.
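
Concretely, the measurements are fit to a line, total cycles ≈ overhead + CPE × n, where n is the number of elements; the slope of the fit is the CPE. With made-up numbers purely for illustration: if runs over n = 1000 and n = 2000 elements take about 10,080 and 20,080 cycles respectively, the slope is (20080 − 10080) / 1000 = 10 cycles per element, and the remaining ~80 cycles are fixed call and loop overhead.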

Moving the parentheses changes the association of the operations, so the OP combining the two data elements no longer depends on the accumulated value from the previous iteration. This can improve the CPE of the loop, provided there are sufficient hardware resources to execute the now-independent operations in parallel.
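
To make that concrete, compare the dependency chains of the two unrolled loop bodies (a sketch, assuming OP is associative so the rewrite is legal):

x = (x OP d[i]) OP d[i + 1];  /* both OPs are on the chain through x,  */
                              /* so each iteration waits for two OPs   */

x = x OP (d[i] OP d[i + 1]);  /* d[i] OP d[i + 1] does not depend on x; */
                              /* only one OP per iteration stays on the */
                              /* loop-carried chain, so iterations can  */
                              /* overlap in the hardware                */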

The takeaway should be that determining whether even a simple change to the code will improve execution time is difficult, because the answer depends on all parts of the system. Typically, a programmer should rely on the compiler and its excellent optimizations to achieve good performance, without needing to resort to fine-grained tuning.

Brian
  • The key thing with combining two elements separately from updating `x` is that it's no longer part of the latency of the loop-carried dependency chain. So it reduces the critical path latency. For something like summing an array or a dot-product, normally you'd use multiple accumulators to hide latency (especially FP, which has longer latency than integer add or mul). ([Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)](https://stackoverflow.com/q/45113527)) – Peter Cordes Jun 07 '21 at 22:27
  • The textbook also goes into the example of having multiple accumulators, as well as loop unrolling. They plot across different quantities of each. @PeterCordes, you are correct that the benefit is reducing the critical path. Whether this reduces the latency again will depend on the processor's resources, such as issue ports, ALUs, etc. – Brian Jun 08 '21 at 04:26
  • Do you mean latency as in total time? That's not what I meant, I meant latency of the ALU operation: how soon another operation can start which uses that result, in the superscalar / out-of-order back end of the CPU. It speeds up the loop if ALU latency was the bottleneck for the loop (which it often is in FP loops). Latency is one of three things you can measure about an instruction, the others being front-end throughput (fused-domain uops on Intel CPUs) and back-end execution ports it can run on. – Peter Cordes Jun 08 '21 at 04:53
  • e.g. `addps xmm, [rdi]` on Skylake is 1 uop for the front-end. The load part runs on port 2 or 3 (with ~6c latency from address to data assuming an L1d cache hit), and the FP ALU part [runs on port 0 or 1](https://uops.info/html-tp/SKL/ADDPS_XMM_M128-Measurements.html), with [4c latency from XMM0 input to output](https://uops.info/html-lat/SKL/ADDPS_XMM_M128-Measurements.html#lat2-%3E1_mem). https://uops.info/. – Peter Cordes Jun 08 '21 at 04:59
  • If you add up the latency along each loop-carried dep chain, that's one possible bottleneck, and separately add up the front-end uops that's another, and separately look for back-end port contention that's the 3rd possible bottleneck, assuming ideal scheduling and no stalls (cache miss / branch miss). [What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?](https://stackoverflow.com/a/51622129) (Ironically the title of that question uses "latency" the way I was complaining about, but see my answer there :P) – Peter Cordes Jun 08 '21 at 04:59