how to optimize this code with unrolling factor 3?

Question

void randomImprovedfunction(double a[], double p[], long n)
2 {
3      long i;
4      double last_v, v;
5      last_v = p[0] = a[0];
6      for (i=1; i<n; i++) {
7          v = last_v + a[i];
8          p[i] = v ;
9         last_v = v;
10     }
11
12     return ;
13 }

I have this function. as an exercise, i am told that it can be optimized using an unrolling factor of 3 and changing only lines 7-9. However, I am really lost on how this would be done.

Could someone show it to me?

Thank you! any help is appreciated :)

I would have done it as `for (x=1;x+3 < n; x+=3) ...`, changing the line 6 as well (with need to handle the remainder n%3 !=0). Changing the lines 7 to 9 only means, that also the `n` means now multiples of 3 and lines 7,8,9 need to be copied three times with the correct index calculation. — Aki Suihkonen, Jun 28 '21 at 07:26
please remove the line numbers and just add comments on lines that you want to talk about [Why is there no line numbering in code sections?](https://meta.stackoverflow.com/q/252559/995714) — phuclv, Jun 28 '21 at 07:30
@AkiSuihkonen: Normally you want `x < n-2` or `x+2 < n`. Or `x+3 <= n`. I prefer `x < n-2` to make it easier to for the compiler to hoist a loop-invariant, since the +2 isn't the same as the counter increment. `<=` can be ok; signed-overflow UB lets the compiler know it terminates. Your loop control would leave 1..3 elements not done, rather than 0..2. (Clear evidence that manual loop unrolling is tricky; even experienced humans are prone to getting it wrong; best to use `clang -O3` and let it unroll, when that's viable, because auto-vectorization usually works better on idiomatic loops). — Peter Cordes, Jun 28 '21 at 14:51
@AkiSuihkonen: Or you need to include an extra `i+=2` and `if() break;` in the loop body, basically working around the nonsense requirement of only changing the loop body, not the control. Unless the array element count is supposed to be `n*3` in the unrolled version, so you just do `i*3 + 0..2`. — Peter Cordes, Jun 28 '21 at 14:53
Not much point unrolling this prefix sum, unless you're going to assume associative FP. Separate `v` and `last_v` variables aren't doing anything useful. I'd simplify it to `v += a[i]`. (Which would compile the same, but be easier for humans.) Maybe they want you to assume associative FP, though, so you could do `tmp = a[x + 0] + ... + a[x+2]` so the loop-carried dependency is just `v += tmp`, with the intermediate `v` values computed from the old v. So you're doing redundant work to sidestep latency — Peter Cordes, Jun 28 '21 at 14:59
See also [parallel prefix (cumulative) sum with SSE](https://stackoverflow.com/q/19494114) for more brute-force using SIMD SSE/AVX shuffles. — Peter Cordes, Jun 28 '21 at 15:00

John D · Answer 1 · 2021-06-29T23:06:03.303

Your main goal with unrolling is to make it easier for the CPU instruction pipeline to process instructions. Multiple instructions can be in process at the same time, and various factors can interrupt the smooth flow. Eg, data dependencies: if a later instruction needs to load data and that data is being changed by earlier instructions, the later instruction has to wait at its load stage until the earlier instructions have saved that data. That is called a pipeline stall.

See comments for why data dependency is the main bottleneck in this example.

The textbook example given in the Question seems to be mainly an exercise to get familiarity with manually unrolling loops and is not intended to investigate any performance issues.

Your first draft for the unrolling code looks like this, but you will get unwanted cases

for (i=1; i < n; i+=3) {
    v = last_v + a[i];
    p[i] = v ;
    last_v = v;
    v = last_v + a[i+1];
    p[i+1] = v ;
    last_v = v;
    v = last_v + a[i+2];
    p[i+2] = v ;
    last_v = v;
}

Unwanted cases - note that the last index you want to process is (n-1)

n	Unwanted cases
5	Array indexes 1,2,3 then 4,5,6 => the unrolled code processes 2 unwanted cases, index 5 and 6
6	Array indexes 1,2,3 then 4,5,6 => the unrolled code processes 1 unwanted case, index 6
7	Array indexes 1,2,3 then 4,5,6 => no unwanted cases

See also Handling unrolled loop remainder

So, eliminate the last loop if there are any unwanted cases and you will then have

// `nn` is the adjusted loop limit to avoid an extra loop with unwanted cases 
int nn = n;
if ( (n-1)%3 != 0 ) nn = n - 3;

for (i=1; i < nn; i+=3) {
    v = last_v + a[i];
    p[i] = v ;
    last_v = v;
    v = last_v + a[i+1];
    p[i+1] = v ;
    last_v = v;
    v = last_v + a[i+2];
    p[i+2] = v ;
    last_v = v;
}

At this point we need to handle the remaining/missing cases:

n	nn	values of iterator i
5	2	1,4 => final i = n - 1
6	3	1,4 => final i = n - 2
7	7	1,4,7 => final i = n

If i = n - 1, you have 1 missing case, ie index n-1
If i = n - 2, you have 2 missing cases, ie index n-2 and n-1
If i = n, you're done

if ( i == n - 1 ) { // 1 missing case
    v = last_v + a[n-1]
    p[n-1] = v;
}
if ( i == n - 2 ) { // 2 missing cases
    v = last_v + a[n-2]
    p[n-2] = v;
    last_v = v;
    v = last_v + a[n-1]
    p[n-1] = v;
}

*Your main goal with unrolling is to reduce the number of branch points.* - That's not the bottleneck in this loop on a modern CPU, like the one in the querent's textbook (see previous questions like [how will unrolling affect the cycles per element count CPE](https://stackoverflow.com/q/68115996)). The bottleneck is the FP dependency chain. Doing something about *that* is the key factor, unless we're tuning for hyperthreading, and want to use fewer uops even if we don't change the latency bottleneck. (See also my comments under the question about shortening the `v += ...` dep chain) — Peter Cordes, Jun 29 '21 at 00:02
Also, you don't need to use `%3` to find a new loop upper bound; since we're using `<` instead of calculating an exact `==` termination condition (only useful on MIPS), `nn = n - 2` always works. With signed `n` you don't even have to special case `n < 2`. (Also discussed in comments under the question) — Peter Cordes, Jun 29 '21 at 00:05
@PeterCordes I thought the OP was confused about what the textbook question meant so was trying to give a simple answer so they could see broadly how unrolling works. I'll fix the preamble re branching once I've read your references. — John D, Jun 29 '21 at 00:39
Yeah, IDK whether the querent just needs the super basics of a naive unroll laid out, or what. And that's probably useful in general / in theory. Just don't expect it to help performance much if at all on real CPUs. (Maybe doing something about the serial dependency is the next exercise in the textbook.) There's certainly useful stuff in this answer, especially about getting the loop condition right: that comes up in SIMD loops all the time. — Peter Cordes, Jun 29 '21 at 01:04
Another point is that modern *compilers* can unroll for us (especially clang), and manual unrolling can actually defeat that (or more often defeat auto-vectorization), vs. a more idiomatic loop. But if your compiler doesn't already do even better than whatever manual unrolling you come up with, it can sometimes help. Some loops do bottleneck on front-end throughput, not latency. — Peter Cordes, Jun 29 '21 at 01:05

how to optimize this code with unrolling factor 3?

1 Answers1