0

Suppose

for (int i = 0; i < N; ++i) {
  sum += nums[i];
}

is unrolled by the compiler to something that looks like

for (int i = 0; i < N; i += 4) {
  sum1 += data[i];
  sum2 += data[i+1];
  sum3 += data[i+2];
  sum4 += data[i+3];
}
sum = sum1 + sum2 + sum3 + sum4;

What happens if loops are unrolled beyond the bounds of when they should have been? For example, if N is 101, then i will at some point be 100 but indices 101, 102, 103 are past the bounds of the array.

Maybe we want to be conservative when unrolling and compute the end of the loop like normal -- but even so, where does a conservative guess for how much the loop can be unrolled come from if the number of iterations is determined at runtime?

bun9
  • 161
  • 8
  • That's a software bug, nothing to do with pipelined processors. The loop conditions needs to be `i < N-3`, and then you need a cleanup loop to handle the last 0..3 elements. – Peter Cordes Feb 26 '22 at 01:56
  • @PeterCordes Okay thanks, that makes sense. The compiler knows how many (original) iterations it has unrolled in each new iteration so makes sure there is no overflow. The reason I was confused at first is because I was imagining unrolling as: do k iterations outside of any loop and then looping over the rest and this code sample where unrolled code is all within some loop was pulled from somewhere else and I didn't think about how it is different. – bun9 Feb 26 '22 at 07:08
  • I wonder if you got that 2nd code from [When, if ever, is loop unrolling still useful?](https://stackoverflow.com/a/2349265) ? Another loop-unrolling question led me to find that Q&A as a duplicate, and I ended up fixing the answer there to not have this loop-bound bug. Then the new answer here made me look at this question again and recognize the code. In future, if you copied code from somewhere, it's good to mention that and link it, so people answering can see what context it came from that explained something. Also just to give credit to the author. – Peter Cordes Feb 27 '22 at 02:35
  • Yes got it from that post. Will do that next time. Thanks for your help. – bun9 May 08 '22 at 15:26

1 Answers1

0

Pete's right about the code missing the last 3 elements. Regarding your question about the pipeline...

Pipelined processors have a heavy branch penalty, they do not like excessively tight loops. At the assembly level, a pipelined processor will have access to repeat single and repeat block instructions.

In a high level language, unrolling can be performed by the compiler or by the developer. In the code you listed, the developer code performs 4 sums per each cycle.

  • *Pipelined processors have a heavy branch penalty, they do not like excessively tight loops* - Most modern CPUs can run tight loops at 1 iteration per clock, although unrolling to reduce loop overhead is helpful. Correctly-predicted branches have some front-end penalty, but a loop buffer can help some. (e.g. [Is performance reduced when executing loops whose uop count is not a multiple of processor width?](https://stackoverflow.com/q/39311872) shows how Haswell and later actually "unroll" tight loops of odd sizes in their loop buffers.) – Peter Cordes Feb 27 '22 at 02:29
  • Most ISAs don't have a "repeat single" or especially not "repeat block" instruction, just normal branches. Did you have something in mind for those? Are you talking about x86's `rep` prefix, which only applies to a few special instructions like `rep movsb`? (https://github.com/HJLebbink/asm-dude/wiki/REP_REPE_REPZ_REPNE_REPNZ / [Enhanced REP MOVSB for memcpy](https://stackoverflow.com/q/43343231) is fast only because microcode handles that combination as wide copies, not actually repeating a byte-copy instruction). If you meant something else, what ISA? – Peter Cordes Feb 27 '22 at 02:31