Answering the second question: the explicit count does indeed lead to more opportunities for loop unrolling, though even with pointers iterating through a fixed-size array, gcc does not perform aggressive unrolling unless asked to do so with -funroll-loops. The other gain comes from a potentially simpler end-of-loop comparison test for non-trivial iterators.
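For concreteness, here is a minimal sketch of the two shapes being compared. The function names match the labels in the assembly listings below, but the bodies are my reconstruction rather than necessarily the exact benchmarked code:

#include <cstddef>

// Pointer/iterator version: advance until we reach the end of the range.
void array_copy_while(int (&src)[2048], int (&dst)[2048])
{
    const int* first = src;
    const int* const last = src + 2048;
    int* out = dst;
    while (first != last)
        *out++ = *first++;
}

// Explicit-count version: loop over a known number of elements.
void array_copy_count(int (&src)[2048], int (&dst)[2048])
{
    for (std::size_t i = 0; i != 2048; ++i)
        dst[i] = src[i];
}

Both walk the same data; the only difference is whether the loop condition compares iterators or counts elements.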
On a Core i7-4770, I benchmarked the time spent copying a maximally aligned 2048-element integer array with the while-loop and explicit-count implementations. (Times are in microseconds and include call overhead; each figure is the minimum over 200 samples of a timing loop, after warm-up.)
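The timing harness isn't shown here; a rough sketch of the methodology just described (a warm-up call, then taking the minimum per-call time over repeated samples of a timing loop) would be something like the following, with the sample and repetition counts being assumptions, and ignoring the extra care needed to stop the compiler optimising the copy away:

#include <algorithm>
#include <chrono>
#include <ratio>

// Hypothetical harness: run f() reps times per sample and keep the best
// (minimum) per-call time in microseconds over all samples.
template <typename F>
double time_us(F f, int samples = 200, int reps = 1000)
{
    using clock = std::chrono::steady_clock;
    f();                                   // warm-up: caches, branch predictor
    double best = 1e300;
    for (int s = 0; s < samples; ++s)
    {
        auto t0 = clock::now();
        for (int r = 0; r < reps; ++r)
            f();
        auto t1 = clock::now();
        best = std::min(best,
            std::chrono::duration<double, std::micro>(t1 - t0).count() / reps);
    }
    return best;
}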
                                        while    count
gcc -O3                                 0.179    0.178
gcc -O3 -march=native                   0.097    0.095
gcc -O3 -march=native -funroll-loops    0.066    0.066
In each case the generated code is very similar; the while version does a bit more work at the end, handling the case where there are entries left to copy that didn't fill out a whole 128-bit (SSE) or 256-bit (AVX) register, but these checks are pretty much taken care of by the branch predictor. The gcc -O3 assembly for each is as follows (leaving out assembler directives). The while version:
array_copy_while(int (&) [2048], int (&) [2048]):
leaq 8192(%rdi), %rax
leaq 4(%rdi), %rdx
movq %rax, %rcx
subq %rdx, %rcx
movq %rcx, %rdx
shrq $2, %rdx
leaq 1(%rdx), %r8
cmpq $8, %r8
jbe .L11
leaq 16(%rsi), %rdx
cmpq %rdx, %rdi
leaq 16(%rdi), %rdx
setae %cl
cmpq %rdx, %rsi
setae %dl
orb %dl, %cl
je .L11
movq %r8, %r9
xorl %edx, %edx
xorl %ecx, %ecx
shrq $2, %r9
leaq 0(,%r9,4), %r10
.L9:
movdqa (%rdi,%rdx), %xmm0
addq $1, %rcx
movdqa %xmm0, (%rsi,%rdx)
addq $16, %rdx
cmpq %rcx, %r9
ja .L9
leaq 0(,%r10,4), %rdx
addq %rdx, %rdi
addq %rdx, %rsi
cmpq %r10, %r8
je .L1
movl (%rdi), %edx
movl %edx, (%rsi)
leaq 4(%rdi), %rdx
cmpq %rdx, %rax
je .L1
movl 4(%rdi), %edx
movl %edx, 4(%rsi)
leaq 8(%rdi), %rdx
cmpq %rdx, %rax
je .L20
movl 8(%rdi), %eax
movl %eax, 8(%rsi)
ret
.L11:
movl (%rdi), %edx
addq $4, %rdi
addq $4, %rsi
movl %edx, -4(%rsi)
cmpq %rdi, %rax
jne .L11
.L1:
rep ret
.L20:
rep ret
The count version:
array_copy_count(int (&) [2048], int (&) [2048]):
leaq 16(%rsi), %rax
movl $2048, %ecx
cmpq %rax, %rdi
leaq 16(%rdi), %rax
setae %dl
cmpq %rax, %rsi
setae %al
orb %al, %dl
je .L23
movw $512, %cx
xorl %eax, %eax
xorl %edx, %edx
.L29:
movdqa (%rdi,%rax), %xmm0
addq $1, %rdx
movdqa %xmm0, (%rsi,%rax)
addq $16, %rax
cmpq %rdx, %rcx
ja .L29
rep ret
.L23:
xorl %eax, %eax
.L31:
movl (%rdi,%rax,4), %edx
movl %edx, (%rsi,%rax,4)
addq $1, %rax
cmpq %rax, %rcx
jne .L31
rep ret
When the iterators are more complicated, however, the difference becomes more pronounced. Consider a hypothetical container that stores values in a series of fixed-size allocated buffers. An iterator comprises a pointer to the chain of blocks, a block index and a block offset. Comparing two iterators potentially requires two comparisons; incrementing one requires checking whether we pop over a block boundary.
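A minimal sketch of what such an iterator could look like (the names and layout here are hypothetical, since the real container isn't shown):

#include <cstddef>

// Hypothetical iterator for a container storing values in fixed-size blocks:
// a pointer to the chain of blocks, a block index and an offset within the block.
struct chunked_iterator
{
    static const std::size_t block_size = 512;

    int**       blocks;   // chain of allocated buffers
    std::size_t block;    // which buffer we are in
    std::size_t offset;   // position within that buffer

    int& operator*() const { return blocks[block][offset]; }

    // Incrementing has to check whether we pop over a block boundary.
    chunked_iterator& operator++()
    {
        if (++offset == block_size)
        {
            offset = 0;
            ++block;
        }
        return *this;
    }

    // Comparing two iterators potentially takes two comparisons.
    friend bool operator==(const chunked_iterator& a, const chunked_iterator& b)
    {
        return a.block == b.block && a.offset == b.offset;
    }
    friend bool operator!=(const chunked_iterator& a, const chunked_iterator& b)
    {
        return !(a == b);
    }
};

With iterators like this, the count version's loop condition is a single integer test, while the while version evaluates the two-field comparison on every iteration; both still pay for the boundary check inside operator++.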
I made such a container, and performed the same benchmark for copying a 2000-element container of int, with a block size of 512 ints.
                                        while    count
gcc -O3                                 1.560    2.818
gcc -O3 -march=native                   1.660    2.854
gcc -O3 -march=native -funroll-loops    1.432    2.858
That looks weird! Oh wait, it's because gcc 4.8 has a misoptimisation here, where it uses conditional moves instead of nice, branch-predictor-friendly comparisons (gcc bug 56309).
Let's try icc on a different machine (Xeon E5-2670).
                  while    count
icc -O3           3.952    3.704
icc -O3 -xHost    3.898    3.624
This is closer to what we'd expect: a small but significant improvement from the simpler loop condition. On a different architecture the gain is more pronounced; here is clang targeting a PowerPC A2 at 1.6 GHz:
               while     count
bgclang -O3    36.528    31.623
I'll omit the assembly, as it's quite long!