Answering the second question: the explicit count does indeed lead to more opportunities for loop unrolling, though even with pointers iterating through a fixed-size array, gcc does not perform aggressive unrolling unless asked to do so with -funroll-loops. The other gain comes from a potentially simpler end-of-loop comparison test for non-trivial iterators.
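For concreteness, here is a minimal sketch of the two shapes being compared. The function names match the labels in the assembly listings below, but the bodies are my reconstruction rather than necessarily the exact benchmarked code:

#include <cstddef>

// Pointer/iterator version: advance until we reach the end of the range.
void array_copy_while(int (&src)[2048], int (&dst)[2048])
{
    const int* first = src;
    const int* const last = src + 2048;
    int* out = dst;
    while (first != last)
        *out++ = *first++;
}

// Explicit-count version: loop over a known number of elements.
void array_copy_count(int (&src)[2048], int (&dst)[2048])
{
    for (std::size_t i = 0; i != 2048; ++i)
        dst[i] = src[i];
}

Both walk the same data; the only difference is whether the loop condition compares iterators or counts elements.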
On a Core i7-4770, I benchmarked the time spent copying a maximally aligned 2048-element integer array with the while-loop and explicit-count implementations. (Times are in microseconds and include call overhead; each figure is the minimum over 200 samples of a timing loop, after warm-up.)
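The timing harness isn't shown here; a rough sketch of the methodology just described (a warm-up call, then taking the minimum per-call time over repeated samples of a timing loop) would be something like the following, with the sample and repetition counts being assumptions, and ignoring the extra care needed to stop the compiler optimising the copy away:

#include <algorithm>
#include <chrono>
#include <ratio>

// Hypothetical harness: run f() reps times per sample and keep the best
// (minimum) per-call time in microseconds over all samples.
template <typename F>
double time_us(F f, int samples = 200, int reps = 1000)
{
    using clock = std::chrono::steady_clock;
    f();                                   // warm-up: caches, branch predictor
    double best = 1e300;
    for (int s = 0; s < samples; ++s)
    {
        auto t0 = clock::now();
        for (int r = 0; r < reps; ++r)
            f();
        auto t1 = clock::now();
        best = std::min(best,
            std::chrono::duration<double, std::micro>(t1 - t0).count() / reps);
    }
    return best;
}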
                                        while    count
gcc -O3                                 0.179    0.178
gcc -O3 -march=native                   0.097    0.095
gcc -O3 -march=native -funroll-loops    0.066    0.066
In each case the generated code is very similar; the while version does a bit more work at the end, handling the case where there are entries left to copy that didn't fill out a whole 128-bit (SSE) or 256-bit (AVX) register, but these checks are pretty much taken care of by the branch predictor. The gcc -O3 assembly for each is as follows (leaving out assembler directives). The while version:
array_copy_while(int (&) [2048], int (&) [2048]):
leaq 8192(%rdi), %rax
leaq 4(%rdi), %rdx
movq %rax, %rcx
subq %rdx, %rcx
movq %rcx, %rdx
shrq $2, %rdx
leaq 1(%rdx), %r8
cmpq $8, %r8
jbe .L11
leaq 16(%rsi), %rdx
cmpq %rdx, %rdi
leaq 16(%rdi), %rdx
setae %cl
cmpq %rdx, %rsi
setae %dl
orb %dl, %cl
je .L11
movq %r8, %r9
xorl %edx, %edx
xorl %ecx, %ecx
shrq $2, %r9
leaq 0(,%r9,4), %r10
.L9:
movdqa (%rdi,%rdx), %xmm0
addq $1, %rcx
movdqa %xmm0, (%rsi,%rdx)
addq $16, %rdx
cmpq %rcx, %r9
ja .L9
leaq 0(,%r10,4), %rdx
addq %rdx, %rdi
addq %rdx, %rsi
cmpq %r10, %r8
je .L1
movl (%rdi), %edx
movl %edx, (%rsi)
leaq 4(%rdi), %rdx
cmpq %rdx, %rax
je .L1
movl 4(%rdi), %edx
movl %edx, 4(%rsi)
leaq 8(%rdi), %rdx
cmpq %rdx, %rax
je .L20
movl 8(%rdi), %eax
movl %eax, 8(%rsi)
ret
.L11:
movl (%rdi), %edx
addq $4, %rdi
addq $4, %rsi
movl %edx, -4(%rsi)
cmpq %rdi, %rax
jne .L11
.L1:
rep ret
.L20:
rep ret
The count version:
array_copy_count(int (&) [2048], int (&) [2048]):
leaq 16(%rsi), %rax
movl $2048, %ecx
cmpq %rax, %rdi
leaq 16(%rdi), %rax
setae %dl
cmpq %rax, %rsi
setae %al
orb %al, %dl
je .L23
movw $512, %cx
xorl %eax, %eax
xorl %edx, %edx
.L29:
movdqa (%rdi,%rax), %xmm0
addq $1, %rdx
movdqa %xmm0, (%rsi,%rax)
addq $16, %rax
cmpq %rdx, %rcx
ja .L29
rep ret
.L23:
xorl %eax, %eax
.L31:
movl (%rdi,%rax,4), %edx
movl %edx, (%rsi,%rax,4)
addq $1, %rax
cmpq %rax, %rcx
jne .L31
rep ret
When the iterators are more complicated, however, the difference becomes more pronounced. Consider a hypothetical container that stores values in a series of fixed-size allocated buffers. An iterator comprises a pointer to the chain of blocks, a block index and a block offset. Comparing two iterators potentially requires two comparisons; incrementing one requires checking whether we pop over a block boundary.
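A minimal sketch of what such an iterator could look like (the names and layout here are hypothetical, since the real container isn't shown):

#include <cstddef>

// Hypothetical iterator for a container storing values in fixed-size blocks:
// a pointer to the chain of blocks, a block index and an offset within the block.
struct chunked_iterator
{
    static const std::size_t block_size = 512;

    int**       blocks;   // chain of allocated buffers
    std::size_t block;    // which buffer we are in
    std::size_t offset;   // position within that buffer

    int& operator*() const { return blocks[block][offset]; }

    // Incrementing has to check whether we pop over a block boundary.
    chunked_iterator& operator++()
    {
        if (++offset == block_size)
        {
            offset = 0;
            ++block;
        }
        return *this;
    }

    // Comparing two iterators potentially takes two comparisons.
    friend bool operator==(const chunked_iterator& a, const chunked_iterator& b)
    {
        return a.block == b.block && a.offset == b.offset;
    }
    friend bool operator!=(const chunked_iterator& a, const chunked_iterator& b)
    {
        return !(a == b);
    }
};

With iterators like this, the count version's loop condition is a single integer test, while the while version evaluates the two-field comparison on every iteration; both still pay for the boundary check inside operator++.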
I made such a container, and performed the same benchmark for copying a 2000-element container of int, with a block size of 512 ints.
                                        while    count
gcc -O3                                 1.560    2.818
gcc -O3 -march=native                   1.660    2.854
gcc -O3 -march=native -funroll-loops    1.432    2.858
That looks weird! Oh wait, it's because gcc 4.8 has a misoptimisation here, where it uses conditional moves instead of nice, branch-predictor-friendly comparisons (gcc bug 56309).
Let's try icc on a different machine (Xeon E5-2670).
                  while    count
icc -O3           3.952    3.704
icc -O3 -xHost    3.898    3.624
This is closer to what we'd expect: a small but significant improvement from the simpler loop condition. On a different architecture the gain is more pronounced; here is clang targeting a PowerPC A2 at 1.6 GHz:
               while     count
bgclang -O3    36.528    31.623
I'll omit the assembly, as it's quite long!