
I was playing around with the Compiler Explorer, and I'm struggling to understand the ASM output (x86 Clang 3.7 -O3) of a simple std::vector<int> sum function:

#include <vector>
#include <numeric>

int sum(const std::vector<int>& v)
{
    return std::accumulate(v.begin(), v.end(), 0);
}

The ASM for this code is:

sum(std::vector<int, std::allocator<int> > const&):              # @sum(std::vector<int, std::allocator<int> > const&)
        movq    (%rdi), %rsi
        movq    8(%rdi), %r11
        xorl    %eax, %eax
        cmpq    %r11, %rsi
        je      .LBB0_13
        movabsq $9223372036854775800, %rax # imm = 0x7FFFFFFFFFFFFFF8
        leaq    -4(%r11), %rdx
        movq    %rdx, %r10
        subq    %rsi, %r10
        shrq    $2, %r10
        incq    %r10
        xorl    %edi, %edi
        movq    %r10, %r8
        andq    %rax, %r8
        pxor    %xmm0, %xmm0
        je      .LBB0_2
        andq    %r10, %rax
        leaq    -8(%rax), %r9
        movl    %r9d, %ecx
        shrl    $3, %ecx
        incl    %ecx
        xorl    %edi, %edi
        testb   $3, %cl
        je      .LBB0_4
        subl    %esi, %edx
        shrl    $2, %edx
        incl    %edx
        andl    $24, %edx
        addl    $-8, %edx
        shrl    $3, %edx
        incl    %edx
        andl    $3, %edx
        negq    %rdx
        pxor    %xmm0, %xmm0
        xorl    %edi, %edi
        pxor    %xmm1, %xmm1
.LBB0_6:                                # %vector.body.prol
        movdqu  (%rsi,%rdi,4), %xmm2
        movdqu  16(%rsi,%rdi,4), %xmm3
        paddd   %xmm2, %xmm0
        paddd   %xmm3, %xmm1
        addq    $8, %rdi
        incq    %rdx
        jne     .LBB0_6
        jmp     .LBB0_7
.LBB0_2:
        pxor    %xmm1, %xmm1
        jmp     .LBB0_11
.LBB0_4:
        pxor    %xmm0, %xmm0
        pxor    %xmm1, %xmm1
.LBB0_7:                                # %vector.body.preheader.split
        leaq    (%rsi,%r8,4), %rdx
        cmpq    $24, %r9
        jb      .LBB0_10
        subq    %rdi, %rax
        leaq    112(%rsi,%rdi,4), %rsi
.LBB0_9:                                # %vector.body
        movdqu  -112(%rsi), %xmm2
        movdqu  -96(%rsi), %xmm3
        movdqu  -80(%rsi), %xmm4
        movdqu  -64(%rsi), %xmm5
        paddd   %xmm0, %xmm2
        paddd   %xmm1, %xmm3
        paddd   %xmm4, %xmm2
        paddd   %xmm5, %xmm3
        movdqu  -48(%rsi), %xmm4
        movdqu  -32(%rsi), %xmm5
        paddd   %xmm2, %xmm4
        paddd   %xmm3, %xmm5
        movdqu  -16(%rsi), %xmm0
        movdqu  (%rsi), %xmm1
        paddd   %xmm4, %xmm0
        paddd   %xmm5, %xmm1
        subq    $-128, %rsi
        addq    $-32, %rax
        jne     .LBB0_9
.LBB0_10:
        movq    %rdx, %rsi
        movq    %r8, %rdi
.LBB0_11:                               # %middle.block
        paddd   %xmm1, %xmm0
        pshufd  $78, %xmm0, %xmm1       # xmm1 = xmm0[2,3,0,1]
        paddd   %xmm0, %xmm1
        pshufd  $229, %xmm1, %xmm0      # xmm0 = xmm1[1,1,2,3]
        paddd   %xmm1, %xmm0
        movd    %xmm0, %eax
        cmpq    %rdi, %r10
        je      .LBB0_13
.LBB0_12:                               # %.lr.ph.i
        addl    (%rsi), %eax
        addq    $4, %rsi
        cmpq    %rsi, %r11
        jne     .LBB0_12
.LBB0_13:                               # %int std::accumulate<__gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, int>(__gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, int) [clone .exit]
        retq

For comparison, the ASM for the same function but using std::vector<double> is:

sum(std::vector<double, std::allocator<double> > const&):
        movq    8(%rdi), %rdx
        movq    (%rdi), %rax
        pxor    %xmm0, %xmm0
        cmpq    %rax, %rdx
        je      .L4
.L3:
        addsd   (%rax), %xmm0
        addq    $8, %rax
        cmpq    %rax, %rdx
        jne     .L3
        rep ret
.L4:
        rep ret

The ASM for std::vector<double> seems fairly trivial, while the ASM for std::vector<int> appears markedly more complex. I'm assuming there is some clever optimisation going on with std::vector<int>, but I'm at a bit of a loss to explain what's going on. Could someone enlighten me?

Daniel

1 Answer


Short answer: the compiler has vectorised and unrolled the loop that adds the integers. Compare this with the vector<double> version, which has these lines:

addsd   (%rax), %xmm0
addq    $8, %rax

This means it's adding a single double into the running sum, then moving the pointer on by 8 bytes and looping.
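As a rough sketch (my own illustration, not the library source), that scalar loop corresponds to something like:

// Plain C++ approximation of the double loop: one addsd per element,
// with the pointer advanced by sizeof(double) == 8 bytes each iteration.
double sum_doubles(const double* first, const double* last)
{
    double acc = 0.0;               // pxor %xmm0, %xmm0
    for (; first != last; ++first)  // cmpq %rax, %rdx / jne .L3
        acc += *first;              // addsd (%rax), %xmm0 / addq $8, %rax
    return acc;
}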

The same code in the main loop of the vector<int> version does:

movdqu  -112(%rsi), %xmm2
movdqu  -96(%rsi), %xmm3
movdqu  -80(%rsi), %xmm4
movdqu  -64(%rsi), %xmm5
...
movdqu  -48(%rsi), %xmm4
movdqu  -32(%rsi), %xmm5
...
movdqu  -16(%rsi), %xmm0
...
movdqu  (%rsi), %xmm1
...
subq    $-128, %rsi

Each movdqu loads 16 bytes at once (4 ints), and the subq $-128, %rsi shows it's getting through 128 bytes (or 32 ints) per pass of the loop, across those 8 loads. The net effect of each iteration is to add the next 32 ints into the 8 int slots held across xmm0 and xmm1.
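Roughly, the main loop behaves like the following SSE2 intrinsics sketch (my own illustration with made-up names, not what the compiler literally emitted; the real loop chains its paddd instructions a little differently, but it accumulates into the same two registers):

#include <emmintrin.h>  // SSE2: _mm_loadu_si128, _mm_add_epi32
#include <cstddef>

// Illustration of the unrolled main loop: two 4-lane accumulators,
// eight unaligned 16-byte loads, 32 ints consumed per iteration.
// Assumes count is a multiple of 32; the real code peels the remainder.
void main_loop_sketch(const int* p, std::size_t count,
                      __m128i& acc0, __m128i& acc1)
{
    for (std::size_t i = 0; i < count; i += 32) {
        for (std::size_t k = 0; k < 8; k += 2) {
            __m128i a = _mm_loadu_si128(
                reinterpret_cast<const __m128i*>(p + i + 4 * k));       // movdqu
            __m128i b = _mm_loadu_si128(
                reinterpret_cast<const __m128i*>(p + i + 4 * (k + 1))); // movdqu
            acc0 = _mm_add_epi32(acc0, a);                              // paddd
            acc1 = _mm_add_epi32(acc1, b);                              // paddd
        }
    }
}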

.LBB0_11 then takes the output from the main loop (8 partial sums spread across xmm0 and xmm1) and reduces them to a single sum.
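That reduction is a horizontal add, roughly like this (again my own intrinsics rendering of those instructions):

#include <emmintrin.h>  // SSE2: _mm_add_epi32, _mm_shuffle_epi32, _mm_cvtsi128_si32

// Fold the two accumulators together, then horizontally add the four
// 32-bit lanes down to a single int.
int horizontal_sum(__m128i acc0, __m128i acc1)
{
    __m128i v = _mm_add_epi32(acc0, acc1);              // paddd %xmm1, %xmm0
    v = _mm_add_epi32(v, _mm_shuffle_epi32(v, 78));     // pshufd $78  (swap the two halves)
    v = _mm_add_epi32(v, _mm_shuffle_epi32(v, 229));    // pshufd $229 (bring lane 1 to lane 0)
    return _mm_cvtsi128_si32(v);                        // movd %xmm0, %eax
}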

.LBB0_12 then finishes off any ints at the end of the vector that couldn't be consumed by the main loop (since the main loop works on 32 integers at a time).
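In C++ terms the tail is just the familiar scalar loop, continuing from wherever the vectorised part stopped (sketch with my own names):

// Scalar tail: add any leftover ints one at a time, starting from the
// partial sum produced by the vectorised main loop.
int sum_tail(const int* first, const int* last, int acc)
{
    for (; first != last; ++first)  // addl (%rsi), %eax / addq $4, %rsi / cmpq/jne
        acc += *first;
    return acc;
}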

It vectorises the adds so it can handle 4 integers at once, which is generally faster than adding one integer at a time, and it unrolls the loop so that each pass does the work of several iterations.
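Putting the sketches above together, the whole function behaves roughly like this (illustration only; the real code also has a prologue that peels off a few vector iterations first, and sum_like_clang is my own name):

#include <vector>
#include <cstddef>
#include <emmintrin.h>

// Approximation of the generated code: vectorised/unrolled main loop,
// horizontal reduction, then a scalar tail for the leftovers.
int sum_like_clang(const std::vector<int>& v)
{
    const int* p = v.data();
    std::size_t n = v.size();
    std::size_t vec_n = n & ~std::size_t(31);   // largest multiple of 32
    __m128i acc0 = _mm_setzero_si128();         // pxor %xmm0, %xmm0
    __m128i acc1 = _mm_setzero_si128();         // pxor %xmm1, %xmm1
    main_loop_sketch(p, vec_n, acc0, acc1);     // 32 ints per iteration
    return sum_tail(p + vec_n, p + n, horizontal_sum(acc0, acc1));
}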

Explanation of vectorization: What does vectorization mean?

Explanation of loop unrolling: When, if ever, is loop unrolling still useful?

I've not analysed the start of the code for the integer case, but generally this kind of preamble is setting up the main loop - working out how many whole vector passes there are and peeling off the first few iterations so that the unrolled, 16-bytes-at-a-time loop can run cleanly.

Mike Vine