
Suppose I have two vectors a and b, stored as a vector. I want to compute a += b or a += b * k, where k is a number.

I can for sure do the following,

while (size--) {
    (*a++) += (*b++) * k;
}

But what are the possible ways to easily leverage SIMD instructions such as SSE2?

Nan Hua
  • when you say vector, do you mean std::vector or a native array? – AndersK Oct 22 '15 at 05:51
  • Are you sure you want to use the increment operators in the same line? They make it hard to read for me at least. – therainmaker Oct 22 '15 at 05:52
  • How about adopting `Eigen`? – findall Oct 22 '15 at 05:54
  • What is the underlying data type? `int`? `float`? `double`? – Paul R Oct 22 '15 at 07:33
  • Note that current compilers are often quite good at applying SIMD optimisation to simple loops such as this already, so there may be little or no point in hand-optimising - check the generated code for your current implementation to see if it's already using SSE instructions before you dive in. – Paul R Oct 22 '15 at 07:47

2 Answers


The only thing you should need is to enable auto-vectorization with your compiler.
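The code being compiled here is presumably the question's loop wrapped in a function along these lines (the float element type and the exact signature are my assumptions):

void foo(float *a, float *b, float k, int size) {
    while (size--) {
        (*a++) += (*b++) * k;
    }
}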

For example, compiling your code (assuming float) with GCC 5.2.0 and -O3 produces this main loop:

L8:
    movups  (%rsi,%rax), %xmm1
    addl    $1, %r11d
    mulps   %xmm2, %xmm1
    addps   (%rdi,%rax), %xmm1
    movaps  %xmm1, (%rdi,%rax)
    addq    $16, %rax
    cmpl    %r11d, %r10d
    ja  .L8

Clang also vectorizes the loop, and additionally unrolls it four times. Unrolling may help on some processors, especially Haswell, even though there is no dependency chain. In fact, you can get GCC to unroll as well by adding -funroll-loops; in this case GCC unrolls to eight independent operations, unlike the case where there is a dependency chain.

One problem you may encounter is that your compiler may need to add some code to determine whether the arrays overlap, and then create two branches: one without vectorization for when they do overlap and one with vectorization for when they don't. GCC and Clang both do this, but ICC does not vectorize the loop.
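Conceptually, the runtime check GCC and Clang emit amounts to something like this (only a sketch of the idea, not the exact code they generate):

// the arrays cannot overlap if one ends before the other begins
if (a + size <= b || b + size <= a) {
    // disjoint: take the vectorized loop
} else {
    // possible overlap: fall back to the scalar loop
}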

ICC 13.0.01 with -O3

..B1.4:                         # Preds ..B1.2 ..B1.4
        movss     (%rsi), %xmm1                                 #3.21
        incl      %ecx                                          #2.5
        mulss     %xmm0, %xmm1                                  #3.28
        addss     (%rdi), %xmm1                                 #3.11
        movss     %xmm1, (%rdi)                                 #3.11
        movss     4(%rsi), %xmm2                                #3.21
        addq      $8, %rsi                                      #3.21
        mulss     %xmm0, %xmm2                                  #3.28
        addss     4(%rdi), %xmm2                                #3.11
        movss     %xmm2, 4(%rdi)                                #3.11
        addq      $8, %rdi                                      #3.11
        cmpl      %eax, %ecx                                    #2.5
        jb        ..B1.4        # Prob 63%                      #2.5

To fix this, you need to tell the compiler that the arrays don't overlap, using the __restrict keyword:

void foo(float * __restrict a, float * __restrict b, float k, int size) {
    while (size--) {
        (*a++) += (*b++) * k;
    }
}

In this case ICC produces two branches: one for when the arrays are 16-byte aligned and one for when they are not. Here is the aligned branch:

..B1.16:                        # Preds ..B1.16 ..B1.15
        movaps    (%rsi), %xmm2                                 #3.21
        addl      $8, %r8d                                      #2.5
        movaps    16(%rsi), %xmm3                               #3.21
        addq      $32, %rsi                                     #1.6
        mulps     %xmm1, %xmm2                                  #3.28
        mulps     %xmm1, %xmm3                                  #3.28
        addps     (%rdi), %xmm2                                 #3.11
        addps     16(%rdi), %xmm3                               #3.11
        movaps    %xmm2, (%rdi)                                 #3.11
        movaps    %xmm3, 16(%rdi)                               #3.11
        addq      $32, %rdi                                     #1.6
        cmpl      %ecx, %r8d                                    #2.5
        jb        ..B1.16       # Prob 82%                      #2.5

ICC unrolls twice in both cases. Even though GCC and Clang produce both a vectorized and an unvectorized branch without __restrict, you may want to use __restrict anyway to remove the overhead of the code that determines which branch to use.

The last thing you can try is to tell the compiler that the arrays are aligned. This works with GCC and Clang (3.6):

void foo(float * __restrict a, float * __restrict b, float k, int size) {
    a = (float*)__builtin_assume_aligned (a, 32);
    b = (float*)__builtin_assume_aligned (b, 32);
    while (size--) {
        (*a++) += (*b++) * k;
    }
}

In this case GCC produces:

.L4:
    movaps  (%rsi,%r8), %xmm1
    addl    $1, %r10d
    mulps   %xmm2, %xmm1
    addps   (%rdi,%r8), %xmm1
    movaps  %xmm1, (%rdi,%r8)
    addq    $16, %r8
    cmpl    %r10d, %eax
    ja  .L4
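Promising 32-byte alignment is of course only valid if the arrays really are allocated that way. A minimal sketch using C11 aligned_alloc (the allocation scheme is my own assumption, not part of the question):

#include <stdlib.h>

// Allocate n floats with 32-byte alignment so the promise made via
// __builtin_assume_aligned (or omp simd aligned) actually holds.
// aligned_alloc requires the total size to be a multiple of the alignment.
float* alloc_floats_32(size_t n) {
    size_t bytes = ((n * sizeof(float) + 31) / 32) * 32;
    return (float*)aligned_alloc(32, bytes);
}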

Lastly, if your compiler supports OpenMP 4.0, you can use OpenMP like this:

void foo(float * __restrict a, float * __restrict b, float k, int size) {
    #pragma omp simd aligned(a:32) aligned(b:32)
    for(int i=0; i<size; i++) {
        a[i] += k*b[i];
    }
}

GCC produces the same code in this case as when using __builtin_assume_aligned. This should work for a more recent version of ICC (which I don't have).
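Note that with GCC the pragma only takes effect when OpenMP SIMD support is enabled, so compile with something like (file name assumed for illustration):

// gcc -O3 -fopenmp-simd foo.c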

I did not check MSVC. I expect it vectorizes this loop as well.

For more details about restrict, and about the compiler producing different branches for the overlapping/non-overlapping and aligned/unaligned cases, see sum-of-overlapping-arrays-auto-vectorization-and-restrict.


Here is one more suggestion to consider. If you know that the range of the loop is a multiple of the SIMD width, the compiler will not have to emit cleanup code. The following code

// gcc -O3
// n = size/8
void foo(float * __restrict a, float * __restrict b, float k, int n) {
    a = (float*)__builtin_assume_aligned (a, 32);
    b = (float*)__builtin_assume_aligned (b, 32);
    //#pragma omp simd aligned(a:32) aligned(b:32)
    for(int i=0; i<n*8; i++) {
        a[i] += k*b[i];
    }
}

produces the simplest assembly so far.

foo(float*, float*, float, int):
    sall    $2, %edx
    testl   %edx, %edx
    jle .L1
    subl    $4, %edx
    shufps  $0, %xmm0, %xmm0
    shrl    $2, %edx
    xorl    %eax, %eax
    xorl    %ecx, %ecx
    addl    $1, %edx
.L4:
    movaps  (%rsi,%rax), %xmm1
    addl    $1, %ecx
    mulps   %xmm0, %xmm1
    addps   (%rdi,%rax), %xmm1
    movaps  %xmm1, (%rdi,%rax)
    addq    $16, %rax
    cmpl    %edx, %ecx
    jb  .L4
.L1:
    rep ret

I used a multiple of eight and 32-byte alignment because then, just by using the compiler switch -mavx, the compiler produces nice AVX vectorization:

foo(float*, float*, float, int):
    sall    $3, %edx
    testl   %edx, %edx
    jle .L5
    vshufps $0, %xmm0, %xmm0, %xmm0
    subl    $8, %edx
    xorl    %eax, %eax
    shrl    $3, %edx
    xorl    %ecx, %ecx
    addl    $1, %edx
    vinsertf128 $1, %xmm0, %ymm0, %ymm0
.L4:
    vmulps  (%rsi,%rax), %ymm0, %ymm1
    addl    $1, %ecx
    vaddps  (%rdi,%rax), %ymm1, %ymm1
    vmovaps %ymm1, (%rdi,%rax)
    addq    $32, %rax
    cmpl    %edx, %ecx
    jb  .L4
    vzeroupper
.L5:
    rep ret

I am not sure how the preamble could be made simpler, but the only improvement I see left is to remove one of the iterators and a compare: the addl $1, %ecx instruction should not be necessary, and neither should the cmpl %edx, %ecx. I'm not sure how to get GCC to fix this. I have had a problem like this with GCC before (Produce loops without cmp instruction in GCC).

Z boson

The functions SAXPY (single-precision), DAXPY (double-precision), CAXPY (complex single-precision), and ZAXPY (complex double-precision) compute exactly the expression you want:

Y = a * X + Y

where a is a scalar constant, and X and Y are vectors.

These functions are provided by BLAS libraries and are optimized for all practical platforms: for CPUs the best BLAS implementations are OpenBLAS, Intel MKL (optimized for Intel x86 processors and Xeon Phi co-processors only), BLIS, and Apple Accelerate (OS X only); for NVIDIA GPUs look at cuBLAS (part of the CUDA SDK), and for any GPU, ArrayFire.

These libraries are well-optimized and deliver better performance than whatever implementation you can quickly hack up.
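As an illustration, with a CBLAS interface (as provided by OpenBLAS or MKL, for example) the update from the question maps onto a single SAXPY call. A minimal sketch, assuming an OpenBLAS/MKL-style cblas.h is available:

#include <cblas.h>

// a[i] += k * b[i] for i in [0, size), i.e. the BLAS operation Y := alpha*X + Y
// with Y = a, X = b, alpha = k.
void foo(float *a, const float *b, float k, int size) {
    cblas_saxpy(size, k, b, 1, a, 1);
}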

Marat Dukhan
  • My experience is that for simple vector operations, such as the one in your example, the hand-written loop is at least as fast as the BLAS function because it comes without the overhead of BLAS. – Chiel Oct 22 '15 at 06:35