The only thing you should need to do is enable auto-vectorization in your compiler. For example, compiling your code (assuming float) with GCC (5.2.0) at -O3 produces this main loop:
L8:
movups (%rsi,%rax), %xmm1
addl $1, %r11d
mulps %xmm2, %xmm1
addps (%rdi,%rax), %xmm1
movaps %xmm1, (%rdi,%rax)
addq $16, %rax
cmpl %r11d, %r10d
ja .L8
Clang also vectorizes the loop, and additionally unrolls it four times. Unrolling may help on some processors, especially Haswell, even though there is no dependency chain. In fact, you can get GCC to unroll as well by adding -funroll-loops. In this case GCC unrolls to eight independent operations, unlike the case where there is a dependency chain.
One problem you may encounter is that your compiler may need to add some code to determine whether the arrays overlap, and generate two branches: one without vectorization for when they do overlap and one with vectorization for when they don't. GCC and Clang both do this, but ICC does not vectorize the loop at all. Here is ICC 13.0.01 with -O3:
..B1.4: # Preds ..B1.2 ..B1.4
movss (%rsi), %xmm1 #3.21
incl %ecx #2.5
mulss %xmm0, %xmm1 #3.28
addss (%rdi), %xmm1 #3.11
movss %xmm1, (%rdi) #3.11
movss 4(%rsi), %xmm2 #3.21
addq $8, %rsi #3.21
mulss %xmm0, %xmm2 #3.28
addss 4(%rdi), %xmm2 #3.11
movss %xmm2, 4(%rdi) #3.11
addq $8, %rdi #3.11
cmpl %eax, %ecx #2.5
jb ..B1.4 # Prob 63% #2.5
To fix this, you need to tell the compiler the arrays don't overlap using the __restrict keyword:
void foo(float * __restrict a, float * __restrict b, float k, int size) {
    while (size--) {
        (*a++) += (*b++) * k;
    }
}
In this case ICC produces two branches: one for when the arrays are 16-byte aligned and one for when they are not. Here is the aligned branch:
..B1.16: # Preds ..B1.16 ..B1.15
movaps (%rsi), %xmm2 #3.21
addl $8, %r8d #2.5
movaps 16(%rsi), %xmm3 #3.21
addq $32, %rsi #1.6
mulps %xmm1, %xmm2 #3.28
mulps %xmm1, %xmm3 #3.28
addps (%rdi), %xmm2 #3.11
addps 16(%rdi), %xmm3 #3.11
movaps %xmm2, (%rdi) #3.11
movaps %xmm3, 16(%rdi) #3.11
addq $32, %rdi #1.6
cmpl %ecx, %r8d #2.5
jb ..B1.16 # Prob 82% #2.5
ICC unrolls twice in both cases. Even though GCC and Clang produce vectorized and unvectorized branches without __restrict, you may want to use __restrict anyway to remove the runtime overhead of determining which branch to take.
The last thing you can try is to tell the compiler the arrays are aligned. This works with GCC and Clang (3.6):
void foo(float * __restrict a, float * __restrict b, float k, int size) {
    a = (float*)__builtin_assume_aligned (a, 32);
    b = (float*)__builtin_assume_aligned (b, 32);
    while (size--) {
        (*a++) += (*b++) * k;
    }
}
In this case GCC produces:
.L4:
movaps (%rsi,%r8), %xmm1
addl $1, %r10d
mulps %xmm2, %xmm1
addps (%rdi,%r8), %xmm1
movaps %xmm1, (%rdi,%r8)
addq $16, %r8
cmpl %r10d, %eax
ja .L4
Lastly, if your compiler supports OpenMP 4.0, you can use OpenMP like this (with GCC, enable the pragma with -fopenmp or -fopenmp-simd):
void foo(float * __restrict a, float * __restrict b, float k, int size) {
    #pragma omp simd aligned(a:32) aligned(b:32)
    for(int i=0; i<size; i++) {
        a[i] += k*b[i];
    }
}
In this case GCC produces the same code as with __builtin_assume_aligned. This should also work with a more recent version of ICC (which I don't have).
I did not check MSVC. I expect it vectorizes this loop as well.
For more details about restrict, and about the compiler producing different branches for overlapping vs. non-overlapping and aligned vs. unaligned arrays, see sum-of-overlapping-arrays-auto-vectorization-and-restrict.
Here is one more suggestion to consider. If you know that the range of the loop is a multiple of the SIMD width, the compiler will not have to emit cleanup code. The following code
// gcc -O3
// n = size/8
void foo(float * __restrict a, float * __restrict b, float k, int n) {
    a = (float*)__builtin_assume_aligned (a, 32);
    b = (float*)__builtin_assume_aligned (b, 32);
    //#pragma omp simd aligned(a:32) aligned(b:32)
    for(int i=0; i<n*8; i++) {
        a[i] += k*b[i];
    }
}
produces the simplest assembly so far.
foo(float*, float*, float, int):
sall $2, %edx
testl %edx, %edx
jle .L1
subl $4, %edx
shufps $0, %xmm0, %xmm0
shrl $2, %edx
xorl %eax, %eax
xorl %ecx, %ecx
addl $1, %edx
.L4:
movaps (%rsi,%rax), %xmm1
addl $1, %ecx
mulps %xmm0, %xmm1
addps (%rdi,%rax), %xmm1
movaps %xmm1, (%rdi,%rax)
addq $16, %rax
cmpl %edx, %ecx
jb .L4
.L1:
rep ret
I used a multiple of 8 and 32-byte alignment because then, just by adding the compiler switch -mavx, the compiler produces nice AVX vectorization.
foo(float*, float*, float, int):
sall $3, %edx
testl %edx, %edx
jle .L5
vshufps $0, %xmm0, %xmm0, %xmm0
subl $8, %edx
xorl %eax, %eax
shrl $3, %edx
xorl %ecx, %ecx
addl $1, %edx
vinsertf128 $1, %xmm0, %ymm0, %ymm0
.L4:
vmulps (%rsi,%rax), %ymm0, %ymm1
addl $1, %ecx
vaddps (%rdi,%rax), %ymm1, %ymm1
vmovaps %ymm1, (%rdi,%rax)
addq $32, %rax
cmpl %edx, %ecx
jb .L4
vzeroupper
.L5:
rep ret
I am not sure how the preamble could be made simpler, but the only improvement I see left is to remove one of the iterators and a compare. Namely, the addl $1, %ecx instruction should not be necessary, and neither should the cmpl %edx, %ecx. I'm not sure how to get GCC to fix this. I have had a problem like this with GCC before (see Produce loops without cmp instruction in GCC).