All the code below is compiled with -O3 -xHost (or, for gfortran, -Ofast -march=native).
example1
subroutine f(a,b,c)
  implicit none
  integer, parameter :: r8=selected_real_kind(15,9)
  real(kind=r8), intent(in)  :: a(1000), b(1000)
  real(kind=r8), intent(out) :: c(1000)
  c(:)=a(:)+b(:)
  return
end subroutine f
The above code should be vectorized, right? real(kind=r8) is a 64-bit (8-byte) double, and AVX2 provides 256-bit registers, so one vector instruction can do something like c(1:4) = a(1:4) + b(1:4), since 256/64 = 4.
But why do I frequently find that writing the explicit loop is faster? See example 2:
example2
subroutine f(a,b,c)
  implicit none
  integer, parameter :: r8=selected_real_kind(15,9)
  real(kind=r8), intent(in)  :: a(1000), b(1000)
  real(kind=r8), intent(out) :: c(1000)
  integer :: i
  do i=1,1000
    c(i)=a(i)+b(i)
  end do
  return
end subroutine f
I notice that in example 1 the compiler may complain that the arrays are not aligned, or something along those lines, and in practice example 2 turns out faster than example 1. But in principle examples 1 and 2 should run at the same speed, right? In example 2 the compiler should be smart enough to vectorize the loop in the same way.
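One way to settle this is to ask the compiler for its vectorization report (e.g. -fopt-info-vec in gfortran, -qopt-report in ifort) and to check both variants in one program. Below is a minimal, self-contained sketch of that comparison; the names f_array and f_loop and the driver program are my own invention, while the subroutine bodies and the 1000-element size come from the examples above.

```fortran
program compare
  implicit none
  integer, parameter :: r8 = selected_real_kind(15,9)
  integer, parameter :: n  = 1000
  real(kind=r8) :: a(n), b(n), c1(n), c2(n)
  call random_number(a)
  call random_number(b)
  call f_array(a,b,c1)   ! example 1: array syntax
  call f_loop (a,b,c2)   ! example 2: explicit do loop
  ! Same arithmetic in the same order, so the results must agree exactly.
  print *, 'max difference:', maxval(abs(c1-c2))
contains
  subroutine f_array(a,b,c)
    real(kind=r8), intent(in)  :: a(n), b(n)
    real(kind=r8), intent(out) :: c(n)
    c(:) = a(:) + b(:)
  end subroutine f_array
  subroutine f_loop(a,b,c)
    real(kind=r8), intent(in)  :: a(n), b(n)
    real(kind=r8), intent(out) :: c(n)
    integer :: i
    do i = 1, n
      c(i) = a(i) + b(i)
    end do
  end subroutine f_loop
end program compare
```

Compiling with, e.g., gfortran -Ofast -march=native -fopt-info-vec should report whether the array assignment and the do loop were each vectorized; if both are, any remaining speed difference comes from alignment peeling or remainder loops rather than from vectorization itself.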
Finally, in the code below, example 3, is there any vectorization?
example3
subroutine f(a,b,c)
  implicit none
  integer, parameter :: r8=selected_real_kind(15,9)
  real(kind=r8), intent(in)  :: a(1000), b(1000)
  real(kind=r8), intent(out) :: c(1000)
  c(:) = exp(a(:)) + log(b(:))
  return
end subroutine f
or for slightly more complicated things, like example 4?
example4
subroutine f(a,b,c)
  implicit none
  integer, parameter :: r8=selected_real_kind(15,9)
  real(kind=r8), intent(in)  :: a(1000), b(1000)
  real(kind=r8), intent(out) :: c(1000)
  c(:) = exp(a(:)) + log(b(:)) + exp(a(:)) * log(b(:))
  return
end subroutine f
I mean, is vectorization only for very basic operations, op = + - * /, of the form
a(:) = b(:) op c(:)
whereas for something more complicated, like
a(:) = log(b(:) + c(:)*exp(a(:))*log(b(:)))
vectorization may not work?
It seems that in many cases using a do loop is faster than writing things like a(:)=b(:)+c(:). The compilers seem to do a very good job of, or be highly optimized for, plain do loops.