I need some help understanding why an optimization I tried works as well as it does. The cumsum function takes a vector and writes a vector containing the accumulated (prefix) sums.
To optimize it, instead of one loop over the entire vector, I wrote a single loop that advances through the four quarters of the vector simultaneously; each quarter is then adjusted by adding the total of the preceding quarters. The result differs slightly from the straight loop (the additions are reassociated, so the rounding differs in the last bits), but that is not a problem for me.
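To make the splitting/adjusting idea concrete, here is a toy two-part version on a length-8 vector (just an illustration for this post, not the code I benchmark below):

program toy_split
   implicit none
   integer, parameter :: dp = kind(1d0)
   real(dp) :: a(8), b(8)
   integer :: i
   a = [(real(i, dp), i = 1, 8)]
   ! Accumulate the two halves independently
   b(1) = a(1)
   b(5) = a(5)
   do i = 2, 4
      b(i)   = a(i)   + b(i-1)
      b(i+4) = a(i+4) + b(i+4-1)
   end do
   ! Adjust the second half by the total of the first half
   b(5:8) = b(5:8) + b(4)
   print *, b   ! the full cumulative sum: 1, 3, 6, 10, 15, 21, 28, 36
end program toy_split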
Here is the program:
module cumsum_mod
   implicit none
   integer, parameter, private :: dp = kind(1d0)
contains

   ! cumsum in one straight loop
   subroutine cumsum1(n, a, b)
      integer :: n, i
      real(dp) :: a(n), b(n)
      b(1) = a(1)
      do i = 2, n
         b(i) = a(i) + b(i-1)
      end do
   end subroutine

   ! cumsum with four independent partial sums, adjusted with array statements
   subroutine cumsum2(n, a, b)
      integer :: n, i, m
      real(dp) :: a(n), b(n)
      m = n/4
      ! Accumulate the four parts independently in a single loop
      b(1) = a(1)
      b(1+m) = a(1+m)
      b(1+2*m) = a(1+2*m)
      b(1+3*m) = a(1+3*m)
      do i = 2, m
         b(i) = a(i) + b(i-1)
         b(i+m) = a(i+m) + b(i+m-1)
         b(i+2*m) = a(i+2*m) + b(i+2*m-1)
         b(i+3*m) = a(i+3*m) + b(i+3*m-1)
      end do
      ! Adjusting: add the total of the preceding parts to each part
      b(m+1:2*m) = b(m+1:2*m) + b(m)
      b(2*m+1:3*m) = b(2*m+1:3*m) + b(2*m)
      b(3*m+1:4*m) = b(3*m+1:4*m) + b(3*m)
      ! Remainder when n is not a multiple of 4
      do i = 4*m+1, n
         b(i) = a(i) + b(i-1)
      end do
   end subroutine

   ! cumsum with four independent partial sums, adjusted with a single loop
   subroutine cumsum3(n, a, b)
      integer :: n, i, m
      real(dp) :: a(n), b(n)
      real(dp) :: k1, k2, k3
      m = n/4
      ! Accumulate the four parts independently in a single loop
      b(1) = a(1)
      b(1+m) = a(1+m)
      b(1+2*m) = a(1+2*m)
      b(1+3*m) = a(1+3*m)
      do i = 2, m
         b(i) = a(i) + b(i-1)
         b(i+m) = a(i+m) + b(i+m-1)
         b(i+2*m) = a(i+2*m) + b(i+2*m-1)
         b(i+3*m) = a(i+3*m) + b(i+3*m-1)
      end do
      ! Adjusting: k1, k2, k3 are the totals of the preceding parts
      k1 = b(m)
      k2 = b(2*m) + k1
      k3 = b(3*m) + k2
      do i = 1, m
         b(i+m) = b(i+m) + k1
         b(i+2*m) = b(i+2*m) + k2
         b(i+3*m) = b(i+3*m) + k3
      end do
      ! Remainder when n is not a multiple of 4
      do i = 4*m+1, n
         b(i) = a(i) + b(i-1)
      end do
   end subroutine

end module
program cumsum_test
   use cumsum_mod
   implicit none
   integer, parameter :: dp = kind(1d0)
   real(dp), allocatable :: a(:), b(:), b1(:), b2(:), b3(:)
   integer :: n, m, i
   real(dp) :: t1, t2

   read *, n, m
   allocate (a(n), b(n), b1(n), b2(n), b3(n))
   call random_number(a)

   ! Warm-up
   do i = 1, 20
      call cumsum1(n, a, b)
      call cumsum2(n, a, b)
      call cumsum3(n, a, b)
   end do

   call cpu_time(t1)
   do i = 1, m
      call cumsum1(n, a, b)
   end do
   call cpu_time(t2)
   print *, t2-t1

   call cpu_time(t1)
   do i = 1, m
      call cumsum2(n, a, b)
   end do
   call cpu_time(t2)
   print *, t2-t1

   call cpu_time(t1)
   do i = 1, m
      call cumsum3(n, a, b)
   end do
   call cpu_time(t2)
   print *, t2-t1

   ! Quick check of the difference
   call cumsum1(n, a, b1)
   call cumsum2(n, a, b2)
   call cumsum3(n, a, b3)
   print *, maxval(abs(b2-b1))
   print *, maxval(abs(b3-b1))

   deallocate (a, b, b1, b2, b3)
end program
The "optimized" way comes in two flavors: either the adjustment is done with four vector statements, e.g. b(m+1:2*m) = b(m+1:2*m) + b(m)
, either with a loop over the three parts at the same time.
I compile with Intel Fortran using /O3 /QxHost, and the processor is an Intel Core i7-9700F (i.e. it has AVX2).
I am not going to post the whole assembly output (250 kB), but here is the core of the main accumulation loop in each version:
cumsum1
L1:
inc rcx
vaddsd xmm0, xmm0, QWORD PTR [8+rdx+r10]
vmovsd QWORD PTR [8+rdx+r8], xmm0
vaddsd xmm0, xmm0, QWORD PTR [16+rdx+r10]
vmovsd QWORD PTR [16+rdx+r8], xmm0
add rdx, 16
cmp rcx, r11
jb L1
cumsum2 and cumsum3
L1:
vaddsd xmm0, xmm0, QWORD PTR [8+rdi+r14*8]
vaddsd xmm1, xmm1, QWORD PTR [8+r12+r14*8]
vaddsd xmm2, xmm2, QWORD PTR [8+r11+r14*8]
vaddsd xmm3, xmm3, QWORD PTR [8+r9+r14*8]
vmovsd QWORD PTR [8+r8+r14*8], xmm0
vmovsd QWORD PTR [8+rcx+r14*8], xmm1
vmovsd QWORD PTR [8+rbp+r14*8], xmm2
vmovsd QWORD PTR [8+rsi+r14*8], xmm3
inc r14
cmp r14, r10
jb L1
Both are thus essentially "scalar" loops: no packed SIMD instructions are involved. The adjustment loops, however, do use packed (parallel) instructions, as expected.
Now, the question: why are cumsum2/cumsum3 faster than cumsum1? They are about 3x faster for length 1000, 2x faster for length 10000, and 30-60% faster for length 100000:
length=1000 loops=100000000
cumsum1 85.8593750000000
cumsum2 27.2187500000000
cumsum3 28.0937500000000
length=10000 loops=1000000
cumsum1 8.78125000000000
cumsum2 4.51562500000000
cumsum3 4.56250000000000
length=100000 loops=1000000
cumsum1 87.8281250000000
cumsum2 52.9687500000000
cumsum3 63.3281250000000
My initial thought was that the additions might be performed in parallel, giving some improvement. However, that part of the code is sequential (scalar) in all versions, cumsum2/cumsum3 even do extra work afterwards to adjust the values, and yet they are still much faster than cumsum1.
I suspect it's related to the memory cache, but I don't see how.
Any help is welcome! And if there are clever ways to make this even faster, I'm also interested.
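In case it helps the discussion, the kind of generalization I have in mind is an extra routine in the module along these lines, with the number of chains as a parameter (cumsum_chains and nchains are made up for this post; it is only a sketch, assuming n >= nchains, and I have not benchmarked it):

   subroutine cumsum_chains(n, nchains, a, b)
      integer :: n, nchains, i, j, m
      real(dp) :: a(n), b(n)
      m = n/nchains   ! assumes m >= 1, i.e. n >= nchains
      ! Start each chain
      do j = 0, nchains-1
         b(1+j*m) = a(1+j*m)
      end do
      ! Advance all chains together
      do i = 2, m
         do j = 0, nchains-1
            b(i+j*m) = a(i+j*m) + b(i+j*m-1)
         end do
      end do
      ! Adjust each chain by the (already adjusted) total of everything before it
      do j = 1, nchains-1
         b(1+j*m:(j+1)*m) = b(1+j*m:(j+1)*m) + b(j*m)
      end do
      ! Remainder when n is not a multiple of nchains
      do i = nchains*m+1, n
         b(i) = a(i) + b(i-1)
      end do
   end subroutine

With nchains as a runtime argument the compiler probably cannot keep the partial sums in registers the way it does for the fixed four chains, so this may well be slower; I mention it only to show the direction I am thinking of.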
Edit, to answer a comment:
Here is the second (adjustment) loop in cumsum3:
L1:
vaddpd ymm6, ymm5, YMMWORD PTR [rcx+r10*8]
vmovupd YMMWORD PTR [rcx+r10*8], ymm6
vaddpd ymm6, ymm4, YMMWORD PTR [rbp+r10*8]
vmovupd YMMWORD PTR [rbp+r10*8], ymm6
vaddpd ymm6, ymm3, YMMWORD PTR [rsi+r10*8]
vmovupd YMMWORD PTR [rsi+r10*8], ymm6
vaddpd ymm6, ymm5, YMMWORD PTR [32+rcx+r10*8]
vmovupd YMMWORD PTR [32+rcx+r10*8], ymm6
vaddpd ymm6, ymm4, YMMWORD PTR [32+rbp+r10*8]
vmovupd YMMWORD PTR [32+rbp+r10*8], ymm6
vaddpd ymm6, ymm3, YMMWORD PTR [32+rsi+r10*8]
vmovupd YMMWORD PTR [32+rsi+r10*8], ymm6
vaddpd ymm6, ymm5, YMMWORD PTR [64+rcx+r10*8]
vmovupd YMMWORD PTR [64+rcx+r10*8], ymm6
vaddpd ymm6, ymm4, YMMWORD PTR [64+rbp+r10*8]
vmovupd YMMWORD PTR [64+rbp+r10*8], ymm6
vaddpd ymm6, ymm3, YMMWORD PTR [64+rsi+r10*8]
vmovupd YMMWORD PTR [64+rsi+r10*8], ymm6
vaddpd ymm6, ymm5, YMMWORD PTR [96+rcx+r10*8]
vmovupd YMMWORD PTR [96+rcx+r10*8], ymm6
vaddpd ymm6, ymm4, YMMWORD PTR [96+rbp+r10*8]
vmovupd YMMWORD PTR [96+rbp+r10*8], ymm6
vaddpd ymm6, ymm3, YMMWORD PTR [96+rsi+r10*8]
vmovupd YMMWORD PTR [96+rsi+r10*8], ymm6
add r10, 16
cmp r10, r9
jb L1
With cumsum2, the same adjustment is done in three separate loops, each looking like:
L1:
vaddpd ymm2, ymm1, YMMWORD PTR [rcx+r11*8]
vaddpd ymm3, ymm1, YMMWORD PTR [32+rcx+r11*8]
vaddpd ymm4, ymm1, YMMWORD PTR [64+rcx+r11*8]
vaddpd ymm5, ymm1, YMMWORD PTR [96+rcx+r11*8]
vmovupd YMMWORD PTR [rcx+r11*8], ymm2
vmovupd YMMWORD PTR [32+rcx+r11*8], ymm3
vmovupd YMMWORD PTR [64+rcx+r11*8], ymm4
vmovupd YMMWORD PTR [96+rcx+r11*8], ymm5
add r11, 16
cmp r11, r12
jb L1