
I need some help to understand how an optimization I tried is even working.

The cumsum function gets a vector, and writes a vector with the accumulated sum.

I tried the following optimization: instead of doing one loop over the entire vector, I wrote a single loop that runs over the four quarters of the vector at the same time, computing a separate running sum for each quarter. Each quarter is then adjusted to take into account the sum of the preceding quarters. The result is slightly different (the additions happen in a different order, so the rounding differs), but it's not a problem.

Here is the program:

module cumsum_mod
    implicit none
    integer, parameter, private :: dp = kind(1d0)
contains
    ! cumsum in one straight loop
    subroutine cumsum1(n, a, b)
        integer :: n, i
        real(dp) :: a(n), b(n)
        
        b(1) = a(1)
        do i = 2, n
            b(i) = a(i) + b(i-1)
        end do
    end subroutine
    
    subroutine cumsum2(n, a, b)
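        ! four simultaneous running sums, one per quarter; adjusted with array statements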
        integer :: n, i, m
        real(dp) :: a(n), b(n)
        
        m = n/4
        
        ! Loop over the four parts
        b(1) = a(1)
        b(1+m) = a(1+m)
        b(1+2*m) = a(1+2*m)
        b(1+3*m) = a(1+3*m)
        do i = 2, m
            b(i) = a(i) + b(i-1)
            b(i+m) = a(i+m) + b(i+m-1)
            b(i+2*m) = a(i+2*m) + b(i+2*m-1)
            b(i+3*m) = a(i+3*m) + b(i+3*m-1)
        end do
        
        ! Adjusting
        b(m+1:2*m) = b(m+1:2*m) + b(m)
        b(2*m+1:3*m) = b(2*m+1:3*m) + b(2*m)
        b(3*m+1:4*m) = b(3*m+1:4*m) + b(3*m)
        
        do i = 4*m+1, n
            b(i) = a(i) + b(i-1)
        end do
    end subroutine
    
    subroutine cumsum3(n, a, b)
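        ! same four-way split as cumsum2, but the adjustment is a single loop over the three parts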
        integer :: n, i, m
        real(dp) :: a(n), b(n)
        real(dp) :: k1, k2, k3
        
        m = n/4
        
        ! Loop over the four parts
        b(1) = a(1)
        b(1+m) = a(1+m)
        b(1+2*m) = a(1+2*m)
        b(1+3*m) = a(1+3*m)
        do i = 2, m
            b(i) = a(i) + b(i-1)
            b(i+m) = a(i+m) + b(i+m-1)
            b(i+2*m) = a(i+2*m) + b(i+2*m-1)
            b(i+3*m) = a(i+3*m) + b(i+3*m-1)
        end do
        
        ! Adjusting
        k1 = b(m)
        k2 = b(2*m) + k1
        k3 = b(3*m) + k2
        do i = 1, m
            b(i+m) = b(i+m) + k1
            b(i+2*m) = b(i+2*m) + k2
            b(i+3*m) = b(i+3*m) + k3
        end do
        
        do i = 4*m+1, n
            b(i) = a(i) + b(i-1)
        end do
    end subroutine
end module

program cumsum_test
    use cumsum_mod
    implicit none
    integer, parameter :: dp = kind(1d0)
    real(dp), allocatable :: a(:), b(:), b1(:), b2(:), b3(:)
    integer :: n, m, i
    real(dp) :: t1, t2
    
    read *, n, m
    
    allocate (a(n), b(n), b1(n), b2(n), b3(n))
    call random_number(a)
    
    ! Warm-up runs
    do i = 1, 20
        call cumsum1(n, a, b)
        call cumsum2(n, a, b)
        call cumsum3(n, a, b)
    end do
    
    call cpu_time(t1)
    do i = 1, m
        call cumsum1(n, a, b)
    end do
    call cpu_time(t2)
    print *, t2-t1
    
    call cpu_time(t1)
    do i = 1, m
        call cumsum2(n, a, b)
    end do
    call cpu_time(t2)
    print *, t2-t1

    call cpu_time(t1)
    do i = 1, m
        call cumsum3(n, a, b)
    end do
    call cpu_time(t2)
    print *, t2-t1
    
    ! Quick check of the difference
    call cumsum1(n, a, b1)
    call cumsum2(n, a, b2)
    call cumsum3(n, a, b3)
    print *, maxval(abs(b2-b1))
    print *, maxval(abs(b3-b1))
    deallocate (a, b, b1, b2, b3)
end program

The "optimized" way comes in two flavors: either the adjustment is done with four vector statements, e.g. b(m+1:2*m) = b(m+1:2*m) + b(m), either with a loop over the three parts at the same time.

I compile with Intel Fortran, with /O3 /QxHost, and the processor is an Intel Core i7-9700F (which has AVX2).

I am not going to post the whole assembly output (250 KB), but here are the main loops of each version:

cumsum1

L1:
    inc       rcx
    vaddsd    xmm0, xmm0, QWORD PTR [8+rdx+r10]
    vmovsd    QWORD PTR [8+rdx+r8], xmm0
    vaddsd    xmm0, xmm0, QWORD PTR [16+rdx+r10]
    vmovsd    QWORD PTR [16+rdx+r8], xmm0
    add       rdx, 16
    cmp       rcx, r11
    jb        L1

cumsum2 and cumsum3

L1:
    vaddsd    xmm0, xmm0, QWORD PTR [8+rdi+r14*8]
    vaddsd    xmm1, xmm1, QWORD PTR [8+r12+r14*8]
    vaddsd    xmm2, xmm2, QWORD PTR [8+r11+r14*8]
    vaddsd    xmm3, xmm3, QWORD PTR [8+r9+r14*8]
    vmovsd    QWORD PTR [8+r8+r14*8], xmm0
    vmovsd    QWORD PTR [8+rcx+r14*8], xmm1
    vmovsd    QWORD PTR [8+rbp+r14*8], xmm2
    vmovsd    QWORD PTR [8+rsi+r14*8], xmm3
    inc       r14
    cmp       r14, r10
    jb        L1

Both are thus essentially "scalar" loops: no packed SIMD operations are involved. The adjustment, however, is done with packed (parallel) instructions, as expected.

Now, the question: why are cumsum2/cumsum3 faster than cumsum1? Roughly 3x faster for length 1000, 2x for length 10000, and 40-65% for length 100000 (times in seconds):

length=1000 loops=100000000
cumsum1   85.8593750000000
cumsum2   27.2187500000000
cumsum3   28.0937500000000

length=10000 loops=1000000
cumsum1   8.78125000000000
cumsum2   4.51562500000000
cumsum3   4.56250000000000
  
length=100000 loops=1000000
cumsum1   87.8281250000000
cumsum2   52.9687500000000
cumsum3   63.3281250000000

My initial thought was that the four running sums might be computed with parallel SIMD additions, which would explain some improvement. However, that part of the code is scalar in all versions, and cumsum2/cumsum3 even have additional work afterwards to adjust the values, yet they are still much faster than cumsum1.

I suspect it's related to the memory cache, but I don't see how.

Any help is welcome! And if there are clever ways to make this even faster, I'm also interested.


Edit, to answer a comment

The second (adjusting) loop in cumsum3:

L1:
    vaddpd    ymm6, ymm5, YMMWORD PTR [rcx+r10*8]
    vmovupd   YMMWORD PTR [rcx+r10*8], ymm6
    vaddpd    ymm6, ymm4, YMMWORD PTR [rbp+r10*8]
    vmovupd   YMMWORD PTR [rbp+r10*8], ymm6
    vaddpd    ymm6, ymm3, YMMWORD PTR [rsi+r10*8]
    vmovupd   YMMWORD PTR [rsi+r10*8], ymm6
    vaddpd    ymm6, ymm5, YMMWORD PTR [32+rcx+r10*8]
    vmovupd   YMMWORD PTR [32+rcx+r10*8], ymm6
    vaddpd    ymm6, ymm4, YMMWORD PTR [32+rbp+r10*8]
    vmovupd   YMMWORD PTR [32+rbp+r10*8], ymm6
    vaddpd    ymm6, ymm3, YMMWORD PTR [32+rsi+r10*8]
    vmovupd   YMMWORD PTR [32+rsi+r10*8], ymm6
    vaddpd    ymm6, ymm5, YMMWORD PTR [64+rcx+r10*8]
    vmovupd   YMMWORD PTR [64+rcx+r10*8], ymm6
    vaddpd    ymm6, ymm4, YMMWORD PTR [64+rbp+r10*8]
    vmovupd   YMMWORD PTR [64+rbp+r10*8], ymm6
    vaddpd    ymm6, ymm3, YMMWORD PTR [64+rsi+r10*8]
    vmovupd   YMMWORD PTR [64+rsi+r10*8], ymm6
    vaddpd    ymm6, ymm5, YMMWORD PTR [96+rcx+r10*8]
    vmovupd   YMMWORD PTR [96+rcx+r10*8], ymm6
    vaddpd    ymm6, ymm4, YMMWORD PTR [96+rbp+r10*8]
    vmovupd   YMMWORD PTR [96+rbp+r10*8], ymm6
    vaddpd    ymm6, ymm3, YMMWORD PTR [96+rsi+r10*8]
    vmovupd   YMMWORD PTR [96+rsi+r10*8], ymm6
    add       r10, 16
    cmp       r10, r9
    jb        L1

With cumsum2, the same adjustment is achieved with three separate loops, each one looking like:

L1:
    vaddpd    ymm2, ymm1, YMMWORD PTR [rcx+r11*8]
    vaddpd    ymm3, ymm1, YMMWORD PTR [32+rcx+r11*8]
    vaddpd    ymm4, ymm1, YMMWORD PTR [64+rcx+r11*8]
    vaddpd    ymm5, ymm1, YMMWORD PTR [96+rcx+r11*8]
    vmovupd   YMMWORD PTR [rcx+r11*8], ymm2
    vmovupd   YMMWORD PTR [32+rcx+r11*8], ymm3
    vmovupd   YMMWORD PTR [64+rcx+r11*8], ymm4
    vmovupd   YMMWORD PTR [96+rcx+r11*8], ymm5
    add       r11, 16
    cmp       r11, r12
    jb        L1
  • On modern CPUs, floating-point addition has a latency of ~3-4 cycles, but about 2 independent floating-point additions can be started each cycle. If you have a single chain of dependent additions, your total execution time will be determined by the latency, not by the possible throughput. You *do* get parallelism when you split the sum into 4 independent partial sums that are added together in the end. You could potentially get *much* more throughput by vectorizing, though you'll soon be memory-bound for such a simple operation. – EOF Aug 31 '21 at 17:29
  • In particular, your cumsum2/3 are almost certainly going to be limited by memory *write* operations, so vectorizing *those* would open up another integer couple of x times speedup. – EOF Aug 31 '21 at 17:34
  • @EOF Just to be sure, what do you mean by vectorizing, here? I guess the additions could be done in parallel, but the compiler didn't do that. Maybe I would have to reorder the array first? –  Aug 31 '21 at 17:50
  • @EOF: This is hard to vectorize because of the serial dependency. But yes, it's possible and still somewhat profitable, especially with wide vectors and for floating-point where latency > 1 cycle. [parallel prefix (cumulative) sum with SSE](https://stackoverflow.com/q/19494114) is IIRC a good combo of SIMD and thread-level parallelism. (And yes, for large problems memory bandwidth becomes a bottleneck when your chunk sizes are too big to fit in L1d or L2 cache before you come back to add the starting-point sum for a chunk.) – Peter Cordes Aug 31 '21 at 18:03
  • The "adjusting" parts are hopefully already getting auto-vectorized because they're trivial, the classic use-case for SIMD. (Especially where you read `k1` and so on from the end of each chunk before doing more. But it probably doesn't help to interleave the operations on the 3 chunks; that may defeat auto-vectorization unless the compiler sees they don't overlap. One contiguous read/write stream might be better than 3 anyway, or maybe the unrolling will help hide cache misses if it can vectorize) – Peter Cordes Aug 31 '21 at 18:59
  • @PeterCordes I updated the answer to show this part. When the array is large, cumsum2 gets better than cumsum3. –  Aug 31 '21 at 19:35
  • Oh good, then the adjust part of cumsum2 is auto-vectorizing optimally; I guessed that from the fact it was faster. The compiler isn't getting tripped up by reading some elements from the same array that's being updated. (Reading ahead of the updates may still be better if that lets the start value for chunk 2 be computed from chunk 0 and 1 without waiting for store/reload of the end of chunk 1, although out-of-order exec can still get started *loading* chunk 2 right away before the loop-invariant vector is ready.) – Peter Cordes Aug 31 '21 at 19:47
  • @PeterCordes So, if I understand correctly, the performance gain comes from the pipeline, and there is also a performance drop from the cache misses (since, I guess, the first loop has the same problem as the second one in cumsum3). So there is a tradeoff here. I will experiment with more or less than 4 chunks, but I wonder, is there a way to predict the optimal way to cut the vector? It must depend on the instruction latencies and the pipeline size? Where can I learn more about this? –  Aug 31 '21 at 20:15
  • What are the constraints here? You're only willing to use pure portable Fortran, not intrinsics to vectorize manually? So any SIMD has to be asm that the compiler can create from normal Fortran? And you're tuning for single-threaded execution on a specific CPU, so you know the sizes of the per-core L1d and L2? Likely 4 chunks of total size 32kiB or 128kiB in parallel is good, or if you pick a chunk size that's not an exact multiple of 4k then 8 or more in parallel is good to hide FP add latency, and won't create conflict misses in L2 or L1d. – Peter Cordes Aug 31 '21 at 21:16
  • You can read more about instruction throughput vs. latency in [Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)](https://stackoverflow.com/q/45113527) / [Latency bounds and throughput bounds for processors for operations that must occur in sequence](https://stackoverflow.com/q/63095394) / [What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?](https://stackoverflow.com/q/51607391) – Peter Cordes Aug 31 '21 at 21:18
  • @PeterCordes Thanks, I'll have a look! –  Aug 31 '21 at 21:20
  • Re: how many read / write streams CPUs can handle efficiently: [For-loop efficiency: merging loops](https://stackoverflow.com/q/51021262). And re: tuning for cache sizes, it's not just *number* of chunks for ILP, it's *size* of each chunk to make sure you get back to them before they get evicted from cache. i.e. cache-block so you set an upper limit on amount of not-yet-adjusted array elements. You could call this "cache blocking" - [How does cache blocking actually speed up performance?](https://stackoverflow.com/q/63614160). Only use small chunks when your total array size is small. – Peter Cordes Aug 31 '21 at 21:24
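
Edit 2, following the comments

Here is a sketch of what a cache-blocked variant could look like, combining the suggestions above: the 4-way split of cumsum3 is applied to fixed-size blocks, so a block is still in cache (L1d or at least L2) when the adjustment pass comes back to it, and a running total is carried from block to block. This is untested and untimed; the name cumsum4 and the block size are just placeholders (4096 doubles = 32 KiB, to be tuned), and the routine is meant to go inside cumsum_mod so that it sees the module's dp parameter.

    subroutine cumsum4(n, a, b)
        integer :: n, i, i0, i1, m
        real(dp) :: a(n), b(n)
        real(dp) :: carry, k1, k2, k3
        ! Block length: 4096 doubles = 32 KiB; a guess, to be tuned against the cache sizes
        integer, parameter :: blk = 4096
        
        carry = 0
        do i0 = 1, n, blk
            i1 = min(i0 + blk - 1, n)
            m = (i1 - i0 + 1)/4
            
            if (m == 0) then
                ! Tiny tail block: plain sequential loop
                b(i0) = a(i0) + carry
                do i = i0+1, i1
                    b(i) = a(i) + b(i-1)
                end do
                carry = b(i1)
                cycle
            end if
            
            ! Four independent running sums inside the block, as in cumsum3
            b(i0) = a(i0)
            b(i0+m) = a(i0+m)
            b(i0+2*m) = a(i0+2*m)
            b(i0+3*m) = a(i0+3*m)
            do i = 1, m-1
                b(i0+i) = a(i0+i) + b(i0+i-1)
                b(i0+m+i) = a(i0+m+i) + b(i0+m+i-1)
                b(i0+2*m+i) = a(i0+2*m+i) + b(i0+2*m+i-1)
                b(i0+3*m+i) = a(i0+3*m+i) + b(i0+3*m+i-1)
            end do
            
            ! Adjust each quarter, folding in the carry from the previous blocks
            k1 = b(i0+m-1) + carry
            k2 = b(i0+2*m-1) + k1
            k3 = b(i0+3*m-1) + k2
            b(i0:i0+m-1) = b(i0:i0+m-1) + carry
            b(i0+m:i0+2*m-1) = b(i0+m:i0+2*m-1) + k1
            b(i0+2*m:i0+3*m-1) = b(i0+2*m:i0+3*m-1) + k2
            b(i0+3*m:i0+4*m-1) = b(i0+3*m:i0+4*m-1) + k3
            
            ! Remainder when the block length is not a multiple of 4
            do i = i0+4*m, i1
                b(i) = a(i) + b(i-1)
            end do
            carry = b(i1)
        end do
    end subroutine

The block size probably interacts with the number of chunks per block and with the total array size, so this is only meant as a starting point for experiments, not as a measured result.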
