
I’m trying to figure out which loop optimizations (in compilers) can be applied to (linear algebra) vector summation.

Vector sum is defined as:

for i in range(len(vector1)):
     vector_result[i] = vector1[i] + vector2[i]

How can I optimize this?

Some loop optimizations I’m aware of:

  1. Loop fission
  2. Loop fusion
  3. Vectorization
  4. Tiling

But I don’t know how any of these can be applied to the summation of vectors.

If anybody out there knows a thing or two about compilers and loop optimizations, I would appreciate an example of the code after a loop optimization is applied. Thanks.

chez93
  • Tiling isn't applicable since this is just one pass over two arrays. To make it more cache-friendly, you'd have to consider the bigger picture of the code that writes these arrays (can you sum on the fly while doing that, or write a piece then sum it?) – Peter Cordes Oct 30 '22 at 02:52
  • If you actually meant `Sum += v1[i] + v2[i]`, then yes, you'd want to vectorize and unroll with multiple accumulators to hide FP latency, as in a dot product which has the same dependency pattern. (But it can use FMA, this can't since it's all additions.) See links at the top of my answer on [Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)](https://stackoverflow.com/a/45114487), and that answer itself. – Peter Cordes Oct 30 '22 at 02:53
  • But what you actually wrote, `Sum = stuff[i]`, overwrites Sum every iteration without reading it, so the correct optimization is to replace the loop with `Sum = v1[len-1] + v2[len-1]`, i.e. just add the two elements that set the final value of Sum, and don't touch any other elements. – Peter Cordes Oct 30 '22 at 02:55
  • For cases where `len` isn't a multiple of the vector length, you'd want a cleanup loop at the end, like [The correct way to sum two arrays with SSE2 SIMD in C++](https://stackoverflow.com/q/39759936) – Peter Cordes Oct 30 '22 at 03:01
  • After the edit, that's an even easier problem to vectorize; it's purely vertical, no reduction, so it can be vectorized for FP data without having to pretend FP math is associative (`-ffast-math`). – Peter Cordes Oct 30 '22 at 03:58

2 Answers


Note: This answer assumes `sum =` is a typo and the OP meant `sum +=`.


The most important optimizations here are loop unrolling and vectorization.

The pseudocode would look like this:

Some notations and assumptions:

  • `v[i : i + 4]` denotes a SIMD vector of 4 consecutive elements; `sum_4` is likewise a 4-element SIMD accumulator.
  • The SIMD loads are assumed to require 16-byte alignment (hence the header loop).
i = 0
sum_4 = 0..0;

// header
while vec1 + i !aligned at 16 bytes:
    sum_4[0] += vec1[i] + vec2[i];
    ++i

// simd - can be further unrolled with multiple vector registers
while i + 4 <= len(vec1):
    sum_4 += vec1[i : i + 4] + vec2[i : i + 4]
    i += 4

// footer
while i < len(vec1):
    sum_4[0] += vec1[i] + vec2[i]
    ++i

// aggregate
int sum = horizontal_add(sum_4)
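
For concreteness, here is one way that pseudocode could look in C with SSE intrinsics, assuming `float` data and the reduction `sum += v1[i] + v2[i]`; the function name is made up, and like any SIMD reduction it reorders the FP additions (the associativity point raised in the comments under the question):

#include <immintrin.h>  // SSE intrinsics
#include <stddef.h>
#include <stdint.h>

float sum_two_arrays(const float *v1, const float *v2, size_t len)
{
    size_t i = 0;
    float scalar_sum = 0.0f;          // header + footer contributions
    __m128 sum_4 = _mm_setzero_ps();  // 4 partial sums

    // header: scalar until v1 + i is 16-byte aligned
    while (i < len && ((uintptr_t)(v1 + i) & 15) != 0) {
        scalar_sum += v1[i] + v2[i];
        ++i;
    }

    // simd: aligned loads from v1, unaligned loads from v2
    for (; i + 4 <= len; i += 4) {
        __m128 a = _mm_load_ps(v1 + i);
        __m128 b = _mm_loadu_ps(v2 + i);
        sum_4 = _mm_add_ps(sum_4, _mm_add_ps(a, b));
    }

    // footer: remaining scalar elements
    for (; i < len; ++i)
        scalar_sum += v1[i] + v2[i];

    // aggregate: horizontal add of the 4 partial sums
    __m128 hi   = _mm_movehl_ps(sum_4, sum_4);   // lanes 2,3
    __m128 pair = _mm_add_ps(sum_4, hi);         // (0+2, 1+3, ...)
    __m128 one  = _mm_add_ss(pair, _mm_shuffle_ps(pair, pair, _MM_SHUFFLE(0, 0, 0, 1)));
    return _mm_cvtss_f32(one) + scalar_sum;
}
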
bolov
  • Note that the code in the question doesn't actually sum the array, it sums the last 2 elements. But presumably it was intended to be `Sum +=` like your answer is implementing, instead of `Sum =`. But the OP's self-answer uses `Sum[i] = v[i] + v[i]`, so it's a purely "vertical" vector operation, not a reduction, so no need to unroll with multiple vector accumulators to hide FP latency. – Peter Cordes Oct 30 '22 at 03:42
  • @PeterCordes I didn't even notice that. Yes, `sum =` makes no sense as it discards the whole loop and it is most likely a typo. – bolov Oct 30 '22 at 03:45

So, I found this beautiful paper by Frances Allen:

paper

Based on her explanation, I think I can unroll the vector sum loop by two, four, etc.

Here's an example of unrolling the loop by two:

for i in range(0, len(v1), 2):
    vector_result[i] = v1[i] + v2[i]
    vector_result[i+1] = v1[i+1] + v2[i+1]
chez93
  • Yes, and roll that back up into a SIMD operation that does two or four elements at once. Unrolling to create independent work is the first step of SIMD vectorization. – Peter Cordes Oct 30 '22 at 03:28
  • @PeterCordes how do I make it so it actually optimizes? I tested some Python code and the unroll by 2 runs slower than the normal for loop. Any tips? – chez93 Oct 30 '22 at 04:10
  • In what, CPython? It's not a compiler, it's a pretty basic interpreter with no ability to optimize. Write it in a compiled language like C++ or Rust and compile with clang or gcc `-O3 -march=native`. And don't unroll by hand in the source, that usually just makes it *harder* for compilers to auto-vectorize. See [C loop optimization help for final assignment (with compiler optimization disabled)](https://stackoverflow.com/q/32000917) for some examples of how much it helps to enable optimization in C. (CPython has no optimizer, but Cython can compile Python to C.) – Peter Cordes Oct 30 '22 at 05:40
  • I assumed you were just using Python as pseudo-code for the loop logic, in a question about optimizing compilers for various languages. – Peter Cordes Oct 30 '22 at 05:42
  • Found a better example: [Does compiler use SSE instructions for a regular C code?](https://stackoverflow.com/a/50786580) shows a simple C loop and the resulting compiler output. – Peter Cordes Oct 30 '22 at 05:44
  • @PeterCordes yeah, I was using Python as pseudocode for optimizing compilers in general. But I wanted to see what happened if I ran the code. So if I take that unrolled loop and compile it to assembly, will it be faster than a normal loop? I'm building a linear algebra compiler and I want to optimize. – chez93 Oct 30 '22 at 12:55
  • @PeterCordes I am actually building a linear algebra compiler that compiles a simple linear algebra language to Common Lisp and to x86. The loop unrolling will be part of my intermediate language/representation. How do I ensure that this loop unrolling actually optimizes? Any tips will be greatly appreciated -- thanks. – chez93 Oct 30 '22 at 13:17
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/249167/discussion-between-sphereinamanifold-and-peter-cordes). – chez93 Oct 30 '22 at 16:24
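
Tying the last few comments together: written in plain C with no manual unrolling, the original loop is something gcc or clang will auto-vectorize at `-O3` (a minimal sketch; the `restrict` qualifiers encode the assumption that the arrays don't overlap):

#include <stddef.h>

// Compile with e.g. `gcc -O3 -march=native`; the compiler unrolls and
// vectorizes this loop on its own. `restrict` tells it the arrays don't
// alias, which makes vectorization easier to prove safe.
void vector_add(const float *restrict v1, const float *restrict v2,
                float *restrict vector_result, size_t len)
{
    for (size_t i = 0; i < len; ++i)
        vector_result[i] = v1[i] + v2[i];
}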