
I’m trying to figure out which loop optimizations (in compilers) can be applied to (linear algebra) vector summation.

Vector sum is defined as:

for i in range(len(vector1)):
     vector_result[i] = vector1[i] + vector2[i]

How can I optimize this?

Some loop optimizations I’m aware of:

  1. Loop fission
  2. Loop fusion
  3. Vectorization
  4. Tiling

But I don’t know how any of these can be applied to the summation of vectors.

If anybody out there knows a thing or two about compilers and loop optimizations, I would appreciate an example of the code after a loop optimization is applied. Thanks.

chez93
  • Tiling isn't applicable since this is just one pass over two arrays. To make it more cache-friendly, you'd have to consider the bigger picture of the code that writes these arrays (can you sum on the fly while doing that, or write a piece then sum it?) – Peter Cordes Oct 30 '22 at 02:52
  • If you actually meant `Sum += v1[i] + v2[i]`, then yes, you'd want to vectorize and unroll with multiple accumulators to hide FP latency, as in a dot product which has the same dependency pattern. (But it can use FMA, this can't since it's all additions.) See links at the top of my answer on [Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)](https://stackoverflow.com/a/45114487), and that answer itself. – Peter Cordes Oct 30 '22 at 02:53
  • But what you actually wrote, `Sum = stuff[i]`, overwrites Sum every iteration without reading it, so the correct optimization is to replace the loop with `Sum = v1[len-1] + v2[len-1]`, i.e. just add the two elements that set the final value of Sum, and don't touch any other elements. – Peter Cordes Oct 30 '22 at 02:55
  • For cases where `len` isn't a multiple of the vector length, you'd want a cleanup loop at the end, like [The correct way to sum two arrays with SSE2 SIMD in C++](https://stackoverflow.com/q/39759936) – Peter Cordes Oct 30 '22 at 03:01
  • After the edit, that's an even easier problem to vectorize; it's purely vertical, no reduction, so it can be vectorized for FP data without having to pretend FP math is associative (`-ffast-math`). – Peter Cordes Oct 30 '22 at 03:58

2 Answers


Note: This answer assumes `sum =` is a typo and the OP meant `sum +=`.


The most important optimizations here are loop unrolling and vectorization.

The pseudocode would look like this:

Some notations and assumptions:

  • `v[i : i + 4]` denotes a SIMD vector of 4 consecutive elements; `sum_4` is likewise a 4-element SIMD accumulator.
  • The SIMD loads are assumed to require 16-byte alignment (hence the header loop).
i = 0
sum_4 = 0..0;

// header
while vec1 + i !aligned at 16 bytes:
    sum_4[0] += vec1[i] + vec2[i];
    ++i

// simd - can be further unrolled with multiple vector registers
while i + 4 <= len(vec1):
    sum_4 += vec1[i : i + 4] + vec2[i : i + 4]
    i += 4

// footer
while i < len(vec1):
    sum_4[0] += vec1[i] + vec2[i]
    ++i

// aggregate
int sum = horizontal_add(sum_4)
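
For concreteness, here is one way that pseudocode could look in C with SSE intrinsics, assuming `float` data and the reduction `sum += v1[i] + v2[i]`; the function name is made up, and like any SIMD reduction it reorders the FP additions (the associativity point raised in the comments under the question):

#include <immintrin.h>  // SSE intrinsics
#include <stddef.h>
#include <stdint.h>

float sum_two_arrays(const float *v1, const float *v2, size_t len)
{
    size_t i = 0;
    float scalar_sum = 0.0f;          // header + footer contributions
    __m128 sum_4 = _mm_setzero_ps();  // 4 partial sums

    // header: scalar until v1 + i is 16-byte aligned
    while (i < len && ((uintptr_t)(v1 + i) & 15) != 0) {
        scalar_sum += v1[i] + v2[i];
        ++i;
    }

    // simd: aligned loads from v1, unaligned loads from v2
    for (; i + 4 <= len; i += 4) {
        __m128 a = _mm_load_ps(v1 + i);
        __m128 b = _mm_loadu_ps(v2 + i);
        sum_4 = _mm_add_ps(sum_4, _mm_add_ps(a, b));
    }

    // footer: remaining scalar elements
    for (; i < len; ++i)
        scalar_sum += v1[i] + v2[i];

    // aggregate: horizontal add of the 4 partial sums
    __m128 hi   = _mm_movehl_ps(sum_4, sum_4);   // lanes 2,3
    __m128 pair = _mm_add_ps(sum_4, hi);         // (0+2, 1+3, ...)
    __m128 one  = _mm_add_ss(pair, _mm_shuffle_ps(pair, pair, _MM_SHUFFLE(0, 0, 0, 1)));
    return _mm_cvtss_f32(one) + scalar_sum;
}
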
bolov
  • Note that the code in the question doesn't actually sum the array, it sums the last 2 elements. But presumably it was intended to be `Sum +=` like your answer is implementing, instead of `Sum =`. But the OP's self-answer uses `Sum[i] = v[i] + v[i]`, so it's a purely "vertical" vector operation, not a reduction, so no need to unroll with multiple vector accumulators to hide FP latency. – Peter Cordes Oct 30 '22 at 03:42
  • @PeterCordes I didn't even notice that. Yes, `sum =` makes no sense as it discards the whole loop and it is most likely a typo. – bolov Oct 30 '22 at 03:45

So, I found this beautiful paper by Frances Allen:

paper

Based on her explanation, I think I can unroll the vector sum loop by two, four, etc.

Here's an example of unrolling the loop by two:

for i in range(0, len(v1), 2):
    vector_result[i] = v1[i] + v2[i]
    vector_result[i+1] = v1[i+1] + v2[i+1]
chez93
  • Yes, and roll that back up into a SIMD operation that does two or four elements at once. Unrolling to create independent work is the first step of SIMD vectorization. – Peter Cordes Oct 30 '22 at 03:28
  • @PeterCordes how do I make it so it actually optimizes? I tested some Python code and the unroll by 2 runs slower than the normal for loop. Any tips? – chez93 Oct 30 '22 at 04:10
  • In what, CPython? It's not a compiler, it's a pretty basic interpreter with no ability to optimize. Write it in a compiled language like C++ or Rust and compile with clang or gcc `-O3 -march=native`. And don't unroll by hand in the source, that usually just makes it *harder* for compilers to auto-vectorize. See [C loop optimization help for final assignment (with compiler optimization disabled)](https://stackoverflow.com/q/32000917) for some examples of how much it helps to enable optimization in C. (CPython has no optimizer, but Cython can compile Python to C.) – Peter Cordes Oct 30 '22 at 05:40
  • I assumed you were just using Python as pseudo-code for the loop logic, in a question about optimizing compilers for various languages. – Peter Cordes Oct 30 '22 at 05:42
  • Found a better example: [Does compiler use SSE instructions for a regular C code?](https://stackoverflow.com/a/50786580) shows a simple C loop and the resulting compiler output. – Peter Cordes Oct 30 '22 at 05:44
  • @PeterCordes yeah, I was using Python as pseudocode for optimizing compilers in general. But I wanted to see what happened if I ran the code. So if I take that unrolled loop and compile it to assembly, will it be faster than a normal loop? I'm building a linear algebra compiler and I want to optimize. – chez93 Oct 30 '22 at 12:55
  • @PeterCordes I am actually building a linear algebra compiler that compiles a simple linear algebra language to Common Lisp and to x86. The loop unrolling will be part of my intermediate language/representation. How do I ensure that this loop unrolling actually optimizes? Any tips will be greatly appreciated -- thanks. – chez93 Oct 30 '22 at 13:17
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/249167/discussion-between-sphereinamanifold-and-peter-cordes). – chez93 Oct 30 '22 at 16:24
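
Tying the last few comments together: written in plain C with no manual unrolling, the original loop is something gcc or clang will auto-vectorize at `-O3` (a minimal sketch; the `restrict` qualifiers encode the assumption that the arrays don't overlap):

#include <stddef.h>

// Compile with e.g. `gcc -O3 -march=native`; the compiler unrolls and
// vectorizes this loop on its own. `restrict` tells it the arrays don't
// alias, which makes vectorization easier to prove safe.
void vector_add(const float *restrict v1, const float *restrict v2,
                float *restrict vector_result, size_t len)
{
    for (size_t i = 0; i < len; ++i)
        vector_result[i] = v1[i] + v2[i];
}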