I spent the last couple of weeks optimizing a numerical algorithm. Through a combination of precomputation, memory alignment, compiler hints and flags, and trial-and-error experimentation, I brought the run-time down by more than an order of magnitude. I have not yet explicitly vectorized using intrinsics or used multi-threading.
Frequently when working on this type of problem, there is an initialization routine, after which many parameters become constant. These might be filter lengths, the controlling expression of a switch statement, or the length or increment of a for loop. If the parameters were known at compile time, the compiler could do a much more effective job of optimization: it would know exactly how to unroll loops, replace index calculations with instructions that have the offset encoded directly, simplify or eliminate expressions at compile time, possibly eliminate switch statements, etc. The most extreme way of dealing with this would be to run the initialization routine (at run-time), then run the compiler on the critical function using some kind of plugin that allows iteration over the abstract syntax tree, replace the parameters with constants, and finally dynamically link to the resulting shared object. If the routine is short, it could be compiled at run-time from within the binary using a number of tools.
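For concreteness, a minimal sketch of that extreme approach might look like the following, assuming the kernel lives in a separate kernel.c and takes a single parameter; kernel.c, kernel() and FILTER_LEN are hypothetical names, and error handling is minimal.

```c
/* Sketch only: recompile a kernel after initialization, baking the now-constant
 * parameter in as a macro, then load it with dlopen(). kernel.c, kernel() and
 * FILTER_LEN are hypothetical names. Link with -ldl. */
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

typedef void (*kernel_fn)(float *dst, const float *src, size_t n);

kernel_fn build_kernel(int filter_len)
{
    char cmd[256];
    /* The run-time value becomes a compile-time constant for the kernel. */
    snprintf(cmd, sizeof cmd,
             "gcc -O3 -march=native -shared -fPIC -DFILTER_LEN=%d kernel.c -o ./kernel.so",
             filter_len);
    if (system(cmd) != 0)
        return NULL;

    void *handle = dlopen("./kernel.so", RTLD_NOW);
    if (!handle)
        return NULL;
    return (kernel_fn)dlsym(handle, "kernel");
}
```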
More practically, I rely very heavily on alignment, gcc's __builtin_assume_aligned, restrict, manual loop unrolling, and compiler flags to get the compiler to do what I want given that the parameter values are unknown at compile time. I'm wondering what other options are available to me that are at least close to portable. I only use intrinsics as a last resort, since they're not portable and a lot of work. Specifically, how can I provide the compiler (gcc) with additional information concerning loop variables, using either language semantics, compiler extensions, or external tools, so it can do a better job of optimizing for me? Similarly, is there any way to qualify variables as having a stride so that loads and stores are always aligned, thus more easily enabling auto-vectorization and loop unrolling?
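To make the current starting point concrete, this is the kind of hinting I already do; the function and names below are illustrative rather than my actual code.

```c
#include <stddef.h>

/* Illustrative only: restrict rules out aliasing between dst and src, and
 * __builtin_assume_aligned promises 32-byte (AVX) alignment so gcc can emit
 * aligned vector loads/stores when it auto-vectorizes the loop. */
void scale(float *restrict dst, const float *restrict src, float k, size_t n)
{
    float *d = __builtin_assume_aligned(dst, 32);
    const float *s = __builtin_assume_aligned(src, 32);
    for (size_t i = 0; i < n; ++i)
        d[i] = k * s[i];
}
```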
These issues come up frequently, so I am hoping there is some more elegant method of solving them. What follows are examples of the kinds of problems I hand-optimize but that I believe the compiler ought to be able to handle for me. They are not intended as further questions.
Sometimes you have a filter whose length is not a multiple of the length of the longest SIMD register, and there may be memory alignment issues as well. In this case I either (A) unroll the loop by a multiple of the vector register and call into the unoptimized code for the epilogue/prologue, or (B) pad the start or end of the filter with zeros. I've recently learned that gcc and other compilers have the ability to peel loops. From the limited documentation I've been able to find, I believe the finest-grained control you have over peeling applies to entire functions (rather than individual loops) via compiler directives. Further, there are some parameters you can provide, but they are mostly just upper or lower bounds on the amount of unrolling or the number of instructions produced.
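As a point of reference, option (A) typically ends up looking something like this sketch; the names and the factor of eight (one 256-bit AVX register of floats) are illustrative.

```c
#include <stddef.h>

/* Sketch of option (A): handle the bulk of the loop in chunks of 8 floats
 * and fall back to a scalar epilogue for the remaining 0-7 elements. */
void accumulate(float *restrict dst, const float *restrict src, size_t n)
{
    size_t main_n = n & ~(size_t)7;        /* largest multiple of 8 <= n */
    for (size_t i = 0; i < main_n; i += 8)
        for (size_t j = 0; j < 8; ++j)     /* body the compiler can vectorize */
            dst[i + j] += src[i + j];
    for (size_t i = main_n; i < n; ++i)    /* scalar epilogue */
        dst[i] += src[i];
}
```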
In order to really know the best method of unrolling/peeling or zero padding, the compiler needs to know something about the length of the loop and/or the size of the increment. For example, it would be very helpful for it to know that a loop is likely to run more than a million times, or fewer than 100 times, or that it will always run either 32 or 34 times. In fact, since the compiler knows much more about the target architecture than I do, it would be much better if it made all the unrolling decisions based on information I provide about the loop variables. I had a situation where I wanted the compiler to unroll a loop, so I gave it the #pragma GCC optimize ("unroll-loops") directive. However, what was also required to make it work was the statement N &= ~7, which informs the compiler that the loop length is a multiple of 8. This is not a semantic feature of the language, and it does not change the value of N; it exists strictly to tell the compiler's static analysis that the trip count was already a multiple of the AVX register length. In this case I was lucky and it worked, because gcc is very clever. But in other cases my hints don't seem to work (or perhaps they do, but there is no compiler feedback to tell me whether the additional information was of any value). In one case I had to explicitly tell the compiler not to unroll a loop, because the outer loop was very short and the overhead was not worth it. With the optimizer at its maximum setting, often the only way to know what's going on is to look at the assembly listing, make some changes, and try again.
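A stripped-down version of that situation looks roughly like this (the function and variable names are illustrative, not my actual code):

```c
/* Illustrative: the pragma asks gcc to unroll, and the masking statement
 * tells the optimizer the trip count is a multiple of 8. In my case N is
 * already a multiple of 8 at run-time, so its value does not change. */
#pragma GCC optimize ("unroll-loops")
void kernel(float *restrict out, const float *restrict in, int N)
{
    N &= ~7;                        /* hint: loop length is a multiple of 8 */
    for (int i = 0; i < N; ++i)
        out[i] += in[i];
}
```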
In another situation I carefully unrolled a loop so the compiler would use the AVX registers. The manual unrolling was probably necessary, as the compiler didn't have sufficient information about the length of the loop or that the length was a particular multiple. Unfortunately, the inner loop was accessing an unaligned array of floats, four per group (16-byte alignment), and the compiler was using only the legacy 128-bit XMM registers. After making a weak attempt to vectorize using AVX intrinsics, I discovered the extra overhead of the unaligned access made the performance no better than what gcc was already doing.

So I thought I could align each group of floats on the start of a cache line and use a stride equal to the cache line length (or half of it, which is the length of an AVX register) to eliminate the alignment problem. However, this may turn out to be ineffective due to the extra memory bandwidth. It's certainly more work on my part, it makes the code harder to understand, and, at the very least, getting the stride right would depend on compile-time constants I would need to supply. I wonder if there is some simpler way to do this, relying on the compiler to do all the work? I would be willing to try it if it meant changing only a line or two of code; it's not worth it if I have to do it manually (in this case, anyway). (Thinking about it as I write this, I may be able to use a union or struct with 48 bytes of padding and a few extra lines of code. I would have to give that some thought...)
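For what it's worth, the padded-struct idea from that last parenthetical might look something like the sketch below; the struct name is illustrative and the 64-byte cache line is an assumption.

```c
#include <stddef.h>

/* Sketch of the padding idea: each group of four floats is padded out to a
 * full (assumed 64-byte) cache line and aligned to it, so every group starts
 * at an aligned address. The 48 bytes of padding are the extra memory
 * bandwidth I'm not sure is worth paying. Names are illustrative. */
struct group {
    float v[4];                    /* 16 bytes of payload */
    char  pad[48];                 /* pad out to 64 bytes */
} __attribute__((aligned(64)));

_Static_assert(sizeof(struct group) == 64, "one group per cache line");

void scale_groups(struct group *restrict g, float k, size_t ngroups)
{
    for (size_t i = 0; i < ngroups; ++i)
        for (int j = 0; j < 4; ++j)
            g[i].v[j] *= k;
}
```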