I am working on a high-performance parallel computational fluid dynamics code that involves a lot of lightweight loops and therefore gains approximately 30% in performance if all important loops are fully unrolled.
This is easy for a fixed number of iterations using compiler directives: #pragma GCC unroll (16)
is recognized by both compilers I am aiming for, the Intel C++ compiler ICC and GCC, while #pragma unroll (16)
is sadly ignored by GCC. With ICC I can also use template parameters or preprocessor macros as the unroll count (similar to what you can do with nvcc), for instance
template <int N>
// ...
#pragma unroll (N)
for (int i = 0; i < N; ++i) {
// ...
}
or
#define N 16
#pragma unroll (N)
for (int i = 0; i < N; ++i) {
// ...
}
Both variants compile without errors or warnings with ICC (-Wall -w2 -w3), whereas the corresponding syntax #pragma GCC unroll (N)
with GCC (-Wall -pedantic) produces an error in GCC 9.2.1 20191102 on Ubuntu 18.04:
error: ‘#pragma GCC unroll’ requires an assignment-expression that evaluates to a non-negative integral constant less than 65535
#pragma GCC unroll (N)
Does anybody know a portable way (working with at least GCC and ICC) to unroll loops via compiler directives based on a template parameter? I actually only need full unrolling of the entire loop, so something like #pragma GCC unroll (all)
would already help me a lot.
I am aware that there are more or less complex strategies for unrolling loops with template metaprogramming, but since the loops in my application may be nested and can contain more complicated loop bodies, I feel that such a strategy would over-complicate my code and reduce readability.
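For reference, the kind of template-based unrolling meant above could look roughly like the following sketch. It assumes C++17 (fold expressions), and the names unrolled_for and array_sum are made up for this illustration; they are not from the actual code base. The calls are expanded at compile time, but whether each one is inlined is still up to the compiler.

```cpp
#include <cstddef>
#include <utility>

// Hypothetical helper (illustrative only): calls body(i) for i = 0 .. N-1.
// The index pack is expanded at compile time, so no run-time loop remains.
template <typename F, std::size_t... Is>
constexpr void unrolled_for_impl(F&& body, std::index_sequence<Is...>) {
    // Comma fold: expands to body(0), body(1), ..., body(N-1), in order.
    (body(Is), ...);
}

// Entry point: N is the template parameter that acts as the trip count.
template <std::size_t N, typename F>
constexpr void unrolled_for(F&& body) {
    unrolled_for_impl(body, std::make_index_sequence<N>{});
}

// Example use: sum of a fixed-size array whose length is a template parameter.
template <std::size_t N>
double array_sum(const double (&a)[N]) {
    double s = 0.0;
    unrolled_for<N>([&](std::size_t i) { s += a[i]; });
    return s;
}
```

Nesting means nesting lambdas inside lambdas, which is exactly the readability cost described above.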