I would like to know under which conditions invariant parts of nested loops can be optimized.
For doing so, I wrote two functions one of which implements the factorization of three nested loops while the other doesn't.
The non-factorized function looks like:
template<int k>
double __attribute__ ((noinline)) evaluate(const double u[], const double phi[])
{
double f = 0.;
for (int i3 = 0;i3<k;++i3)
for (int i2 = 0;i2<k;++i2)
for (int i1 = 0;i1<k;++i1)
f += u[i1+k*(i2+k*i3)] * phi[i1] * phi[i2] * phi[i3];
return f;
}
While the factorized function is:
template<int k>
double __attribute__ ((noinline)) evaluate_fact(const double u[], const double phi[])
{
double f3 = 0.;
for (int i3 = 0;i3<k;++i3)
{
double f2 = 0.;
for (int i2 = 0;i2<k;++i2)
{
double f1 = 0.;
for (int i1 = 0;i1<k;++i1)
{
f1 += u[i1+k*(i2+k*i3)] * phi[i1];
}
f2 += f1 * phi[i2];
}
f3 += f2 * phi[i3];
}
return f3;
}
That I call with the following main:
int main()
{
const static unsigned int k=20;
double u[k*k*k];
double phi[k];
phi[0] = 1.;
for (unsigned int i=1;i<k;++i)
phi[i] = phi[i-1]*.333;
double e = 0.;
for (unsigned int i=0;i<1000;++i)
{
e += evaluate<k>(u, phi);
//e += evaluate_fact<k>(u, phi);
}
std::cout << "Evaluate " << e << std::endl;
}
For a small k both functions generate the same assembly code but after a certain size k~=10 the assembly does not look the same anymore and callgrind shows more operations being performed in the non-factorized version.
How should I write my code (if at all possible), or what should I tell GCC such that evaluate() is optimized to evaluate_fact() ???
I am using GCC 7.1.0. with flags -Ofast -fmove-loop-invariants
Using -funroll-loops does not help unless I add --param max-completely-peeled-insns=10000 --param max-completely-peel-times=10000 but that is a completely different thing because it is basically unrolling everything, the assembly is extensive.
Using -fassociative-math doesn't help either.
This paper claims that: "Traditional loop-invariant code motion, which is commonly applied by general-purpose compilers, only checks invariance with respect to the innermost loop." Does that apply to my code?
Thanks!