3

I would like to know under which conditions invariant parts of nested loops can be optimized.

For doing so, I wrote two functions one of which implements the factorization of three nested loops while the other doesn't.

The non-factorized function looks like:

template<int k>
double __attribute__ ((noinline)) evaluate(const double u[], const double phi[])
{
  double f = 0.;
  for (int i3 = 0;i3<k;++i3)
    for (int i2 = 0;i2<k;++i2)
      for (int i1 = 0;i1<k;++i1)
        f += u[i1+k*(i2+k*i3)] * phi[i1] * phi[i2] * phi[i3];
  return f;
}

While the factorized function is:

template<int k>
double __attribute__ ((noinline)) evaluate_fact(const double u[], const double phi[])
{
  double f3 = 0.;
  for (int i3 = 0;i3<k;++i3)
    {
      double f2 = 0.;
      for (int i2 = 0;i2<k;++i2)
        {
          double f1 = 0.;
          for (int i1 = 0;i1<k;++i1)
            {
              f1 += u[i1+k*(i2+k*i3)] * phi[i1];
            }
          f2 += f1  * phi[i2];
        }
      f3 += f2 * phi[i3];
    }
  return f3;
}

That I call with the following main:

int main()
{
  const static unsigned int k=20;
  double u[k*k*k];
  double phi[k];

  phi[0] = 1.;
  for (unsigned int i=1;i<k;++i)
    phi[i] = phi[i-1]*.333;

  double e = 0.;
  for (unsigned int i=0;i<1000;++i)
    {
      e += evaluate<k>(u, phi);
      //e += evaluate_fact<k>(u, phi);
    }
  std::cout << "Evaluate " << e << std::endl;
}

For a small k both functions generate the same assembly code but after a certain size k~=10 the assembly does not look the same anymore and callgrind shows more operations being performed in the non-factorized version.

How should I write my code (if at all possible), or what should I tell GCC such that evaluate() is optimized to evaluate_fact() ???

I am using GCC 7.1.0. with flags -Ofast -fmove-loop-invariants

Using -funroll-loops does not help unless I add --param max-completely-peeled-insns=10000 --param max-completely-peel-times=10000 but that is a completely different thing because it is basically unrolling everything, the assembly is extensive.

Using -fassociative-math doesn't help either.

This paper claims that: "Traditional loop-invariant code motion, which is commonly applied by general-purpose compilers, only checks invariance with respect to the innermost loop." Does that apply to my code?

Thanks!

Astor
  • 288
  • 1
  • 7
  • 1
    Are you using any optimization flags when compiling with gcc? – Leonardo Alves Machado Jul 24 '17 at 21:35
  • @LeonardoAlvesMachado Yes! I am sorry not to have added it! I modified my post accordingly. Thanks! – Astor Jul 24 '17 at 21:40
  • 3
    It might have to do with operations on doubles not being associative. See these questions: https://stackoverflow.com/questions/13913017/are-floating-point-operations-in-c-associative, https://stackoverflow.com/questions/6430448/why-doesnt-gcc-optimize-aaaaaa-to-aaaaaa – clcto Jul 24 '17 at 21:46
  • 1
    I take that back as `-Ofast` should enable `-ffast-math` – clcto Jul 24 '17 at 21:48
  • @clcto Your comment is very relevant nonetheless! I also tried with -fassociative-math but it does not work for me... Thanks for the detail, I added it to the original post! – Astor Jul 24 '17 at 21:59
  • 1
    You can set a temporary (constant) to the expression `k*(i2+k*i3)` and place above the innermost `for` loop since the expression does not change inside the `for` loop. – Thomas Matthews Jul 24 '17 at 22:32
  • @ThomasMatthews Thanks! I tried it but I still get the difference :-/ – Astor Jul 24 '17 at 22:53
  • @ThomasMatthews In fact, I get the same numbers so I guess that the invariants `k*i3`, `k*(i2+k*i3)` are actually being moved. The arrays are the problem. – Astor Jul 24 '17 at 22:56
  • 1
    You may be able to speed up the execution by creating temporary arrays for `f1, f2` and `f3`. For example, first calculate all `f1[n]`. Next, calculate all `f2[n]` and finally all `f3` arrays. The objective with this organization is to allow the compiler to generate SIMD instructions. Also, more of the variables will be in the data cache, so there will be fewer cache misses. – Thomas Matthews Jul 24 '17 at 23:18
  • @ThomasMatthews Thanks again. I fully agree. These examples were created to show the possible hoisting being missed by GCC optimization, I'd like to know if there is a problem with GCC or is it my code before addressing the GCC guys. – Astor Jul 24 '17 at 23:25

0 Answers0