2

I have to use a nested for-loop to compute the entries of a Eigen::MatrixXd type matrix output columnwise. Here input[0], input[1] and input[2] are defined as Eigen::ArrayXXd in order to use the elementwise oprerations. This part seems to be the bottleneck for my code. Can anyone help me to accelerate this loop? Thanks!

 for (int i = 0; i < r; i++) {
    for (int j = 0; j < r; j++) {
      for (int k = 0; k < r; k++) {
        output.col(i * (r * r) + j * r + k) =
            input[0].col(i) * input[1].col(j) * input[2].col(k);
      }
    }
  }
Doudou
  • 121
  • 4
  • 4
    *This part seems to be the bottleneck for my code.* Seems to be or is? You've seen this in your profiling of your code? (Which you've compiled with optimizations turned on?) – Borgleader Nov 13 '17 at 16:49
  • @Borgleader This is in fact the second most time consuming part of my code. – Doudou Nov 13 '17 at 16:56
  • 2
    It does a lot of work if `r` is large. But an optimizing compiler ought to see that `i * (r * r) + j * r` and `input[0].col(i) * input[1].col(j) ` doesn't vary with `k` and move those out of the inner loop. But we cannot tell if it does from this snippet. – Bo Persson Nov 13 '17 at 16:59
  • 1
    You would have to a) post a [mcve] and/or b) post much much more info like those you get from perf. We have almost nothing to work on here – Passer By Nov 13 '17 at 17:01
  • This might also be better suited for https://softwareengineering.stackexchange.com/ – Passer By Nov 13 '17 at 17:04
  • Possible duplicate of [Efficient Cartesian Product algorithm](https://stackoverflow.com/questions/1741364/efficient-cartesian-product-algorithm) – Caleth Nov 13 '17 at 17:05

2 Answers2

1

When thinking about optimizing code of a for loop, it helps to think, "Are there redundant calculations that I can eliminate?"

Notice how in the inner most loop, only k is changing. You should move all possible calculations that don't involve k out of that loop:

for (int i = 0; i < r; i++) {
  int temp1 = i * (r * r);
  for (int j = 0; j < r; j++) {
    int temp2 = j * r;
    for (int k = 0; k < r; k++) {
      output.col(temp1 + temp2 + k) =
          input[0].col(i) * input[1].col(j) * input[2].col(k);
    }
  }
}

Notice how i * (r * r) is being calculated over and over, but the answer is always the same! You only need to recalculate this when i increments. The same goes for j * r.

Hopefully this helps!

ajrind
  • 64
  • 4
1

To reduce the number of flops, you should cache the result of input[0]*input[1]:

ArrayXd tmp(input[0].rows());
for (int i = 0; i < r; i++) {
 for (int j = 0; j < r; j++) {
  tmp = input[0].col(i) * input[1].col(j);
  for (int k = 0; k < r; k++) {
    output.col(i * (r * r) + j * r + k) = tmp * input[2].col(k);
  }
 }
}

Then, to fully use your CPU, enable AVX/FMA with -march=native and of course compiler optimizations (-O3).

Then, to get an idea of what you could gain more, measure accurately the time taken by this part, count the number of multiplications (r^2*(n+r*n)), and then compute the number of floating point operations per second you achieve. Then compare it to the capacity of your CPU. If you're good, then the only option is to multithread one of the for loop using, e.g., OpenMP. The choice of which for loop depends on the size of your inputs, but you can try with the outer one, making sure each thread has its own tmp array.

ggael
  • 28,425
  • 2
  • 65
  • 71