I'm writing a function that creates a Gaussian filter (using the Armadillo library), which may be either 2D or 3D depending on the number of dimensions of the input it receives. Here is the code:

template <class ty>
ty gaussianFilter(const ty& input, double sigma)
{
    // Our filter will be initialized to the same size as our input.
    ty filter = ty(input); // Copy constructor.

    uword nRows = filter.n_rows;
    uword nCols = filter.n_cols;
    uword nSlic = filter.n_elem / (nRows*nCols); // If 2D, nSlic == 1.

    // Offsets with respect to the middle.
    double rowOffset = static_cast<double>(nRows/2);
    double colOffset = static_cast<double>(nCols/2);
    double sliceOffset = static_cast<double>(nSlic/2);

    // Counters.
    double x = 0 , y = 0, z = 0;

    for (uword rowIndex = 0; rowIndex < nRows; rowIndex++) {
      x = static_cast<double>(rowIndex) - rowOffset;
      for (uword colIndex = 0; colIndex < nCols; colIndex++) {
        y = static_cast<double>(colIndex) - colOffset;
        for (uword sliIndex = 0; sliIndex < nSlic; sliIndex++) {
          z = static_cast<double>(sliIndex) - sliceOffset;
          // If-statement inside the for-loop looks terribly inefficient,
          // but the compiler should take care of this.
          if (nSlic == 1) { // If 2D, Gauss filter for 2D.
            filter(rowIndex*nCols + colIndex) = ...
          }
          else { // Gauss filter for 3D.
            filter((rowIndex*nCols + colIndex)*nSlic + sliIndex) = ...
          }
        }
      }
    }
    return filter;
}

As you can see, there is an if-statement inside the innermost loop, which checks whether the size of the third dimension (nSlic) equals 1. Once computed at the beginning of the function, nSlic never changes its value, so I assumed the compiler would be smart enough to optimise the conditional branch away and I shouldn't lose any performance.

However... if I move the if-statement out of the loops, I get a performance boost:

if (nSlic == 1)
  { // Gauss filter for 2D.
    for (uword rowIndex = 0; rowIndex < nRows; rowIndex++) {
      x = static_cast<double>(rowIndex) - rowOffset;
      for (uword colIndex = 0; colIndex < nCols; colIndex++) {
        y = static_cast<double>(colIndex) - colOffset;
        for (uword sliIndex = 0; sliIndex < nSlic; sliIndex++) {
          z = static_cast<double>(sliIndex) - sliceOffset;
          filter(rowIndex*nCols + colIndex) = ...
        }
      }
    }
  }
else
  { // Gauss filter for 3D.
    for (uword rowIndex = 0; rowIndex < nRows; rowIndex++) {
      x = static_cast<double>(rowIndex) - rowOffset;
      for (uword colIndex = 0; colIndex < nCols; colIndex++) {
        y = static_cast<double>(colIndex) - colOffset;
        for (uword sliIndex = 0; sliIndex < nSlic; sliIndex++) {
          z = static_cast<double>(sliIndex) - sliceOffset;
          filter((rowIndex*nCols + colIndex)*nSlic + sliIndex) = ...
        }
      }
    }
  }

After compiling with g++ -O3 -c -o main.o main.cpp and measuring the execution time of both code variations I got the following:
(1000 repetitions, 2D matrix of size 2048)

If-inside:

  • 66.0453 seconds
  • 64.7701 seconds

If-outside:

  • 64.0148 seconds
  • 63.6808 seconds

Why doesn't the compiler optimise the branch away if the value of nSlic never changes? Do I really have to restructure the code to avoid the if-statement inside the for-loop?

Daniel
  • I'm confused by what you're asking. You moved an if statement out of a nested loop and are surprised your code runs faster? Do you expect the compiler to convert your first version of the code into your second? – Kevin Dec 14 '15 at 23:26
  • I believed that if the `if`-statement would always yield the same result, the compiler would optimise it away. My assumptions come from [sorted vs. unsorted array](http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array). I'd like to understand why this is not the case, and when I can expect such compiler optimisations. – Daniel Dec 14 '15 at 23:28
  • Oh I see. That isn't the work of the compiler though. The processor handles branch prediction. – Kevin Dec 14 '15 at 23:32
  • Branch prediction is a mechanism physically built into processors themselves to minimise the impact loops have on instructions in [the pipeline](https://en.wikipedia.org/wiki/Instruction_pipelining), it has nothing to do with compiler optimisations. – kamilk Dec 14 '15 at 23:32
  • So it has nothing to do with the instructions themselves? And shouldn't the processor figure out that the branch here is always going in the same direction? – Daniel Dec 14 '15 at 23:34
  • A simple branch predictor uses a small table of small values. Part of the branch instruction's address is used as the index into the table to determine the likelihood of the branch being taken. You could have collisions in the predictor's table, leading to branch prediction being less accurate than normal. I'm not sure if this is the answer to your problem though. – Kevin Dec 14 '15 at 23:37
  • The compiler does not perform loop unrolling or function inlining when you specify `-O2`. Try `-O3` and also `-funroll-loops`, or just a better compiler, like Intel's. – Soonts Dec 14 '15 at 23:42
  • @dpgomez: The compiler optimization you're thinking of is called [loop unswitching](https://en.wikipedia.org/wiki/Loop_unswitching). If you're using gcc you might need to specify `-O3` or `-funswitch-loops` to enable it. – Blastfurnace Dec 14 '15 at 23:43
  • I'll give -O3 a try. I'm still confused, since I was just told that branch prediction has nothing to do with compiler optimisations.. – Daniel Dec 14 '15 at 23:45
  • Why does your 2-D branch even enter the third loop? It doesn't make any sense for either code organization, since the `z` value calculated from the third loop variable is not used within, and the third loop executes exactly once. All the setup and conditionals and calculation of `z` on that loop are a waste when `nSlic == 1` – Ben Voigt Dec 14 '15 at 23:45
  • Use a boolean template parameter, and instantiate the appropriate template for each "constant" value of the condition. This solution avoids code duplication and essentially asks the compiler to do the optimization for you ;) – Karoly Horvath Dec 14 '15 at 23:51
  • I've changed the optimisation to O3 and got similar results. Thanks for the links, I'll read up on `loop unswitching` and the pipeline. It still isn't 100% clear to me the interplay between the optimised instructions my compiler outputs, and the branch prediction done by the processor... – Daniel Dec 14 '15 at 23:58
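Karoly's template-parameter suggestion can be sketched as follows. This is a hypothetical, simplified version (not the actual code from the question): `fillGaussian` is a made-up name, `std::vector` stands in for the Armadillo types, and the Gaussian expression is reduced to an unnormalised exponential. The point is only that the 2D/3D decision becomes a compile-time constant, so each instantiation contains a single loop body with no runtime branch:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// ThreeD is known at compile time, so the dead branch inside the loop
// is eliminated in each instantiation -- the compiler effectively does
// the loop unswitching for you, without duplicating source code.
template <bool ThreeD>
void fillGaussian(std::vector<double>& filter,
                  std::size_t nRows, std::size_t nCols, std::size_t nSlic,
                  double sigma)
{
    const double rowOffset   = static_cast<double>(nRows / 2);
    const double colOffset   = static_cast<double>(nCols / 2);
    const double sliceOffset = static_cast<double>(nSlic / 2);
    const double twoSigmaSq  = 2.0 * sigma * sigma;

    for (std::size_t r = 0; r < nRows; ++r) {
        const double x = static_cast<double>(r) - rowOffset;
        for (std::size_t c = 0; c < nCols; ++c) {
            const double y = static_cast<double>(c) - colOffset;
            for (std::size_t s = 0; s < nSlic; ++s) {
                const double z = static_cast<double>(s) - sliceOffset;
                if (ThreeD) // compile-time constant
                    filter[(r * nCols + c) * nSlic + s] =
                        std::exp(-(x * x + y * y + z * z) / twoSigmaSq);
                else
                    filter[r * nCols + c] =
                        std::exp(-(x * x + y * y) / twoSigmaSq);
            }
        }
    }
}

// Dispatch once, at the call site:
//   nSlic == 1 ? fillGaussian<false>(f, nRows, nCols, 1, sigma)
//              : fillGaussian<true>(f, nRows, nCols, nSlic, sigma);
```

The single runtime branch happens once at the call site rather than once per element.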

3 Answers


Your error is here:

optimise the conditional branch, and I shouldn't lose any performance

Branch prediction may be helping you a lot compared to actually suffering the pipeline stall of a mispredicted branch. But the comparison is still an extra instruction in the pipeline, and that still has a cost. The processor magic has reduced the cost of the useless code... greatly reduced, but not to zero.

Ben Voigt

Having an extra variable in the loop will affect register usage, which might affect timing even if the branch prediction is working properly. You would need to look at the generated assembly to know. It also might affect the cache hit rate, which is hard to detect.

Mark Ransom

The interplay between compiler and HW is this: the compiler may be able to optimize the branch away, making the code itself optimized, but as you can see, that generates a lot of code bloat, as it effectively duplicates the entire loop. Some compilers may include this optimization by default, while others may require explicitly asking for it, as you have done.

Alternatively, if the compiler avoids this optimization, the code retains the branch and the HW is left to predict it as best as it can. This involves complicated branch predictors, which have finite tables and are therefore limited in how much they can learn. In this example you do not have too many competing branches (the loops, the function calls and returns, and the if we're discussing), but we do not see the internal workings of the function being called; it may have more branch instructions (flushing out what the predictor has learned outside), or it may be long enough to flush any global history the predictor may be using. It's hard to say without seeing the code, and without knowing what exactly your branch predictor does (which depends, among other things, on the CPU version you use).

One more note: it may not necessarily be related to branch prediction. Changing the code like that may change alignment in the code cache or in some internal cyclic buffers used for optimizing loops (such as the loop stream detector), which can cause dramatic changes in performance. The only way to know is to run some profiling based on HW counters (perf, VTune, etc.) and measure the change in the number of branches and mispredictions.
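As a concrete starting point for that kind of measurement, here is a minimal sketch using Linux perf, assuming perf is installed and that you build the two variants as separate binaries (the file and binary names below are made up for illustration):

```shell
# Build both variants with the same flags used in the question.
g++ -O3 -o main_if_inside  main_if_inside.cpp
g++ -O3 -o main_if_outside main_if_outside.cpp

# Compare branch counts and misprediction rates, averaged over 5 runs.
perf stat -r 5 -e branches,branch-misses ./main_if_inside
perf stat -r 5 -e branches,branch-misses ./main_if_outside
```

If the two variants show similar misprediction rates but different branch counts, the cost is the extra instructions themselves rather than mispredictions.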

Leeor