I'm writing a function to create a gaussian filter (using the armadillo library), which may be either 2D or 3D depending on the number of dimensions of the input it receives. Here is the code:
template <class ty>
ty gaussianFilter(const ty& input, double sigma)
{
// Our filter will be initialized to the same size as our input.
ty filter = ty(input); // Copy constructor.
uword nRows = filter.n_rows;
uword nCols = filter.n_cols;
uword nSlic = filter.n_elem / (nRows*nCols); // If 2D, nSlic == 1.
// Offsets with respect to the middle.
double rowOffset = static_cast<double>(nRows/2);
double colOffset = static_cast<double>(nCols/2);
double sliceOffset = static_cast<double>(nSlic/2);
// Counters.
double x = 0 , y = 0, z = 0;
for (uword rowIndex = 0; rowIndex < nRows; rowIndex++) {
x = static_cast<double>(rowIndex) - rowOffset;
for (uword colIndex = 0; colIndex < nCols; colIndex++) {
y = static_cast<double>(colIndex) - colOffset;
for (uword sliIndex = 0; sliIndex < nSlic; sliIndex++) {
z = static_cast<double>(sliIndex) - sliceOffset;
// If-statement inside for-loop looks terribly inefficient
// but the compiler should take care of this.
if (nSlic == 1){ // If 2D, Gauss filter for 2D.
filter(rowIndex*nCols + colIndex) = ...
}
else
{ // Gauss filter for 3D.
filter((rowIndex*nCols + colIndex)*nSlic + sliIndex) = ...
}
}
}
}
As we see, there is an if-statement inside the inner-most loop, which checks if size of the third dimension (nSlic) is equal to 1. Once calculated in the beginning of the function, nSlic will not change it's value, so the compiler should be smart enough to optimise the conditional branch, and I shouldn't lose any performance.
However... if I remove the if-statement from within the loop, I get a performance boost.
if (nSlic == 1)
{ // Gauss filter for 2D.
for (uword rowIndex = 0; rowIndex < nRows; rowIndex++) {
x = static_cast<double>(rowIndex) - rowOffset;
for (uword colIndex = 0; colIndex < nCols; colIndex++) {
y = static_cast<double>(colIndex) - colOffset;
for (uword sliIndex = 0; sliIndex < nSlic; sliIndex++) {
z = static_cast<double>(sliIndex) - sliceOffset;
{filter(rowIndex*nCols + colIndex) = ...
}
}
}
}
else
{
for (uword rowIndex = 0; rowIndex < nRows; rowIndex++) {
x = static_cast<double>(rowIndex) - rowOffset;
for (uword colIndex = 0; colIndex < nCols; colIndex++) {
y = static_cast<double>(colIndex) - colOffset;
for (uword sliIndex = 0; sliIndex < nSlic; sliIndex++) {
z = static_cast<double>(sliIndex) - sliceOffset;
{filter((rowIndex*nCols + colIndex)*nSlic + sliIndex) = ...
}
}
}
}
After compiling with g++ -O3 -c -o main.o main.cpp
and measuring the execution time of both code variations I got the following:
(1000 repetitions, 2D matrix of size 2048)
If-inside:
- 66.0453 seconds
- 64.7701 seconds
If-outside:
- 64.0148 seconds
- 63.6808 seconds
Why doesn't the compiler optimise the branch if the value of nSlic doesn't even change? I necessarily have to restructure the code to avoid the if
-statement inside the for
-loop?