Efficiency comparison: function pointer vs function object vs branched code. Why is function object worst performing?

Question

I was hoping to improve performance by passing function pointer or a function object to a function call within a nested loop, in order to avoid branching of the loop. Below are three codes: one with function object, with function pointer and with branching. For any of compiler optimization option or for any of the problem size, the function pointer and object versions both perform the least. This is surprising to me; why would the overhead due to function pointer or object scale with problem size? Second question. Why is the function object performing worse than the function pointer?

Update

To the end I am also adding a lambda expression version of the same code. Again the brute force wins. The lambda expression version takes more than twice the time with or without optimization compared to the corresponding brute force code, and for different problem size.

Codes below. Execute with ./a.out [SIZE] [function choice]

Function Object:

    #include <iostream>
    #include <chrono>
    class Interpolator
    {
    public:
      Interpolator(){};
      virtual double operator()(double left, double right) = 0;
    };
    class FirstOrder : public Interpolator
    {
    public:
      FirstOrder(){};
      virtual double operator()(double left, double right) { return 2.0 * left * left * left + 3.0 * right; }
    };
    class SecondOrder : public Interpolator
    {
    public:
      SecondOrder(){};
      virtual double operator()(double left, double right) { return 2.0 * left * left + 3.0 * right * right; }
    };
    
    double kernel(double left, double right, Interpolator *int_func) { return (*int_func)(left, right); }
    
    int main(int argc, char *argv[])
    {
      double *a;
      int SIZE = atoi(argv[1]);
      int it = atoi(argv[2]);
      //initialize
      a = new double[SIZE];
      for (int i = 0; i < SIZE; i++)
        a[i] = (double)i;
      std::cout << "Initialized" << std::endl;
      Interpolator *first;
      switch (it)
      {
      case 1:
        first = new FirstOrder();
        break;
      case 2:
        first = new SecondOrder();
        break;
      }
      std::cout << "function" << std::endl;
      auto start = std::chrono::high_resolution_clock::now();
      //loop
      double g;
      for (int i = 0; i < SIZE; i++)
      {
        g = 0.0;
        for (int j = 0; j < SIZE; j++)
        {
          g += kernel(a[i], a[j], first);
        }
        a[i] += g;
      }
      auto stop = std::chrono::high_resolution_clock::now();
      auto duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
      std::cout << "Finalized in " << duration.count() << " ms" << std::endl;
      return 0;
    }

Function Pointer:

#include <iostream>
#include <chrono>

double firstOrder(double left, double right) { return 2.0 * left * left * left + 3.0 * right; }
double secondOrder(double left, double right) { return 2.0 * left * left + 3.0 * right * right; }
double kernel(double left, double right, double (*f)(double, double))
{
    return (*f)(left, right);
}
int main(int argc, char *argv[])
{
    double *a;
    int SIZE = atoi(argv[1]);
    int it = atoi(argv[2]);
    a = new double[SIZE];
    for (int i = 0; i < SIZE; i++)
        a[i] = (double)i; // initialization
    std::cout << "Initialized" << std::endl;

    //Func func(it);
    double (*func)(double, double);
    switch (it)
    {
    case 1:
        func = &firstOrder;
        break;
    case 2:
        func = &secondOrder;
        break;
    }
    std::cout << "function" << std::endl;
    auto start = std::chrono::high_resolution_clock::now();
    //loop
    double g;
    for (int i = 0; i < SIZE; i++)
    {
        g = 0.0;
        for (int j = 0; j < SIZE; j++)
        {
            g += kernel(a[i], a[j], func);
        }
        a[i] += g;
    }
    auto stop = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
    std::cout << "Finalized in " << duration.count() << " ms" << std::endl;
    return 0;
}

Branching:

#include <iostream>
#include <chrono>

double firstOrder(double left, double right) { return 2.0 * left * left * left + 3.0 * right; }
double secondOrder(double left, double right) { return 2.0 * left * left + 3.0 * right * right; }
int main(int argc, char *argv[])
{
    double *a;
    int SIZE = atoi(argv[1]); // array size
    int it = atoi(argv[2]);   // function choice
    //initialize
    a = new double[SIZE];
    double g;
    for (int i = 0; i < SIZE; i++)
        a[i] = (double)i; // initialization
    std::cout << "Initialized" << std::endl;

    auto start = std::chrono::high_resolution_clock::now();
    //loop
    for (int i = 0; i < SIZE; i++)
    {
        g = 0.0;
        for (int j = 0; j < SIZE; j++)
        {
            if (it == 1)
            {
                g += firstOrder(a[i], a[j]);
            }
            else if (it == 2)
            {
                g += secondOrder(a[i], a[j]);
            }
        }
        a[i] += g;
    }
    auto stop = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
    std::cout << "Finalized in " << duration.count() << " ms" << std::endl;
    return 0;
}

Lambda expression

#include <iostream>                                                             
#include <chrono>                                                               
#include<functional>                                                            
                                                                                
std::function<double(double, double)> makeLambda(int kind){                     
  return [kind] (double left, double right){                                    
    if(kind == 0) return 2.0 * left * left * left + 3.0 * right;                
    else if (kind ==1) return 2.0 * left * left + 3.0 * right * right;          
  };                                                                            
}                                                                               
                                                                                
int main(int argc, char *argv[])                                                
{                                                                               
  double *a;                                                                    
  int SIZE = atoi(argv[1]);                                                     
  int it = atoi(argv[2]);                                                       
  //initialize                                                                  
  a = new double[SIZE];                                                         
  for (int i = 0; i < SIZE; i++)                                                
    a[i] = (double)i;                                                           
  std::cout << "Initialized" << std::endl;                                      
  std::function<double(double,double)> interp ;                                 
  switch (it)                                                                   
  {                                                                             
  case 1:                                                                       
    interp = makeLambda(0);                                                     
    break;                                                                      
  case 2:                                                                       
    interp = makeLambda(1);                                                     
    break;                                                                      
  }                                                                             
  std::cout << "function" << std::endl;                                         
  auto start = std::chrono::high_resolution_clock::now();                       
  //loop                                                                        
  double g;                                                                     
  for (int i = 0; i < SIZE; i++)                                                
  {                                                                             
    g = 0.0;                                                                    
    for (int j = 0; j < SIZE; j++)                                              
    {                                                                           
      g += interp(a[i], a[j]);                                                  
    }                                                                           
    a[i] += g;                                                                  
  }                                                                             
  auto stop = std::chrono::high_resolution_clock::now();                        
  auto duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
  std::cout << "Finalized in " << duration.count() << " ms" << std::endl;       
  return 0;                                                                     
}

Compare with the fourth option: place `if` outside the loop (though there would be no difference as your compiler does it for you). — bipll, Nov 01 '20 at 04:44
Function object probably has more memory de-refs because you used virtual functions so it has to lookup the correct function based on the object type. — John3136, Nov 01 '20 at 04:44
@bipll I tried with -O0 as well, and the outcome is same. Will the compiler do that in that case as well? — P. Nair, Nov 01 '20 at 04:46
@John3136 I will do another example without virtual function. — P. Nair, Nov 01 '20 at 04:47
@P.Nair When you compile without optimizations, the compiler is not the one making the branching case faster. Rather, that case is likely benefiting from [branch prediction](https://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-processing-an-unsorted-array) and getting a benefit similar to the compiler's optimization. — JaMiT, Nov 01 '20 at 05:41
I also checked with clang with both -O0 and -O3 and the third version is faster. The third version generates smaller assembly code. However, I do not understand why. — Ali, Nov 01 '20 at 13:33
One of the main reasons I would use a functor or a lambda is to provide the function as an argument to a function within which computations with complex nested loops happen. So that the internal kernels need not have if conditions. Does the above comparison then mean that my reasoning is wrong for such a use case? — P. Nair, Nov 12 '20 at 07:26
@bipll I tried this and you are right. I am not posting the code here. This would be bloated code, which is what we are trying to avoid. Will the lambda or function versions become more efficient than the brute force, beyond some threshold complexity? — P. Nair, Nov 12 '20 at 07:44

Ali · Answer 1 · 2020-11-12T18:42:31.003

Although, this question is about performance, I have a few comments to improve the code:

Branching could be potentially more error-prone

Imagine that you want to add another interpolation function. Then, you need to define a new function and add a new case (for switch) or a new if/else. The solution could be to create a vector of lambdas:

std::vector<std::function<double(double, double)>> Interpolate {
  [](double left, double right) {return 2.0*left*left*left + 3.0*right;}, //first order
  [](double left, double right) {return 2.0*left*left + 3.0*right*right;} //second order
};

or alternatively:

double firstOrder(double left, double right) {return 2.0*left*left*left + 3.0*right;}
double secondOrder(double left, double right) {return 2.0*left*left + 3.0*right*right;}
std::array<double(*)(double, double), 2> Interpolate {firstOrder, secondOrder};

With this change, you do not need an if or switch statement. You simply write:

g += Interpolate[it-1] (x, y);

instead of

if (it == 1)
    g += firstOrder(a[i], a[j]);
else if (it == 2)
    g += secondOrder(a[i], a[j]);

Therefore, there is less maintenance required and it is less probable to miss an if/else statement.

Better to avoid bare new

Instead of writing double *a = new double[SIZE]; people suggest using std::vector<double> a (SIZE);. This way we do not need to release any resource and we avoid a potential memory leak in the code.

Back to the question, I do not see a reason why lambdas should result in better performance. Especially, in this case, that we do not benefit from the constexpr nature of lambdas.