Automatic definition of visitors from CRTP (CRTP using boost foreach with boost variant)

Question

I need to implement an efficient visit of a vector of objects implementing a same interface. Until now, I was using inheritence with virtual functons: the interface is defined as an abstract class with pure-virtual functions and each class of objects implements the virtual functions. The vector of objects is simply a vector of pointers on the abstract class (see the end of the message for an example of dynamic visit).

I need a faster visit of the object collection. Since I know at compilation time all the possible classes of object, I used boost::variant to implement the collection of objects (ie a vector of boost::variant). I need the extra definition of a visitor scheme to go through the collection. To make explicit that all the objects implement the same interface, I used a CRTP to obtain a static inheritence: the interface is a CRTP abstraction, and each class of object derivated from the templated CRTP abstract class.

Here is an example of the CRTP implementation. The interface simply defines two functions f() and g(double). There is two derivated classes C1 and C2 implementing the interface (with identical behavior).

#include <vector>
#include <boost/foreach.hpp>
#include <boost/shared_ptr.hpp>
#include <boost/variant.hpp>

namespace testVariantSimple
{

  // Definition of the interface (abstract class).
  template< typename C >
  struct CBase
  {
    void f() { static_cast<C&>(*this).f(); }
    int g(const double & x) { return static_cast<C&>(*this).g(x); }
  };

  // Definition of the first implementation.
  struct C1 : public CBase<C1>
  {
    void f();
    int g(const double & x);
  };
  void C1::f() { return ; }
  int C1::g(const double & x) { return sizeof(x); }

  // Definition of the second implementation.        
  struct C2 : public CBase<C2>
  {
    void f();
    int g(const double & x);
  };
  void C2::f() { return ; }
  int C2::g(const double & x) { return sizeof(x); }

  // Definition of the visitor for the first function of the interface.  
  class f_visitor : public boost::static_visitor<int>
  {
  public:
    template< typename C >
    int operator()(CBase<C> &c ) const { c.f(); return 0; }
  };

  // Definition of the visitor for the second function of the interface.  
  struct g_visitor : public boost::static_visitor<int>
  {
    const double & x;
    g_visitor( const double & x ) : x(x) {}
  public:
    template< typename C >
    int operator()(CBase<C> & c) const { return c.g(x);  }
  };

  // Example of use: construct a random collection and visit it.
  void test(int nbSample)
  {
    typedef boost::variant<C1,C2> CV;
    std::vector<CV> vec;

    for( int i=0;i<nbSample;++i ) 
      {
    switch( std::rand() % 2 )
      {
      case 1:       vec.push_back( C1() ); break;
      case 2:       vec.push_back( C2() ); break;
      }
      }

    double argdouble;
    BOOST_FOREACH(CV & c, vec)
      {
    boost::apply_visitor( f_visitor(), c );
    g_visitor g(argdouble);
    boost::apply_visitor( g, c );
      }
  }
}

This piece of code works and is 15 times more efficient than the code using dynamic inheritence (see the end of the message for the code using dynamics). The code is slightly more difficult to read for somebody not familiar with CRTP, but not more difficult to maintain or write. Since the interface is explicit with the CRTP, the visitor implementation are rather trivial, but verbose, painful to understand and to use.

My question is simple: is it possible to define the visitor automatically from the CRTP interface. I would like to avoid the extra definition of f_visitor and g_visitor, and to obtain a more readable look like that:

   BOOST_FOREACH( CV & c, vec )
   {
      c.f();
      c.g(argdouble);
   }

Thanks for your help. For the interested reader, here is the same code using the virtual inheritence.

namespace testDynamicSimple
{
  struct CBase
  {
    virtual void f() = 0;
    virtual int g(const double & x) = 0;
  };

  struct C1 : public CBase  
  {
    void f() {}
    int g(const double & x) { return 1; }
  };
  struct C2 : public CBase
  {
    void f() {}
    int g(const double & x) { return 2; }
  };

  bool test(int nbSample)
  {
    typedef boost::shared_ptr<CBase> CV;
    std::vector<CV> vec;
    for( int i=0;i<nbSample;++i ) 
      {
    switch( std::rand() % 5 )
      {
      case 1:       vec.push_back( CV(new C1()) ); break;
      case 2:       vec.push_back( CV(new C2()) ); break;
      }
      }

    double argdouble = 0.0;
    BOOST_FOREACH( CV & c, vec)
    {
      c->f();
      c->g(argdouble);
    }
  }
}

*"here is the same code using the virtual inheritence."* You mean *using virtual functions*. Virtual inheritance is something different: http://stackoverflow.com/questions/21558/in-c-what-is-a-virtual-base-class — dyp, Jul 14 '14 at 19:39
I don't quite see what's the reason for the base class template in the first code sample. — dyp, Jul 14 '14 at 19:45
@dyp: thanks for your relevant remarks. I consequently corrected the text.The use of the CRTP is to make the interface explicit. Indeed as you said, it is not necessary in the first code sample since the relation between the objects is handled by the visitors. I would like to use the fact that the CRTP explicitly describes the interface to simplify the visitor scheme. More precisely, looking at the visitor code, it is rather trivial even if verbose. I "feel" that there should be a simpler way to "automatically" define them from the CRTP but I cannot find it. — nmsd, Jul 15 '14 at 08:25
Ah, I see. By applying a visitor, you restore a concrete type from a type erased at run-time. That is, the code inside the FOREACH loop has to "support" multiple different types. You need overloading to do that. In C++1y, you could use a polymorphic lambda, something like `apply_visitor([&](auto& x) { x.f(); x.g(argdouble); }, c)`. In C++03 and C++11 you're out of luck, since you cannot define local templates. I think it is possible though to use virtual functions without much performance impact, since one indirection is required in any way. `shared_ptr` is probably the culprit. — dyp, Jul 15 '14 at 10:10
Thanks for the answer. I tried with or without boost::shared_ptr. There is no significative difference. The timings on a Xeon 2.7GHz with gcc 4.6 are the following. For going through a vector of 100 elements, 0.3 microsecs if using virtual functions (for either boost::shared or c pointers) and 0 (not measurable) for boost::variant with CRTP. Nota: 0.3us are relevant for me since the whole algorithm took around 5us on the same computer. — nmsd, Jul 18 '14 at 12:05
What I meant was the indirection performed when using `shared_ptr`, or any other kind of pointer. Using a `boost::variant` or something similar makes all the elements of the vector contiguous in memory, so that memory access is probably much faster (prefetching etc.). The cost of a virtual function call is usually just the missed opportunity of inlining (not much more expensive than a function call). `boost::variant` also needs some kind of dynamic dispatch. — dyp, Jul 18 '14 at 13:10

aaragon · Answer 1 · 2014-08-19T13:41:06.510

DISCLAIMER: This is not an answer, just a benchmark of the different approaches that can be used to solve the given problem.

The idea is to compare the boost::variant to other techniques (virtual functions on raw pointers, virtual function on shared pointers, and a couple of other techniques). Here is the benchmark code I used:

#include <vector>
#include <boost/foreach.hpp>
#include <boost/shared_ptr.hpp>
#include <boost/variant.hpp>
#include <memory>
#include <type_traits>

namespace novirtual {

  struct C {
    void f() {}
    int g(const double &x) { return 1; }
  };

  void test(int nbSample) {

    std::vector<C> vec;
    vec.reserve(nbSample);
    for (int i = 0; i < nbSample; ++i)
      vec.emplace_back(C());

    double argdouble = 0.0;
    BOOST_FOREACH(C &c, vec) {
      c.f();
      c.g(argdouble);
    }
  }
}
namespace crtp {

// Definition of the interface (abstract class).
template <typename C> struct CBase {
  void f() { static_cast<C &>(*this).f(); }
  int g(const double &x) { return static_cast<C &>(*this).g(x); }
};

// Definition of the first implementation.
struct C1 : public CBase<C1> {
  void f();
  int g(const double &x);
};
void C1::f() { return; }
int C1::g(const double &x) { return sizeof(x); }

// Definition of the second implementation.
struct C2 : public CBase<C2> {
  void f();
  int g(const double &x);
};
void C2::f() { return; }
int C2::g(const double &x) { return sizeof(x); }

// Definition of the visitor for the first function of the interface.
class f_visitor : public boost::static_visitor<int> {
public:
  template <typename C> int operator()(CBase<C> &c) const {
    c.f();
    return 0;
  }
};

// Definition of the visitor for the second function of the interface.
struct g_visitor : public boost::static_visitor<int> {
  const double &x;
  g_visitor(const double &x) : x(x) {}

public:
  template <typename C> int operator()(CBase<C> &c) const { return c.g(x); }
};

// Example of use: construct a random collection and visit it.
void test(int nbSample) {
  typedef boost::variant<C1, C2> CV;
  std::vector<CV> vec;
  vec.reserve(nbSample);
  for (int i = 0; i < nbSample; ++i) {
    switch (std::rand() % 2) {
    case 0:
      vec.push_back(C1());
      break;
    case 1:
      vec.push_back(C2());
      break;
    }
  }

  double argdouble;
  BOOST_FOREACH(CV & c, vec) {
    boost::apply_visitor(f_visitor(), c);
    g_visitor g(argdouble);
    boost::apply_visitor(g, c);
  }
}
}

namespace virt_fun {
struct CBase {
  virtual void f() = 0;
  virtual int g(const double &x) = 0;
};

struct C1 : public CBase {
  void f() {}
  int g(const double &x) { return 1; }
};
struct C2 : public CBase {
  void f() {}
  int g(const double &x) { return 2; }
};

void test(int nbSample) {
  std::vector<CBase *> vec;
  vec.reserve(nbSample);
  for (int i = 0; i < nbSample; ++i) {
    switch (std::rand() % 2) {
    case 0:
      vec.push_back(new C1());
      break;
    case 1:
      vec.push_back(new C2());
      break;
    }
  }

  double argdouble = 0.0;
  BOOST_FOREACH(CBase * c, vec) {
    c->f();
    c->g(argdouble);
  }
}
}

namespace shared_ptr {
struct CBase {
  virtual void f() = 0;
  virtual int g(const double &x) = 0;
};

struct C1 : public CBase {
  void f() {}
  int g(const double &x) { return 1; }
};
struct C2 : public CBase {
  void f() {}
  int g(const double &x) { return 2; }
};

void test(int nbSample) {
  typedef boost::shared_ptr<CBase> CV;
  std::vector<CV> vec;
  vec.reserve(nbSample);
  for (int i = 0; i < nbSample; ++i) {
    switch (std::rand() % 2) {
    case 0:
      vec.push_back(CV(new C1()));
      break;
    case 1:
      vec.push_back(CV(new C2()));
      break;
    }
  }

  double argdouble = 0.0;
  BOOST_FOREACH(CV & c, vec) {
    c->f();
    c->g(argdouble);
  }
}
}

namespace virt_cont {

struct CBase {
  virtual void f() = 0;
  virtual int g(const double &x) = 0;
  virtual ~CBase() = default;
};

struct C1 final : public CBase {
  void f() {}
  int g(const double &x) { return 1; }
};
struct C2 final : public CBase {
  void f() {}
  int g(const double &x) { return 2; }
};

struct foo {
  std::aligned_storage<sizeof(C2)>::type buf;
  CBase *ptr;

  foo(C1 c) { ptr = new ((void *)&buf) C1(c); }
  foo(C2 c) { ptr = new ((void *)&buf) C2(c); }

  foo(foo &&x) : buf(x.buf) { ptr = reinterpret_cast<CBase *>(&buf); } // UB

  foo &operator=(foo &&x) {
    buf = x.buf;
    return *this;
  } // maybe UB?

  ~foo() { ptr->~CBase(); }
};
void test(int nbSample) {
  std::vector<foo> vec;
  vec.reserve(nbSample);
  for (int i = 0; i < nbSample; ++i) {
    switch (std::rand() % 2) {
    case 0:
      vec.emplace_back(C1());
      break;
    case 1:
      vec.emplace_back(C2());
      break;
    }
  }

  double argdouble = 0.0;
  BOOST_FOREACH(foo & c, vec) {
    c.ptr->f();
    c.ptr->g(argdouble);
  }
}
}

namespace locals {
struct CBase {
  virtual void f() = 0;
  virtual int g(const double &x) = 0;
  virtual ~CBase() = default;
};

struct C1 final : public CBase {
  void f() {}
  int g(const double &x) { return 1; }
};
struct C2 final : public CBase {
  void f() {}
  int g(const double &x) { return 2; }
};

void test(int nbSample) {
  C1 c1;
  C2 c2;

  std::vector<CBase *> vec;
  vec.reserve(nbSample);
  for (int i = 0; i < nbSample; ++i) {
    switch (std::rand() % 2) {
    case 0:
      vec[i] = &c1;
      break;
    case 1:
      vec[i] = &c2;
      break;
    }
  }

  double argdouble = 0.0;
  BOOST_FOREACH(CBase * c, vec) {
    c->f();
    c->g(argdouble);
  }
}
}

#include <chrono>
#include <string>

#define VER 4

int main(int argc, char *argv[]) {

  int n = 100000;
  for (int i = 0; i < 4; ++i) {

    std::chrono::time_point<std::chrono::system_clock> start, end;
    start = std::chrono::system_clock::now();

#if VER == 0
    virt_fun::test(n);
#elif VER == 1
    shared_ptr::test(n);
#elif VER == 2
    crtp::test(n);
#elif VER == 3
    virt_cont::test(n);
#elif VER == 4
    locals::test(n);
#endif

    end = std::chrono::system_clock::now();

    std::chrono::duration<double> elapsed_seconds = end - start;
    std::time_t end_time = std::chrono::system_clock::to_time_t(end);

    std::cout << "elapsed time: " << elapsed_seconds.count() << "s\n";

    n *= 10;
  }
  return 0;
}

The code was compiled as clang++ -O3 main.cpp -I/Users/aaragon/Local/include, with

$ clang++ --version
Apple LLVM version 5.1 (clang-503.0.40) (based on LLVM 3.4svn)
Target: x86_64-apple-darwin13.3.0
Thread model: posix

and run on a Macbook Pro with a 2.6 GHz Intel Core i7 processor and 16 GB 1600 MHz DDR3 ram.

These results take into account the bug fixes to the original code, plus additional code provided by @dyp that uses a wrapper class with std::aligned_storage. In the table below, the no virtual column corresponds to no inheritance at all and is given as reference.

  |    size   | no virtual| raw ptr   | shared_ptr |  variant  | wrapper   |  locals   |
  |-----------|-----------|------------|-----------|-----------|-----------|-----------|
  |    100000 | 0.000235s | 0.008309s |  0.030801s | 0.003935s | 0.004222s | 0.001925s |
  |   1000000 | 0.002053s | 0.061843s |  0.288403s | 0.029874s | 0.033697s | 0.01478s  |
  |  10000000 | 0.017687s | 0.627659s |  2.91868s  | 0.29699s  | 0.322109s | 0.141245s |
  | 100000000 | 0.177425s | 6.2493s   | 28.9586s   | 3.00427s  | 3.21402s  | 1.40478s  |

At this stage something's certain: boost::shared_ptr is extremely slow, and there's no clear winner between boost::variant and the wrapper.

This is not an answer. Arguably, it should be a comment (which had to be shorter, then). -- Virtual functions are typically not "slow" because of the indirection, but because of the missed opportunity for inlining. When compiling without optimizations, no inlining takes place at all (=> cost of a function call, less optimization of the code at the call site). I guess `boost::variant` even performs some dispatching steps, which requires several function calls. Those would be eliminated under `-O3`. — dyp, Aug 19 '14 at 09:41
@dyp I have no way to put all that info in a comment... I'm not trying to answer the question, I'm just looking for an insight on the same matter. nmsd doesn't have to check my answer as the right answer. — aaragon, Aug 19 '14 at 09:47
I'm not even able to run the test for 10^9 rounds (using g++): The `shared_ptr` consumes 16 byte, plus each `new`ed instance 8 byte (maybe plus overhead), i.e. at least 24*10^9 byte unless the compiler can optimize out some of those allocations. — dyp, Aug 19 '14 at 10:02
Note that there are two bugs in the OP's code for virtual functions: a `% 5` instead of `% 2` and a `bool` return type instead of `void`. — dyp, Aug 19 '14 at 10:05
I fixed the bool one, but the other one I didn't check. I'm amazed though, and I wonder why going through the virtual function approach at all if there's a much more efficient way to do this. Convenience? — aaragon, Aug 19 '14 at 10:09
The OP knows that `boost::variant` is much more efficient in this case, but it requires quite some boilerplate code. That *is* in essence the OP's question: How to get the efficiency of the `boost::variant` approach, but with the elegance/simplicity of the virtual function approach. — dyp, Aug 19 '14 at 10:11
I'm interested in this because I have a similar problem, where I use a visitor on a hierarchy of objects with the same interface that derived from a base class. But a visitor implementation I was using also relies on a virtual member function call. The problem is that these virtual function calls are called in computationally intensive loops. But this approach may be much faster in that case. @dyp would you say that the boost::variant approach is as fast as just calling non-virtual function calls from a class that doesn't belong to any hierarchy due to the inlining? — aaragon, Aug 19 '14 at 10:40
You might want to try [this third version](http://pastebin.com/pxxNL8je). Unfortunately, it contains some minor UB (aliasing problem); but it's about as fast as the `boost::variant` solution while using virtual functions. — dyp, Aug 19 '14 at 10:40
As I've said in the comments to the OP, I think the main problem in the OP's "virtual function" approach are not the virtual function calls, but the indirection via `shared_ptr` (and the additional heap allocation, thread-safe `shared_ptr` operations etc.). The difference between the *ratios* of the unoptimized and optimized versions *could* be due to function call overhead, but that's purely speculative. — dyp, Aug 19 '14 at 10:43
I removed the `boost::shared_ptr` dependency and I stored only pointers to the base class (**UPDATE 2**). Indeed it works much better, but the `boost::variant` is still faster. So since we have the allocation in the heap issue, how can we really see quantitatively the comparison? We have to consider as well that in this case there are only 2 derived classed, in the case of having a much larger hierarchy the virtual function table will be much larger, and I don't know how compilers implement this but it would take much time if the search in this table for the virtual function call is linear. — aaragon, Aug 19 '14 at 10:53
Wait... there's *still* a bug in the OP's code: the switch-statement uses cases `1` and `2` instead of `0` and `1`... — dyp, Aug 19 '14 at 10:54
True, I will fix the bugs and I'll update the table results in a minute. — aaragon, Aug 19 '14 at 10:56
*"in the case of having a much larger hierarchy the virtual function table will be much larger"* The size of the vtable that is relevant here is only dependent on the number of virtual functions in `CBase`. *"but it would take much time if the search in this table for the virtual function call is linear."* It is typically an array, lookup in O(1). — dyp, Aug 19 '14 at 10:57
(Probably cheating now, but when using locals `C1 c1; C2 c2;` and a `vector vec;`, initialized with `vec[i] = rand() % 2 ? &c1 : &c2;`, it's even faster than `boost::variant`.) — dyp, Aug 19 '14 at 11:01
So then I really don't get it. The whole issue has nothing to do with the fact the the calls are virtual but because of the fact that objects are allocated in the heap? — aaragon, Aug 19 '14 at 11:09
@dyp I rerun all the tests and checked the timing results after fixing all bugs. The new results are given in the tables of the two last updates. As you can see, there's a slight difference that favors the `boost::variant` with respect to the raw pointer. But I don't know what to get as a conclusion from this little exercise, as I don't know if the difference can be attributed to the virtual function calls or to the fact that the objects are allocated in the heap. What's your final take on this issue? And if I had to make a choice for the problem I described, what would be your suggestion? — aaragon, Aug 19 '14 at 11:34
Results on my machine, clang++3.5, 10^8 rounds: `boost::variant` 2.2s, `shared_ptr` 10.4s, `foo` 1.5s, locals 1.17s -- Disclaimer: "Want performance? Measure [your actual problem]." My conclusion from these results is that using virtual functions or `boost::variant` doesn't make much difference. The `shared_ptr` solution is much slower, *maybe* due to additional heap allocations (one could try to find that out by using an arena allocator). As I suggested in the comments to the OP: It depends on what you want. You can try reduce boilerplate with virtual functions. Measure. — dyp, Aug 19 '14 at 12:19
@dyp look at **UPDATE 3** above, I couldn't get an advantage over `boost::variant` as you claim. Also, can you point out your 'locals' implementation? — aaragon, Aug 19 '14 at 12:22
This could be compiler-version and/or machine-dependent *shrug* — dyp, Aug 19 '14 at 12:24
@dyp Thanks! Well I updated my benchmark results and I added the final code I used for all of them in case people want to take a look. So far the fastest solution is the use of local variables, but I guess this is not practical at all, is it? I guess the only way we could use this approach would be by using `static` memory, as the variables are destroyed at the end of the scope of the function. — aaragon, Aug 19 '14 at 12:47
I have written a very simple stack-like allocator for the `std::shared_ptr` version and used `std::allocate_shared` to construct the shared pointers. The results are about as fast as the wrapper solutions; so I guess the most overhead really comes from the additional memory allocations. (code too ugly and buggy to publish) — dyp, Aug 19 '14 at 13:09
I added a last test for reference: when no considering any inheritance, so the comparison is made to function calls to objects of the same type `C`. This also blows my mind, the difference is way too much... =/ — aaragon, Aug 19 '14 at 15:05

Automatic definition of visitors from CRTP (CRTP using boost foreach with boost variant)

1 Answers1