
I am in the process of refactoring a C++ OpenGL app I made (technically, an app that makes heavy use of the thin OpenGL wrapper in Qt's QQuickItem class). My app performs acceptably, but could be better.

One of the issues I'm curious about relates to the use of virtual functions in very time-sensitive (frame-rate-critical) algorithms. My OpenGL drawing code calls many virtual functions on the various objects that need drawing. Since this happens many times per second, I am wondering whether the virtual dispatch could bring down the frame rate.

I am considering changing to the following structure instead, which avoids inheritance by keeping everything in one class: the previously virtual functions now simply contain switch statements that call the appropriate routine based on the object's "type", which is really just an enum:

Previously:

struct Base {
  virtual void a() = 0;
  virtual void b() = 0;
};

struct One : public Base {
  void a() { /* ... */ }
  void b() { /* ... */ }
};

Considering:

struct Combined {

  MyEnumTypeDef t; // essentially holds what "type" of object this is

  void a() {
    switch (t) {
      case One:
        // ...
        break;
      case Two:
        // ...
        break;
    }
  }
};

When a() is called very often in the OpenGL drawing routines, I am tempted to think that the Combined class will be considerably more efficient, since it does not require dynamic dispatch through virtual tables.

I'd appreciate some advice on this issue, if this is wise or not.

user207421
johnbakers

  • Don't speculate. Profile the code and then make your decisions based on the results. –  Apr 10 '14 at 00:46
  • @BleepBloop true but there are limited options for profiling functions in Qt Creator on Windows, unfortunately. – johnbakers Apr 10 '14 at 00:49
  • "Don't speculate; profile" is of course the best answer. The second-best answer is that a switch will almost certainly compile into code that is slower than a virtual function call (or, at best, the same speed). Modern CPUs have incorporated many tricks to make virtual function calls fast, or at least faster than they used to be, because modern OO languages tend to emit them a lot. – Nemo Apr 10 '14 at 01:28
  • @Nemo: I agree about the switch, as I also wrote that, but runtime speed optimization is not a side-effect concern for performance-critical code. It is a fundamental architectural and design decision that one aims for in such projects. It is too late to rewrite the whole software when you start benchmarking. – László Papp Apr 10 '14 at 01:55
  • `Modern CPUs have incorporated many tricks to make virtual function calls fast, or at least faster than they used to be, because modern OO languages tend to emit them a lot.` -> wait, what? You have hardware, and then someone invents a wrong programming mechanism for it; then you improve the hardware to accommodate that improper software mechanism? It feels like fishying the fishy stuff. Wouldn't it be outright better to fix the programming mechanism you use? Also, powerful desktop PCs are not the only market. There are consumer devices, like game consoles, embedded micro controllers, etc. – László Papp Apr 10 '14 at 02:03
  • By the way, this question has not much to do with OpenGL and Qt. Please consider removing those tags. It is a general graphics programming question. – László Papp Apr 10 '14 at 02:14
  • Do you have some *reason* to think a switch statement is faster than a virtual function call? – user207421 Apr 10 '14 at 02:45
  • @laszlopapp but your answer is in the context of graphics programming, yet you claim the tag is not relevant? – johnbakers Apr 10 '14 at 05:08
  • @OpenLearner: correct, graphics programming exists outside Qt and OpenGL, and in this case, it is not related to those. It is a generic question as far as I understand. – László Papp Apr 10 '14 at 05:10
  • @OpenLearner About how many times do you hit the code you're asking about per frame, average and worst case? Also what is your target frame rate and what frame rate are you actually observing (worst case)? – Jason C Apr 10 '14 at 14:21
  • @JasonC: you would also need to know the cache size and data to carry, lookahead buffer stuff, etc., albeit any spatially distributed 32 bits per item for large containers might eat up a "significant part" of the cache. You would also need to know how many intrinsics you need to use, whether you have too many overlapping pointers to "restrict" against, etc. – László Papp Apr 10 '14 at 15:34
  • @LaszloPapp Yes, of course, but every answer below is unfortunate speculation if the OP keeps his requirements a secret! For all we know he could already be hitting 4000 FPS (far higher than any display refresh rate) or he could be struggling to meet requirements but only calling the above a few hundred times per frame (in which case effort should be directed elsewhere). This question *really* should have been closed as "insufficient information". :) – Jason C Apr 10 '14 at 16:45
  • Yes, I agree that more information is needed for the actual solution. The best people can do in such situations is give some rules of thumb, with pros and cons, for certain design decisions. – László Papp Apr 10 '14 at 16:47
  • @OpenLearner [This](http://www.codeproject.com/Articles/7150/Member-Function-Pointers-and-the-Fastest-Possible) is precisely the article you want to read. It addresses the issues raised here exhaustively, and is an extremely well written and informative article (make sure you make some coffee first, though - it is not short). – Jason C Apr 11 '14 at 22:54

3 Answers

14

In your case, it probably does not matter. I say probably because - and I mean this constructively - the fact that you did not specify performance requirements, and did not specify how often the function in question is called, indicates that you may not have enough information to make a judgment yet. The blanket "don't speculate: profile" response really just aims to make sure you have all the necessary information, because premature micro-optimizations are very common, and our real goal is to help you out in the big picture.

Jeremy Friesner really hit the nail on the head with his comment on another answer here:

If you don't understand why it is slow you won't be able to speed it up.

So, with all that in mind, assuming that either A) your performance requirements are already being met (e.g. you are pulling 4000 FPS - far higher than any display refresh rate), or B) you are struggling to meet performance requirements and this function is only called a few (say < 1000-ish) times per frame, or C) you are struggling to meet performance requirements and this function is called often but does a lot of other significant work (and thus function call overhead is negligible), then:

Using a virtual function costs, at most, one extra lookup in a table somewhere (and possibly some cache misses - but not so much if it is accessed repeatedly in, say, an inner loop). That is a few CPU clock cycles worst case (and most likely still less than a switch, although that is really moot here), and it is completely insignificant compared to your target frame rate, the effort required to render a frame, and any other algorithms and logic you are performing. If you'd like to prove it to yourself, profile.

What you should do is use whatever technique leads to the clearest, cleanest, most maintainable and readable code. Micro-optimizations such as this are not going to have an effect, and the cost in code maintainability, even if it is minor, is not worth the benefit, which is essentially zero.

What you should also do is sit down and get a handle on your actual situation. Do you need to improve performance? Is this function called enough to actually have a significant impact or should you be concentrating on other techniques (e.g. higher level algorithms, other design strategies, off-loading computation to the GPU or using machine-specific optimizations e.g. bulk operations with SSE, etc.)?

One thing you could do, in the absence of concrete information, is try both approaches. While performance will differ from machine to machine, you could at least get a rough idea of the impact of this particular bit of code on your overall performance (e.g. if you're shooting for 60 FPS, and these two options give you 23.2 FPS vs. 23.6 FPS, then this isn't where you want to focus, and possible sacrifices made by choosing one of these strategies over the other may not be worth it).

Also consider using display lists, vertex index buffers, etc. OpenGL provides many facilities for optimizing the drawing of objects where certain aspects remain constant. For example, if you have a huge surface model with small parts whose vertex coordinates change often, divide the model into sections, each with its own display list, and rebuild only the list for a section that has changed since the last redraw. Leave e.g. coloring and texturing out of the display lists (or use coordinate arrays) if they change often. That way you can avoid calling your functions altogether.


If you're curious, here is a test program (which probably does not represent your actual usage; again, this is not possible to answer with the information given - this is the test requested in the comments below). It does not mean these results will be reflected in your program and, again, you need concrete information about your actual requirements. But here it is, just for giggles:

This test program compares a switch-based operation vs. a virtual-function-based operation vs. pointer-to-member (where the member is called from another class member function) vs. pointer-to-member (where the member is called directly from the test loop). It also performs three types of test: a run on a dataset with just one operator, a run that alternates back and forth between two operators, and a run that uses a random mix of two operators.

Output when compiled with g++ -O0, for 1,000,000,000 iterations:

$ g++ -O0 tester.cpp
$ ./a.out 
--------------------
Test: time=6.34 sec (switch add) [-358977076]
Test: time=6.44 sec (switch subtract) [358977076]
Test: time=6.96 sec (switch alternating) [-281087476]
Test: time=18.98 sec (switch mixed) [-314721196]
Test: time=6.11 sec (virtual add) [-358977076]
Test: time=6.19 sec (virtual subtract) [358977076]
Test: time=7.88 sec (virtual alternating) [-281087476]
Test: time=19.80 sec (virtual mixed) [-314721196]
Test: time=10.96 sec (ptm add) [-358977076]
Test: time=10.83 sec (ptm subtract) [358977076]
Test: time=12.53 sec (ptm alternating) [-281087476]
Test: time=24.24 sec (ptm mixed) [-314721196]
Test: time=6.94 sec (ptm add (direct)) [-358977076]
Test: time=6.89 sec (ptm subtract (direct)) [358977076]
Test: time=9.12 sec (ptm alternating (direct)) [-281087476]
Test: time=21.19 sec (ptm mixed (direct)) [-314721196]

Output when compiled with g++ -O3, for 1,000,000,000 iterations:

$ g++ -O3 tester.cpp ; ./a.out
--------------------
Test: time=0.87 sec (switch add) [372023620]
Test: time=1.28 sec (switch subtract) [-372023620]
Test: time=1.29 sec (switch alternating) [101645020]
Test: time=7.71 sec (switch mixed) [855607628]
Test: time=2.95 sec (virtual add) [372023620]
Test: time=2.95 sec (virtual subtract) [-372023620]
Test: time=14.74 sec (virtual alternating) [101645020]
Test: time=9.39 sec (virtual mixed) [855607628]
Test: time=4.20 sec (ptm add) [372023620]
Test: time=4.21 sec (ptm subtract) [-372023620]
Test: time=13.11 sec (ptm alternating) [101645020]
Test: time=9.32 sec (ptm mixed) [855607628]
Test: time=3.37 sec (ptm add (direct)) [372023620]
Test: time=3.37 sec (ptm subtract (direct)) [-372023620]
Test: time=13.08 sec (ptm alternating (direct)) [101645020]
Test: time=9.74 sec (ptm mixed (direct)) [855607628]

Note that -O3 does a lot, and without looking at the assembler, we cannot use this as a 100% accurate representation of the problem at hand.

In the unoptimized case, we notice:

  • Virtual outperforms switch in runs of a single operator.
  • Switch outperforms virtual in cases where multiple operators used.
  • Pointer-to-member when the member is called directly (object->*ptm_) is similar to, but slower than, virtual.
  • Pointer-to-member when the member is called through another method (object->doit() where doit() calls this->*ptm_) takes a little under twice the time.
  • As expected, "mixed" case performance suffers due to branch prediction failures.

In the optimized case:

  • Switch outperforms virtual in all cases.
  • Similar characteristics for pointer-to-member as unoptimized case.
  • All "alternating" cases that involve function pointers at some point are slower than with -O0 and slower than "mixed" for reasons I do not understand. This does not occur on my PC at home.

What is especially significant here is how much the effects of e.g. branch prediction outweigh any choice of "virtual" vs. "switch". Again, be sure you understand your code and are optimizing the right thing.

The other significant thing here is that this represents time differences on the order of 1-14 nanoseconds per operation. That difference could be significant for very large numbers of operations, but is likely negligible compared to everything else you are doing (note that these functions perform only a single arithmetic operation; anything more than that will quickly dwarf the effects of virtual vs. switch).

Note also that while calling the pointer-to-member directly showed an "improvement" over calling it through another class member, this has potentially large impacts on overall design. Such an implementation (at least in this case, where something outside the class calls the member directly) could not be dropped in as a direct replacement for another implementation, because the syntax for calling pointer-to-member functions differs (-> vs. ->*). I had to create a whole separate set of test cases just to handle that, for example.

Conclusion

The performance difference will easily be dwarfed by even a couple of extra arithmetic operations. Note also that branch prediction has a far more significant impact in all but the "virtual alternating" case with -O3. However, the test is also unlikely to be representative of the actual application (whose details the OP has not shared), and -O3 introduces even more variables, so the results must be taken with a grain of salt and are unlikely to be applicable to other scenarios (in other words, the test may be interesting, but it is not particularly meaningful).

Source:

// === begin timing ===
#ifdef __linux__
#  include <sys/time.h>
typedef struct timeval Time;
static void tick (Time &t) {
  gettimeofday(&t, 0);
}
static double delta (const Time &a, const Time &b) {
  return
    (double)(b.tv_sec - a.tv_sec) +
    (double)(b.tv_usec - a.tv_usec) / 1000000.0;
}
#else // windows; untested, working from memory; sorry for compile errors
#  include <windows.h>
typedef LARGE_INTEGER Time;
static void tick (Time &t) {
  QueryPerformanceCounter(&t);
}
static double delta (const Time &a, const Time &b) {
  LARGE_INTEGER freq;
  QueryPerformanceFrequency(&freq);
  return (double)(b.QuadPart - a.QuadPart) / (double)freq.QuadPart;
}
#endif
// === end timing

#include <cstdio>
#include <cstdlib>
#include <ctime>

using namespace std;

// Size of dataset.
static const size_t DATASET_SIZE = 10000000;

// Repetitions per test.
static const unsigned REPETITIONS = 100;


// Class performs operations with a switch statement.
class OperatorSwitch {
public:
  enum Op { Add, Subtract };
  explicit OperatorSwitch (Op op) : op_(op) { }
  int perform (int a, int b) const {
    switch (op_) {
    case Add: return a + b;
    case Subtract: return a - b;
    }
    return 0; // unreachable; avoids "control reaches end of non-void function"
  }
private:
  Op op_;
};


// Class performs operations with pointer-to-member.
class OperatorPTM {
public:
  enum Op { Add, Subtract };
  explicit OperatorPTM (Op op) {
    perform_ = (op == Add) ? 
      &OperatorPTM::performAdd :
      &OperatorPTM::performSubtract;
  }
  int perform (int a, int b) const { return (this->*perform_)(a, b); }
  int performAdd (int a, int b) const { return a + b; }
  int performSubtract (int a, int b) const { return a - b; }
  //private:
  int (OperatorPTM::*perform_) (int, int) const;
};


// Base class for virtual-function test operator.
class OperatorBase {
public:
  virtual ~OperatorBase () { }
  virtual int perform (int a, int b) const = 0;
};

// Addition
class OperatorAdd : public OperatorBase {
public:
  int perform (int a, int b) const { return a + b; }
};

// Subtraction
class OperatorSubtract : public OperatorBase {
public:
  int perform (int a, int b) const { return a - b; }
};


// No base

// Addition
class OperatorAddNoBase {
public:
  int perform (int a, int b) const { return a + b; }
};

// Subtraction
class OperatorSubtractNoBase {
public:
  int perform (int a, int b) const { return a - b; }
};



// Processes the dataset a number of times, using 'oper'.
template <typename T>
static void test (const int *dataset, const T *oper, const char *name) {

  int result = 0;
  Time start, stop;

  tick(start);

  for (unsigned n = 0; n < REPETITIONS; ++ n)
    for (size_t i = 0; i < DATASET_SIZE; ++ i)
      result = oper->perform(result, dataset[i]);

  tick(stop);

  // result is computed and printed so optimizations do not discard it.
  printf("Test: time=%.2f sec (%s) [%i]\n", delta(start, stop), name, result);
  fflush(stdout);

}


// Processes the dataset a number of times, alternating between 'oper[0]'
// and 'oper[1]' per element.
template <typename T>
static void testalt (const int *dataset, const T * const *oper, const char *name) {

  int result = 0;
  Time start, stop;

  tick(start);

  for (unsigned n = 0; n < REPETITIONS; ++ n)
    for (size_t i = 0; i < DATASET_SIZE; ++ i)
      result = oper[i&1]->perform(result, dataset[i]);

  tick(stop);

  // result is computed and printed so optimizations do not discard it.
  printf("Test: time=%.2f sec (%s) [%i]\n", delta(start, stop), name, result);
  fflush(stdout);

}


// Processes the dataset a number of times, choosing between 'oper[0]'
// and 'oper[1]' randomly (based on value in dataset).
template <typename T>
static void testmix (const int *dataset, const T * const *oper, const char *name) {

  int result = 0;
  Time start, stop;

  tick(start);

  for (unsigned n = 0; n < REPETITIONS; ++ n)
    for (size_t i = 0; i < DATASET_SIZE; ++ i) {
      int d = dataset[i];
      result = oper[d&1]->perform(result, d);
    }

  tick(stop);

  // result is computed and printed so optimizations do not discard it.
  printf("Test: time=%.2f sec (%s) [%i]\n", delta(start, stop), name, result);
  fflush(stdout);

}


// Same as test() but calls perform_() pointer directly.
static void test_ptm (const int *dataset, const OperatorPTM *oper, const char *name) {

  int result = 0;
  Time start, stop;

  tick(start);

  for (unsigned n = 0; n < REPETITIONS; ++ n)
    for (size_t i = 0; i < DATASET_SIZE; ++ i)
      result = (oper->*(oper->perform_))(result, dataset[i]);

  tick(stop);

  // result is computed and printed so optimizations do not discard it.
  printf("Test: time=%.2f sec (%s) [%i]\n", delta(start, stop), name, result);
  fflush(stdout);

}


// Same as testalt() but calls perform_() pointer directly.
static void testalt_ptm (const int *dataset, const OperatorPTM * const *oper, const char *name) {

  int result = 0;
  Time start, stop;

  tick(start);

  for (unsigned n = 0; n < REPETITIONS; ++ n)
    for (size_t i = 0; i < DATASET_SIZE; ++ i) {
      const OperatorPTM *op = oper[i&1];
      result = (op->*(op->perform_))(result, dataset[i]);
    }

  tick(stop);

  // result is computed and printed so optimizations do not discard it.
  printf("Test: time=%.2f sec (%s) [%i]\n", delta(start, stop), name, result);
  fflush(stdout);

}


// Same as testmix() but calls perform_() pointer directly.
static void testmix_ptm (const int *dataset, const OperatorPTM * const *oper, const char *name) {

  int result = 0;
  Time start, stop;

  tick(start);

  for (unsigned n = 0; n < REPETITIONS; ++ n)
    for (size_t i = 0; i < DATASET_SIZE; ++ i) {
      int d = dataset[i];
      const OperatorPTM *op = oper[d&1];
      result = (op->*(op->perform_))(result, d);
    }

  tick(stop);

  // result is computed and printed so optimizations do not discard it.
  printf("Test: time=%.2f sec (%s) [%i]\n", delta(start, stop), name, result);
  fflush(stdout);

}


int main () {

  int *dataset = new int[DATASET_SIZE];
  srand(time(NULL));
  for (size_t n = 0; n < DATASET_SIZE; ++ n)
    dataset[n] = rand();

  OperatorSwitch *switchAdd = new OperatorSwitch(OperatorSwitch::Add);
  OperatorSwitch *switchSub = new OperatorSwitch(OperatorSwitch::Subtract);
  OperatorSwitch *switchAlt[2] = { switchAdd, switchSub };
  OperatorBase *virtAdd = new OperatorAdd();
  OperatorBase *virtSub = new OperatorSubtract();
  OperatorBase *virtAlt[2] = { virtAdd, virtSub };
  OperatorPTM *ptmAdd = new OperatorPTM(OperatorPTM::Add);
  OperatorPTM *ptmSub = new OperatorPTM(OperatorPTM::Subtract);
  OperatorPTM *ptmAlt[2] = { ptmAdd, ptmSub };

  while (true) {
    printf("--------------------\n");
    test(dataset, switchAdd, "switch add");
    test(dataset, switchSub, "switch subtract");
    testalt(dataset, switchAlt, "switch alternating");
    testmix(dataset, switchAlt, "switch mixed");
    test(dataset, virtAdd, "virtual add");
    test(dataset, virtSub, "virtual subtract");
    testalt(dataset, virtAlt, "virtual alternating");
    testmix(dataset, virtAlt, "virtual mixed");
    test(dataset, ptmAdd, "ptm add");
    test(dataset, ptmSub, "ptm subtract");
    testalt(dataset, ptmAlt, "ptm alternating");
    testmix(dataset, ptmAlt, "ptm mixed");
    test_ptm(dataset, ptmAdd, "ptm add (direct)");
    test_ptm(dataset, ptmSub, "ptm subtract (direct)");
    testalt_ptm(dataset, ptmAlt, "ptm alternating (direct)");
    testmix_ptm(dataset, ptmAlt, "ptm mixed (direct)");
  }

}
Jason C
  • Virtual functions are not about extra cycles, but cache hits. You get an extra pointer to go through the cache, which can basically break the cache usage significantly. This is not a micro-optimization, but a fundamental architectural decision for graphics code. Try out what I suggested in my answer. Do any dummy pixel operation for millions of pixels per frame with virtual functions, and you will see that the cache will go haywire, causing a significant performance overhead. – László Papp Apr 10 '14 at 01:31
  • It is too bad that the OP did not specify how often these are called per frame, and also did not indicate whether or not his current performance requirements were not being met. – Jason C Apr 10 '14 at 01:35
  • That is why I wrote "If it is 10, it is probably acceptable, but if it is 10 K, it would get more critical.". – László Papp Apr 10 '14 at 01:37
  • @LaszloPapp Added test as described by your comment, with 1,000,000,000 operations per test. Note that time difference is on order of 1-14 nanoseconds, which will easily be dwarfed by even a couple extra arithmetic operations. Note also that branch prediction has a far more significant impact in all but the "virtual alternating" case with `-O3`. – Jason C Apr 10 '14 at 18:19
  • The problem is that your test is faulty because it compares switch to virtual, which I initially said would not make a huge difference, and then others did too. There are alternative tricks to switch, eventually mentioned in my post, hence you are comparing apples to oranges here. ;-) You are **still** comparing brute-force runtime vs. runtime, and not brute-force runtime vs. more clever tricks. Your test work is good for seeing some benchmark of switch vs. virtual, but in my understanding that was not much of a concern here for people, except perhaps the OP. Oh, and it is a PC test, not embedded. – László Papp Apr 10 '14 at 18:45
  • @LaszloPapp Hm, I'm confused; the OP was asking about switch vs. virtual and also did not imply that he was using an embedded device. CRTP is a difficult pattern to test because any use of a common base introduces virtual functions back into the picture, so it becomes the same as the virtual case for the alternating/mixed cases. Can you propose a 3rd test? If you are saying that higher level alternatives should be explored (e.g. sorting by type, etc.) then I believe we are saying the same thing, as that is my main point. :) – Jason C Apr 10 '14 at 19:06
  • @LaszloPapp For your template + sort suggestion, are you proposing to run through every pixel ahead of time, find out which operation is applied to that pixel, group pixels by operation, *then* perform the operations in like groups? If the operations change per frame that would have to be done per frame. That's a lot of overhead, both in memory and in CPU, plus a lot of extra work required if new types are added. Note also that the OP pretty clearly implied that these were per scene object operations, not per pixel. A better approach would be e.g. GL call lists for objects that don't change. – Jason C Apr 10 '14 at 19:17
  • 1) Performance really matters on embedded devices, including DSP, GPGPU etc. Desktop CPU architectures have _not_ been meant for such cases, and that is why you do not find them in game consoles. 2) I have never said CRTP... 3) It is not about higher-level alternatives, but different alternatives. 4) Err... there is not even nearly as much extra work in non-runtime decisions, or in runtime decisions such as LUT, intrinsics, VLIW architecture, etc. 5) Err... you can actually have as many screen objects as pixels in the worst case... – László Papp Apr 10 '14 at 19:21
  • @LaszloPapp 1) Nobody is arguing that performance doesn't matter on embedded devices. But this question has nothing to do with embedded devices! 2/3) Different alternatives here *are* higher level; higher than what the OP is asking about, and generally requiring significant redesign. 4) That's true, but also the OPs case involves types that *are not known at compile time*, hence the switch. 5) Also your assertion that you can have as many screen objects as you want is correct but isn't really a meaningful basis for an answer given that we have no information from the OP about what is going on. – Jason C Apr 10 '14 at 19:24
  • I think perhaps you are trying to come up with a general rule of thumb for a situation that just cannot be generalized. Your points are very good and valid in certain situations, and not in others. A *lot* of assumptions are made in your answer, and if the OP takes it as *general* advice, it will not always be beneficial. (This is also a strong case for benchmark -> optimize.) – Jason C Apr 10 '14 at 19:27
  • 1) You cannot really avoid discussing embedded and performance when it is mostly critical in games, etc. You are yet to show a PC-CPU-architecture-based console. 2) Again, I have never said CRTP. 3) No, it is not higher-level. It is a different alternative. Templates, LUTs, etc. are definitely not "higher-level" than polymorphism. In fact, one could say intrinsics are even lower-level. 4) LUT, intrinsics, VLIW, and I could continue... it is not about compile time at all. Polymorphism is _really_ not the only way to write runtime code. – László Papp Apr 10 '14 at 19:33
  • 5) It is totally meaningful if you design the algorithm for worst case complexity here, which is oftentimes the case anyhow. You cannot tell the game player, oopsie, I have to load 10 or 100K particles now to have a cool effect, but oopsie, it will be very slow, sorry; or I could mention real-time flight simulation, collision detection in automotive, and what not, where human lives may rely on the feature. – László Papp Apr 10 '14 at 19:34
  • 1) You can avoid discussing embedded systems entirely when you are talking about developing on a PC. I'm *really* not sure where this assertion is coming from. You set your performance testing against minimum system requirements that you define. You don't try and write code that will meet performance requirements when run on a 50MHz processor. Technology advances for a reason, and not taking advantage of it doesn't make sense. – Jason C Apr 10 '14 at 19:37
  • 5) You design the algorithm for worst case complexity *in your application*. If your program *never* loads 100K particles, then you don't have to make it work for that case. "Worst case" is well-defined, it's not an infinite lower bounds. – Jason C Apr 10 '14 at 19:38
  • @LaszloPapp [Here are the minimum system requirements for Bioshock Infinite](http://www.ign.com/wikis/bioshock-infinite/PC_System_Requirements). Challenge: Convince the development team that they can make their software work well on a machine from 10 years ago *and* release a good product in a timely manner at the same time. – Jason C Apr 10 '14 at 19:40
  • @LaszloPapp You are talking about "embedded devices", which is what I am referring to, and perhaps we have a different definition of "embedded". When I refer to "embedded" I am referring to extremely resource constrained systems, e.g. microcontrollers and miniature computers. I am not referring to, e.g., an [1.75GHz 8-core, 8GB RAM, 853 MHz GPU](http://www.ign.com/wikis/xbox-one/Xbox_One_Hardware_Specs) XBox One. Most gaming consoles from the last 5 years outperform even the best of desktop PC's in terms of graphics. If you call these "embedded", it's not really a constraint relative to PC... – Jason C Apr 10 '14 at 19:49
  • I love tests like this. I also proposed another in the other comments. This is the accepted answer now, thanks. – johnbakers Apr 10 '14 at 23:53
  • @OpenLearner Haven't done pointer-to-member tests yet but just some more data points: Moving the classes to another compilation unit had minimal effect (that is, inlining is not a significant factor here), and when I tested on my PC at home, "virtual alternating" with -O3 wasn't a weird outlier. Still can't explain that. GCC 4.6 vs 4.8 produced the same results. I wonder if it's hardware related. – Jason C Apr 11 '14 at 16:24
  • @OpenLearner There you go; pointer-to-member seems to have a bit more overhead compared to the virtual case (and more if you try to keep the class interface the same instead of calling the pointer-to-member directly from the test loop). Again, need to stress these results are not necessarily meaningful in general; just interesting (I think). – Jason C Apr 11 '14 at 17:05
  • Unfortunately, your test is fundamentally flawed: you use only two different functions. This *is* important. Once you use something like five functions, `switch` will be much slower than virtual dispatch in all cases. `switch` is generally of complexity O(n) with n being the number of cases, while virtual dispatch is O(1). – cmaster - reinstate monica Apr 11 '14 at 20:41
  • @cmaster That's a valid concern. I will increase the complexity when I get home from work. It depends on what the compiler does with the code though. A jump table or a computed offset (by multiplication with value) would not have linear complexity for example. Compilers are generally good about recognizing and optimizing patterns in switch blocks. – Jason C Apr 11 '14 at 20:43
  • Interesting. So that article that suggested ptm instead of virtual for performance critical code was perhaps incorrect. I'm quite surprised that ptm direct actually takes longer than virtual which essentially is ptm + vtable lookup. – johnbakers Apr 12 '14 at 00:35
2

The model of "lots of objects that draw themselves" is attractive, but bad in a sneaky way. It's not the virtual function call overhead (which exists, but is small); the problem is that it encourages a rendering anti-pattern: letting every object draw itself in isolation. It sounds like one of those things touted as "software engineering best practice", but it's not; it's very bad, because every object ends up making a lot of expensive API calls (such as binding shaders and textures). Now, I don't really know what your code looks like; maybe it doesn't work like this. The objects aren't necessarily bad in themselves - it's how they're used.

Anyway, here are some suggestions.

Sort your objects by the state (shader, texture, vertex buffer, in that order) they want (actually, don't sort; put them in buckets and iterate over those). This is easy, everyone does it, and it may be enough.

Merge states, so there's nothing to switch between. Use übershaders. Use texture arrays, or better, bindless textures (which don't have the restriction that all slices must be the same format/size/etc.). Use one huge vertex buffer that you put everything into. Use uniform buffers. Use persistent mapping for the dynamic buffers.

And finally, glMultiDrawElementsIndirect. If you've put everything into buffers anyway, as per the previous suggestions, then you will need only very few calls to glMultiDrawElementsIndirect; you can do a lot with just one. What you'd probably use otherwise is a bunch of glDrawArrays calls with no binding in between, which isn't bad either, but it's not much effort to make it even better.

The end result is that the actual drawing code has almost disappeared. Almost all API calls are gone, replaced by writes to buffers.

harold
  • 61,398
  • 6
  • 86
  • 164
1

It would be faster to not use virtual functions, but whether the difference is significant is hard to say. You should run your program through a profiler to see where it spends its time. You might discover that the CPU cycles are spent on something entirely different, and you'd be wasting your time (and degrading your design) by messing with the virtuals.

Unix: How can I profile C++ code running in Linux? Windows: What's the best free C++ profiler for Windows?

Another option to consider is the Curiously Recurring Template Pattern (http://en.wikipedia.org/wiki/Curiously_recurring_template_pattern), which gives you similar polymorphism without using virtuals.

Community
  • 1
  • 1
user823981
  • 102
  • 6
  • true but there are limited options for profiling functions in Qt Creator on Windows, unfortunately. – johnbakers Apr 10 '14 at 00:49
  • note that the CRTP does not allow heterogeneous collections which are a common use of polymorphism with virtual functions so if you need those, templates might not be the best solution – Jerome Baldridge Apr 10 '14 at 01:17
  • 1
    @JXB: actually, you could consider sorting by type to avoid runtime polymorphism. See this, too: https://github.com/brandonpelfrey/SimpleOctree/blob/master/Octree.h – László Papp Apr 10 '14 at 02:08
  • It would be faster why? – user207421 Apr 10 '14 at 02:46
  • 3
    "It would be faster to not use virtual functions" is simply false, especially if the alternative is a `switch` statement. – Nemo Apr 10 '14 at 02:56
  • Limited options or not, profiling is the thing to do, otherwise you'll end up chasing superstitions rather than improving performance. – Jeremy Friesner Apr 10 '14 at 03:58
  • @JeremyFriesner: it is way, way, and _way_ too late to rewrite and redesign the software once you realize, while benchmarking, that your architecture and design are wrong. People in my environment have experienced the negative outcome of your thinking, and they learned from the mistake. The whole point of the question is performance; otherwise the question would not make any sense. – László Papp Apr 10 '14 at 04:39
  • 1
    @LaszloPapp I never said anything about rewriting or redesigning. The point of profiling is to understand what your program is doing that is slow. If you don't understand why it is slow you won't be able to speed it up. – Jeremy Friesner Apr 10 '14 at 14:38
  • @JeremyFriesner: I agree with you that you may be able to profile things that can be fixed in a retrofit style later, but the major issues should be addressed upfront in the design and architecture. In other words, performance is not just a side-trait, but a driving factor. If you design an architecture on the principle of "ah, it might be slow and need to be rewritten, but we do not know yet whether that is true", you might end up with a lot of cost for your company and project. What I am trying to get across is that there are things that can be foreseen and planned for in advance. – László Papp Apr 10 '14 at 14:41
  • @LaszloPapp That's fine and I don't disagree that performance ought to be taken into account at the design stage... but knowing that doesn't help when you are sitting in front of an already-written program that you now want to make faster. Whether you designed it for performance or not, you now have to profile it to find out why it is slow. Otherwise you end up "fixing" things that weren't broken, and leaving things broken that needed to be fixed, and typically doing more harm than good. – Jeremy Friesner Apr 10 '14 at 15:25
  • @JeremyFriesner: refactoring means redesigning the system so that it performs well. There are things that you do not have to measure. See my example given for several million operations per frame. There is really nothing to measure there, because even a single extra call will be made millions of times and will inherently cause a lot of cache misses. Virtual dispatch and runtime polymorphism are such things. They do not have to be measured because their consequences are quite evident, as outlined in my answer. – László Papp Apr 10 '14 at 15:31
  • @nemo Virtual calls can incur a performance hit, not just because of the vtbl access but more because compilers have a harder time optimizing virtual function calls (virtuals are harder to inline, branch prediction is more difficult, caching might be an issue as the vtbl ptr itself adds 8 bytes to the size of a (maybe small) object, etc). So I was just trying to state the obvious, that strictly speaking a virtual call is more expensive than a regular call; it's only a question of how it compares to alternatives. I agree the cost of virtual calls often isn't significant, hence I recommend profiling. – user823981 Apr 10 '14 at 16:04
  • 1
    @Nemo Switches can be faster than virtual function calls. When using Profile guided optimisations the VC++ compiler has been adding switch statements before virtual calls since 2005. They call it Virtual Call Speculation and it can be faster because the cpu can predict the switch but not the virtual call. http://blogs.msdn.com/b/vcblog/archive/2008/11/12/pogo.aspx – user1937198 Apr 10 '14 at 17:31
  • @OpenLearner See test results in my answer. – Jason C Apr 10 '14 at 18:19
  • 2
    @user1937198: Intel CPUs have had "indirect branch predictors" since at least 2003 (http://stackoverflow.com/a/7263886/768469), and they get better with every generation. So I stand by my claim: Converting virtual calls to switch statements as the OP here is suggesting will almost certainly result in _slower_ code; or at best, no faster. – Nemo Apr 11 '14 at 01:00