7

Before you cringe at the duplicate title, the other question wasn't suited to what I ask here (IMO). So.

I really want to use virtual functions in my application to make things a hundred times easier (isn't that what OOP is all about? ;)). But I read somewhere that they come at a performance cost. Seeing nothing but the same old contrived hype about premature optimization, I decided to give it a quick whirl in a small benchmark test using:

CProfiler.cpp

#include "CProfiler.h"

CProfiler::CProfiler(void (*func)(void), unsigned int iterations) {
    gettimeofday(&a, 0);
    for (;iterations > 0; iterations --) {
        func();
    }
    gettimeofday(&b, 0);
    result = (b.tv_sec * (unsigned int)1e6 + b.tv_usec) - (a.tv_sec * (unsigned int)1e6 + a.tv_usec);
};

main.cpp

#include "CProfiler.h"

#include <iostream>

class CC {
  protected:
    int width, height, area;
  };

class VCC {
  protected:
    int width, height, area;
  public:
    virtual void set_area () {}
  };

class CS: public CC {
  public:
    void set_area () { area = width * height; }
  };

class VCS: public VCC {
  public:
    void set_area () {  area = width * height; }
  };

void profileNonVirtual() {
    CS *abc = new CS;
    abc->set_area();
    delete abc;
}

void profileVirtual() {
    VCS *abc = new VCS;
    abc->set_area();
    delete abc;
}

int main() {
    int iterations = 5000;
    CProfiler prf2(&profileNonVirtual, iterations);
    CProfiler prf(&profileVirtual, iterations);

    std::cout << prf.result;
    std::cout << "\n";
    std::cout << prf2.result;

    return 0;
}

At first I only did 100 and 10,000 iterations, and the results were worrying: 4 ms for the non-virtualised version, and 250 ms for the virtualised one! I almost went "nooooooo" inside, but then I upped the iterations to around 500,000, and the results became almost completely identical (maybe 5% slower without optimization flags enabled).

My question is: why was there such a significant difference with a low number of iterations compared to a high number? Was it purely because the virtual functions were hot in the cache at that many iterations?

Disclaimer
I understand that my 'profiling' code is not perfect, but, as it stands, it gives an estimate of things, which is all that matters at this level. Also, I am asking these questions to learn, not solely to optimize my application.

hiddensunset4
"(maybe 5% slower without optimization flags enabled)" -- this implies that you are profiling a debug/non-optimized build. Doing so is going to yield a benchmark that is usually far too flawed to be useful. Is this the case? –  Feb 02 '11 at 08:01
  • It was done on Ubuntu 10.10, using g++, with _and_ without optimization flags. – hiddensunset4 Feb 02 '11 at 08:07
  • Surely you're compiling with optimizations enabled to do benchmark testing, right? – Cody Gray - on strike Feb 02 '11 at 08:13
    @Daniel you should remove new/delete; preallocate objects outside profiling loop, make true virtual call (i.e. ptr2base->set_area()), then run enough iterations of the test (a few seconds, at least) then add the new results to your question with appropriate comments:) – user396672 Feb 02 '11 at 12:10
  • I've already done these tests, they are in the below comments, but I will add the results soon :). – hiddensunset4 Feb 04 '11 at 06:16
  • yes, the 5% difference might be because allocation takes roughly 80% of function time, so the remaining 20% is actual function call; so the 5% difference might actually be 25% – lurscher Jul 17 '11 at 14:54
  • also, you are calling VCS directly, instead of accessing through the virtual pointer table of VCC – lurscher Jul 17 '11 at 15:01
  • Ah true! We should have realized that haha. – hiddensunset4 Jul 17 '11 at 21:42
  • Generally, start at [JTC1.22.18015 ISO/IEC TR 18015 - C++ Performance](http://www.open-std.org/jtc1/sc22/wg21/docs/TR18015.pdf) – Cheers and hth. - Alf Feb 02 '11 at 08:07

8 Answers

10

I believe that your test case is too artificial to be of any great value.

First, inside your profiled function you dynamically allocate and deallocate an object as well as call a function; if you want to profile just the function call, then you should do just that.

Second, you are not profiling a case where a virtual function call represents a viable alternative to a given problem. A virtual function call provides dynamic dispatch. You should try profiling a case such as where a virtual function call is used as an alternative to something using a switch-on-type anti-pattern.

CB Bailey
  • So you are saying I should try the same tests on a more complex (virtualised) inheritance structure? It also must be noted this is not my actual application, it is purely a test of functionality. I have no aversion to trying different examples. – hiddensunset4 Feb 02 '11 at 08:13
    +1: Note. Just want to put extra stressing the just do the call bit. Do not even put the object on the stack as then you are also timing the cost of construction of the object each time (create it once). – Martin York Feb 02 '11 at 08:20
  • +1: (If I could). Comparing apples and oranges is not useful. Compare virtual dispatch against the alternative (making a call based on the type of the object decided at runtime). – Martin York Feb 02 '11 at 08:21
  • I believe that Charles is telling you to write code with and without virtual methods in a case where a dynamic dispatch problem is solved. – daramarak Feb 02 '11 at 08:22
  • Similar to the following? http://www.pastie.org/1520810 The results were similar enough, see my comment on "Daniel Gehriger"s response below. – hiddensunset4 Feb 02 '11 at 08:27
  • I also find it funny no one has pointed out the undefined behavior of using set_area without variable initializations yet :P – hiddensunset4 Feb 02 '11 at 08:29
    No, not really: you are not accomplishing dynamic dispatch with this code. Dynamic dispatch allows different functions to be called for different classes. You need inheritance on both sides, plus type checking and acting upon the type on the non-virtual side. – daramarak Feb 02 '11 at 08:35
  • Would you be able to provide an example of that? If it doesn't completely take up your time :) – hiddensunset4 Feb 02 '11 at 09:18
5

Extending Charles' answer.

The problem here is that your loop is doing more than just testing the virtual call itself (the memory allocation probably dwarfs the virtual call overhead anyway), so his suggestion is to change the code so that only the virtual call is tested.

Here the benchmark function is a template, because a template may be inlined while calls through function pointers are unlikely to be.

// Includes needed by the snippets below.
#include <cstdlib>
#include <iostream>
#include <memory>
#include <sys/time.h>

template <typename Type>
double benchmark(Type const& t, size_t iterations)
{
  timeval a, b;
  gettimeofday(&a, 0);
  for (;iterations > 0; --iterations) {
    t.getArea();
  }
  gettimeofday(&b, 0);
  return (b.tv_sec * (unsigned int)1e6 + b.tv_usec) -
         (a.tv_sec * (unsigned int)1e6 + a.tv_usec);
}

Classes:

struct Regular
{
  Regular(size_t w, size_t h): _width(w), _height(h) {}

  size_t getArea() const;

  size_t _width;
  size_t _height;
};

// The following line in another translation unit
// to avoid inlining
size_t Regular::getArea() const { return _width * _height; }

struct Base
{
  Base(size_t w, size_t h): _width(w), _height(h) {}

  virtual size_t getArea() const = 0;

  size_t _width;
  size_t _height;
};

struct Derived: Base
{
  Derived(size_t w, size_t h): Base(w, h) {}

  virtual size_t getArea() const;
};

// The following two functions in another translation unit
// to avoid inlining
size_t Derived::getArea() const  { return _width * _height; }

std::auto_ptr<Base> generateDerived()
{
  return std::auto_ptr<Base>(new Derived(3,7));
}

And the measuring:

int main(int argc, char* argv[])
{
  if (argc != 2) {
    std::cerr << "Usage: %prog iterations\n";
    return 1;
  }

  Regular regular(3, 7);
  std::auto_ptr<Base> derived = generateDerived();

  double regTime = benchmark<Regular>(regular, atoi(argv[1]));
  double derTime = benchmark<Base>(*derived, atoi(argv[1]));

  std::cout << "Regular: " << regTime << "\nDerived: " << derTime << "\n";

  return 0;
}

Note: this tests the overhead of a virtual call in comparison to a regular function. The functionality is different (since you do not have runtime dispatch in the second case), but it's therefore a worst-case overhead.

EDIT:

Results of the run (gcc 3.4.2, -O2, SLES10 quad-core server). Note: the function definitions are in another translation unit, to prevent inlining.

> ./test 5000000
Regular: 17041
Derived: 17194

Not really convincing.

Matthieu M.
    I appreciate the answer; but the results make no sense, and it took 5 minutes just to remove syntax errors. Your code (fixed) here http://www.pastie.org/1520895, let me know if the needed changes were breaking or not. The results were Regular: 1 Derived: 12548 for 5,000,000 iterations. – hiddensunset4 Feb 02 '11 at 09:13
  • @Daniel: I've backported the 3 fixes, thanks. Your results seem strange (to say the least), I'll see if I can find some time to run this. – Matthieu M. Feb 02 '11 at 10:13
    @Daniel: I've taken the time to actually compile and run the code. Taking care, as I had noted in the comments, to put the functions definitions in another .cpp file. I've posted the results, and as you can see... there is not much difference, at all. The explanation is likely that caching works on my machine. – Matthieu M. Feb 02 '11 at 10:27
  • I'm assuming 'another translation unit' means another file? Sorry :) Also, all things said and done, the results look identical to my results from the OP. :P – hiddensunset4 Feb 02 '11 at 10:31
  • @Daniel: yes, another ".cpp". Otherwise the compiler might simply forgo the call to the function and replace it "in place" with its body (this is called inlining, very similar to a macro but with type safety). A compiler may also detect that a `Derived` is used, even though the call emanate from a reference to `Base`, and once again use a direct call (rather than a virtual one) or even inline the call. By using another source file we hide this information from it, and unless it uses Link-Time Optimization, it cannot actually optimize those calls out. – Matthieu M. Feb 02 '11 at 10:34
  • @Daniel: indeed they look similar, but I had less variance with less iterations. From 1 to 5,000,000 it yields nearly identical values each time. – Matthieu M. Feb 02 '11 at 10:37
  • To save me re-setting up the code, what are your results for very low iterations (on the order of 1,000)? – hiddensunset4 Feb 02 '11 at 10:50
  • @Daniel: `> ./test 1000` yields `Regular: 5`, `Derived: 5` indistinguishable. – Matthieu M. Feb 02 '11 at 11:01
  • Thanks Matthew, for giving the best answer :) – hiddensunset4 Feb 02 '11 at 11:08
  • @Daniel: Actually I do believe that Charles' is better, I merely offered a "code-oriented" view of his. – Matthieu M. Feb 02 '11 at 13:04
  • @Matthieu M.: I saw a remark in some other question that moving to .cpp doesn't always prevent inlining - in MSVC there is Link-time Code Generation, and I suspect that gcc could do something similar too. Compilers are already crazy enough to do inlines at the link stage... The only real way to avoid this - use a dynamic library >_ – Roman L Feb 02 '11 at 17:15
  • @7vies: yes, they can. It is (in general) called Link-Time Optimization. Clang enables it in `-O4`, I don't know for MSVC or gcc which flags are controlling it. – Matthieu M. Feb 02 '11 at 17:42
  • @Matthieu M.: MSVC doesn't seem to call it that way, so it is not "in general" but "in not-MSVC-general" :) Anyway, I just commented on your remark "with the functions definitions in another translation unit, to prevent inlining" - moving to .cpp might not be enough to prevent inlining. – Roman L Feb 02 '11 at 17:49
  • @7vies: I have understood, I just prefer not to mention it in the response because things get hairy afterward... After all you could imagine a "smart" library loader that would inline calls too ! Link-Time Optimization refer to the concept, I must admit I don't know the specific branding that each compiler used to promote itself. – Matthieu M. Feb 02 '11 at 17:57
  • I understand that you were merely extending Charles' answer, but you presented your answer (incl. code) in a way that I better understood, and hence it was clearer; even though it was meant to be merely a complement to Charles' answer. – hiddensunset4 Feb 02 '11 at 21:30
2

With a small number of iterations there's a chance that your code is preempted by some other program running in parallel, or that swapping occurs, or that anything else the operating system normally isolates your program from happens, and the time your program was suspended by the operating system gets included in your benchmark results. This is the number one reason why you should run your code something like a dozen million times to measure anything more or less reliably.

sharptooth
    Yet those results were consistent, I didn't exactly just try it 'once'. – hiddensunset4 Feb 02 '11 at 08:05
  • @Daniel: It can be whatever thing you don't control. Trying to meaningfully analyze it is a waste of time. You'll be much better off if you treat it as noise. – sharptooth Feb 02 '11 at 08:05
2

There is a performance impact to calling a virtual function, because it does slightly more than calling a regular function. However, the impact is likely to be completely negligible in a real-world application -- even more negligible than it appears in even the most finely crafted benchmark.

In a real-world application, the alternative to a virtual function is usually going to involve hand-writing some similar system anyhow, because the behavior of calling a virtual function differs from calling a non-virtual one -- the former changes based on the runtime type of the invoking object. Your benchmark, even disregarding whatever flaws it has, doesn't measure equivalent behavior, only equivalent-ish syntax. If you were to institute a coding policy banning virtual functions, you'd either have to write some potentially very roundabout or confusing code (which might be slower), or re-implement a similar kind of runtime dispatch system to the one the compiler uses to implement virtual function behavior (which is certainly going to be no faster than what the compiler does, in most cases).

  • Thanks for the answer, but the question wasn't whether there was a cost, but why the difference at low iterations vs high. A *consistent* difference. – hiddensunset4 Feb 02 '11 at 08:37
1

I think this kind of testing is pretty useless, in fact:
1) you are wasting time on the profiling itself by invoking gettimeofday();
2) you are not really testing virtual functions, and IMHO this is the worst part.

Why? Because you use virtual functions to avoid writing things such as:

<pseudocode>
switch (typeof(object)) {
  case ClassA: functionA(object); break;
  case ClassB: functionB(object); break;
  case ClassC: functionC(object); break;
}
</pseudocode>

In your benchmark, this "if... else"-style block is missing, so you don't really get the advantage of virtual functions. This is a scenario where they always "lose" against non-virtual functions.

To do a proper profiling, I think you should add something like the code I've posted.

Simone
1

There could be several reasons for the difference in time.

  • your timing function isn't precise enough
  • the heap manager may influence the result, because sizeof(VCS) > sizeof(CS). What happens if you move the new / delete out of the loop?

  • Again, due to size differences, memory cache may indeed be part of the difference in time.

BUT: you should really compare similar functionality. When using virtual functions, you do so for a reason, which is calling a different member function dependent on the object's identity. If you need this functionality, and don't want to use virtual functions, you would have to implement it manually, be it using a function table or even a switch statement. This comes at a cost, too, and that's what you should compare against virtual functions.

Daniel Gehriger
  • I used the new in there to test both functionalities; the results were similar when comparing just the functions or just the allocation. I.e. over 5,000,000 iterations, on average the virtualised tests were 9% slower (tested 10 times and averaged, so really 50,000,000). – hiddensunset4 Feb 02 '11 at 08:23
  • Yes, but you aren't allocating the same amount of memory. So anything may happen due to the heap manager. – Daniel Gehriger Feb 02 '11 at 08:31
  • How am I allocating memory? The declaration was on the stack, it was purely the function calls... (note my wording: "the results were similar when comparing just the functions or just the allocation"). – hiddensunset4 Feb 02 '11 at 09:17
  • In `CS *abc = new CS;`, you are allocating `CS` on the heap. – Daniel Gehriger Feb 02 '11 at 09:21
  • And note my comment (again): "when comparing just the functions". I did separate tests. Check my comments to Charles Bailey's response. – hiddensunset4 Feb 02 '11 at 09:44
  • Your code is full of functions, so your comment could mean anything... Now that I know that you have since modified your code, it's clear. For me, it's about precision in timing. Maybe branch prediction is also part of it. But the real point is still that you are comparing dynamic dispatch against static dispatch. Implement dynamic dispatch manually (w/o virtual functions), and then compare the result. – Daniel Gehriger Feb 02 '11 at 09:50
0

In my opinion, when there was a small number of loops, maybe no context switching occurred, but when you increased the number of loops, there is a very strong chance that context switching took place and dominated the readings. For example, if the first program takes 1 sec and the second program 3 secs, but a context switch adds 10 secs to each, then the measured ratio is 13/11 instead of 3/1.

Ritesh
0

When using too few iterations, there is a lot of noise in the measurement. The gettimeofday function is not going to be accurate enough to give you good measurements for only a handful of iterations, not to mention that it records total wall time (which includes time spent when preempted by other threads).

Bottom line, though, you shouldn't come up with some ridiculously convoluted design to avoid virtual functions. They really don't add much overhead. If you have incredibly performance critical code and you know that virtual functions make up most of the time, then perhaps it's something to worry about. In any practical application, though, virtual functions won't be what's making your application slow.

Michael Aaron Safyan
  • Thanks for a modest answer, remember as I said, this was not to optimize my application, it was purely out of interest, I only worry about application level optimization problems when using Valgrind and Gprof point them out. – hiddensunset4 Feb 02 '11 at 08:15