7

My question is basically stated in the title, but let me elaborate.

Question: Maybe worth rephrasing: how complicated or simple does a virtual method have to be for the dispatch mechanism to become a considerable overhead? Are there any rules of thumb for this? E.g., if it takes 10 minutes, uses I/O, complex if statements, memory operations, etc., it's not a problem. Or: if you write `virtual double get_r() { return sqrt(x*x + y*y); }` and call it in a loop, you will have trouble.

I hope the question is not too general, as I am looking for general but concrete technical answers: either "it's hard/impossible to tell", or "virtual calls cost roughly this many cycles, math costs this many, I/O this many".

Maybe some technical people know general numbers to compare, or have done such an analysis and can share their conclusions. Embarrassingly, I don't know how to do that fancy assembly-level analysis myself =/.

I would also like to give some rationale behind it, as well as my use-case.

I think I have seen more than a few questions where people refrain from using virtuals as if they were open fire in a forest during a drought, for the sake of performance, and just as many people asking them, "Are you absolutely sure that virtual-call overhead is really an issue in your case?".

In my recent work I ran into a problem which, I believe, can be placed on either side of that line.

Also bear in mind that I am not asking how to improve the implementation of the interface; I believe I know how to do that. I'm asking whether it's possible to tell when to do it, or which approach to choose right off the bat.

Use-case:

I run some simulations. I have a class which basically provides the run environment: there is a base class, and more than one derived class that define different workflows. The base collects the common logic and assigns I/O sources and sinks; the derived classes define particular workflows, more or less by implementing RunEnv::run(). I think this is a valid design. Now let's imagine that the objects the workflow operates on can live in a 2D or a 3D space. The workflows are common/interchangeable in both cases, so the objects we are working on can share a common interface, albeit one with very simple methods like Object::get_r(). On top of that, let's have some stat logger defined for the environment.

Originally I wanted to provide some code snippets, but it ended up as 5 classes with 2-4 methods each, i.e. a wall of code. I can post it on request, but it would roughly double the length of the question.

Key points are: RunEnv::run() is the main loop, usually very long (5 min - 5 h). It provides basic time instrumentation and calls RunEnv::process_iteration() and RunEnv::log_stats(). All are virtual. The rationale: I can derive from RunEnv and redesign run(), for example for different stop conditions; I can redesign process_iteration(), for example to use multi-threading if I have to process a pool of objects, or to process them in various ways; and different workflows will want to log different stats. RunEnv::log_stats() is just a call that outputs already-computed interesting stats into a std::ostream, so I guess making it virtual has no real impact.
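To give a feeling for the overall shape without posting the full wall of code, here is a stripped-down sketch; the class and member names below are placeholders, not the actual implementation:

    #include <iostream>

    // Sketch only: placeholder names and bodies, not the real code.
    class RunEnv {
    public:
        virtual ~RunEnv() {}

        // Main loop (usually 5 min - 5 h); time instrumentation omitted here.
        virtual void run() {
            while (!stop_condition()) {
                process_iteration();
                log_stats(std::cout);
            }
        }

    protected:
        virtual bool stop_condition() = 0;           // derived workflows choose when to stop
        virtual void process_iteration() = 0;        // e.g. multi-threaded in one derivative
        virtual void log_stats(std::ostream&) = 0;   // writes already computed stats
    };

    // One concrete workflow; others override run()/process_iteration() differently.
    class SimpleWorkflow : public RunEnv {
        bool done = false;
        bool stop_condition() override { return done; }
        void process_iteration() override { /* ... */ done = true; }
        void log_stats(std::ostream& os) override { os << "iteration finished\n"; }
    };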

Now let's say the iteration works by calculating the distance of the objects from the origin, so the interface is double Obj::get_r();. Obj has implementations for the 2D and 3D cases. In both cases the getter is simple math with 2-3 multiplications and additions.

I also experimented with different memory handling. E.g., sometimes the coordinate data was stored in private member variables and sometimes in a shared pool, so even get_x() could be made virtual, with implementations like `get_x() { return x; }` or `get_x() { return pool[my_num*dim + x_offset]; }`. Now imagine calculating something with `get_r() { return sqrt(get_x()*get_x() + get_y()*get_y()); }`. I suspect virtuality here would kill performance.
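For concreteness, a sketch of the two storage variants (made-up names, 2D case only, not the real code):

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Sketch: the common interface plus the two storage variants described above.
    struct Obj {
        virtual ~Obj() {}
        virtual double get_x() const = 0;
        virtual double get_y() const = 0;
        // Composed of virtual getters -- the combination I suspect hurts the most.
        virtual double get_r() const   // a 3D variant would override this as well
        { return std::sqrt(get_x()*get_x() + get_y()*get_y()); }
    };

    // Variant 1: coordinates kept in private members.
    struct Obj2D : Obj {
        double x, y;
        double get_x() const override { return x; }
        double get_y() const override { return y; }
    };

    // Variant 2: coordinates kept in a shared pool.
    struct PooledObj2D : Obj {
        const std::vector<double>* pool;
        std::size_t my_num;
        enum { dim = 2, x_offset = 0, y_offset = 1 };
        double get_x() const override { return (*pool)[my_num*dim + x_offset]; }
        double get_y() const override { return (*pool)[my_num*dim + y_offset]; }
    };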

luk32
  • `sqrt()` is already expensive enough that you probably won't feel the virtual call. – Mysticial Jul 08 '13 at 17:08
  • When a hater of virtual calls is reviewing the code. – Hot Licks Jul 08 '13 at 17:09
  • @Mysticial That is kind of what I am looking for. I have never seen an answer saying how much of an actual overhead the vtable look-up is versus typical tasks such as `sqrt()`, O(1) int/float math, a short for-loop, simple I/O, etc. – luk32 Jul 08 '13 at 17:13
  • Since you're after performance, is it possible to just not need the square root? If you're comparing distances, or if you're putting them into equations that will just square it again (e.g. gravitational/electromagnetic), then you can avoid the square root completely. – Mysticial Jul 08 '13 at 17:17
  • @Mysticial I know. I do not want to make the question about the underlying problem and its implementation. I just worked on something similar so it was easier to make up an example using this. – luk32 Jul 08 '13 at 17:20
  • @Mysticial On most modern machines, `sqrt` isn't really any more expensive than division. (Of course, floating point division isn't the fastest operation around either.) And because it uses different parts of the processor than the call or the return does, it can often be run in parallel. On the other hand, the cost of virtuality varies enormously from one processor to the next, so you can't really make any concrete statements one way or the other. – James Kanze Jul 08 '13 at 17:29
  • @JamesKanze On all the mainstream desktop processors that I work with, a double-precision `sqrt()` takes upwards of 20 cycles and a memory load (an indirection) is on the order of 3 cycles. Yeah, you can probably hide some of that `sqrt()` latency, but some systems it shares resources with the FP multiplier and adder. So if you've got those running in the background... In any case, I've never seen a situation where the latency of a sqrt or a division could be completely hidden - save for a cache miss, but that's a different story. – Mysticial Jul 08 '13 at 17:38
  • @Mysticial I've had a number of cases where the latency of a division was completely invisible. But it all depends. The real point is that given today's processors, you can't say much without measuring _in the actual context the function is being used_. (Your remark about latency is very a propos in that regard.) – James Kanze Jul 08 '13 at 17:43
  • Related: http://stackoverflow.com/questions/7241922/how-has-cpu-architecture-evolution-affected-virtual-function-call-performance – Matteo Italia Jul 08 '13 at 17:44
  • @JamesKanze What were you dividing by? Divisions are a lot faster if the divisor has a short bit-length. But yes, your point still holds. I haven't seen much outside of modern desktop processors. – Mysticial Jul 08 '13 at 17:45
  • @Mysticial Good question. I don't remember (but it was a floating point division, and not a constant). – James Kanze Jul 08 '13 at 17:51

4 Answers

8

A virtual method call in C++ on x86 yields code similar to this (single inheritance):

    mov ecx,[esp+4]     // load the `this` pointer
    mov eax,[ecx]       // load the pointer to the vtable
    jmp [eax]           // indirect jump to the function address stored in the vtable

Compared to a non-virtual member function call, this costs you one extra mov (the vtable load) and an indirect rather than a direct jump. So, in the case of single inheritance, the performance hit is negligible.

If you have multiple inheritance or, worse, virtual inheritance, the virtual calls can be much more complex. But that is more a problem of the class hierarchy and architecture.

The rule of thumb:

If the body of the method is many times (>100x) slower than a single mov instruction, just use virtual and don't bother. Otherwise, profile your bottlenecks and optimize.

Update:

For multiple/virtual inheritance cases check out this page: http://www.lrdev.com/lr/c/virtual.html
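If you want a rough number for your own machine, a crude micro-benchmark sketch might look like the one below. Be aware that the compiler may still devirtualize or inline the call, and the branch predictor will quickly learn the target, so treat the result as an order-of-magnitude hint at best:

    #include <chrono>
    #include <cmath>
    #include <cstdio>
    #include <memory>

    // Crude micro-benchmark sketch: virtual vs. non-virtual getter in a tight loop.
    struct Base {
        virtual ~Base() {}
        virtual double get_r() const = 0;
    };
    struct Derived : Base {
        double x, y;
        Derived(double x_, double y_) : x(x_), y(y_) {}
        double get_r() const override { return std::sqrt(x*x + y*y); }
    };
    struct Plain {
        double x, y;
        double get_r() const { return std::sqrt(x*x + y*y); }
    };

    int main() {
        const long N = 100000000L;
        std::unique_ptr<Base> b(new Derived(3.0, 4.0));
        Plain p = { 3.0, 4.0 };
        double sum = 0.0;

        using namespace std::chrono;
        steady_clock::time_point t0 = steady_clock::now();
        for (long i = 0; i < N; ++i) sum += b->get_r();   // virtual call
        steady_clock::time_point t1 = steady_clock::now();
        for (long i = 0; i < N; ++i) sum += p.get_r();    // non-virtual, likely inlined
        steady_clock::time_point t2 = steady_clock::now();

        // Print sum so the loops cannot be optimized away entirely.
        std::printf("virtual: %lld ms, non-virtual: %lld ms (sum=%f)\n",
                    (long long)duration_cast<milliseconds>(t1 - t0).count(),
                    (long long)duration_cast<milliseconds>(t2 - t1).count(),
                    sum);
        return 0;
    }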

Sergey K.
  • Ok, could you please elaborate? Let's assume I make `get_r()` virtual for the 2D/3D interfaces, and then I implement the other memory layout, so the virtual `get_r()` calls a virtual `get_x()`. I think this is still at least a semi-real-world example. – luk32 Jul 08 '13 at 17:16
  • If you have a hierarchy of classes with a `get_r()` method, you will end up making it polymorphic anyway. So why bother? If it is only a single class, or `get_r()` is not polymorphic, make it `inline`. – Sergey K. Jul 08 '13 at 17:17
  • You said that multiple inheritance makes things more complicated. I tried to make up an example that has 2 levels of inheritance. I would also like to know some numbers, like how many instructions `cout << "Hello";` takes, etc., or how to check that myself so I can post it, just for comparison with the 100x mov. – luk32 Jul 08 '13 at 17:23
  • On superscalar architectures the problem is not the `mov`, but the `jmp` to a non-fixed address, which introduces a considerable slowdown if the branch predictor hasn't got any clue about it yet or if it makes a wrong guess. – Matteo Italia Jul 08 '13 at 17:32
  • Anyway, in situations where performance matters (usually inner loops), if the type of the object doesn't change every moment, the branch predictor should pick it up quickly, thus removing the branch-misprediction penalty; also, the compiler is usually smart enough to avoid virtual dispatch in many occasions where it can determine without doubt the type of the object at compile time (i.e. you pay the penalty of `virtual` only when `virtual` dispatch is actually needed). – Matteo Italia Jul 08 '13 at 17:40
  • @SergeyK. Since you gave the most technical answer, I would like to add a little more to make it complete for me... So is it safe to assume the `virtual` overhead is even lower than `return tab[pos];` in the case of a cache miss, or even in general? Maybe it's also comparable to something like `for(size_t i=0; i<10; ++i) x*=x;`. If you can say it's faster than a cache miss, or a simple conditional loop, or the simplest I/O, I am content. – luk32 Jul 08 '13 at 17:47
  • In the general case it can't be faster than a cache miss, because it could involve a cache miss. Depends how predictable it is, how much the called function trashes the cache, etc... If the call site is generally monomorphic, it should be pretty predictable. If it's heavily polymorphic, not so much. – Catfish_Man Jul 08 '13 at 18:28
  • @Catfish_Man Umm... I made an implicit assumption that if it gets called often, it should not miss too much, and if it doesn't get called often, it will not pose a real performance issue. Also, I believe that if the called function trashes the cache a lot, it is probably complicated enough to be much more expensive than the overhead imposed by virtual. – luk32 Jul 08 '13 at 18:42
  • That does seem likely, especially given how large caches are these days – Catfish_Man Jul 08 '13 at 20:24
8

Are there any rules of thumb for this?

The best, most general rule of thumb for questions like this one is:

measure your code before optimizing

Trying to make your code perform well without measuring is a sure path to unnecessarily complex code that's optimized in all the wrong places.

So, don't worry about the overhead of a virtual function until you have some solid evidence that the virtual is the problem. If you do have such evidence, then you can work to remove the virtual in that case. More likely, though, you'll find that finding ways to speed up your calculations, or to avoid calculating where you don't need to, will yield much larger performance improvements. But again, don't just guess -- measure first.
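As a minimal way to start measuring, something as simple as a scope timer around the suspect region already tells you whether the virtual call is even visible next to the rest of the work. A minimal sketch, assuming a C++11 compiler (a real profiler will, of course, tell you much more):

    #include <chrono>
    #include <iostream>

    // Minimal scope-timer sketch; prints how long the enclosing scope took.
    struct ScopedTimer {
        const char* label;
        std::chrono::steady_clock::time_point start;
        explicit ScopedTimer(const char* l)
            : label(l), start(std::chrono::steady_clock::now()) {}
        ~ScopedTimer() {
            using namespace std::chrono;
            long long us = duration_cast<microseconds>(steady_clock::now() - start).count();
            std::cerr << label << ": " << us << " us\n";
        }
    };

    // Usage: wrap the suspect region, then compare runs with and without virtual dispatch.
    // {
    //     ScopedTimer t("process_iteration");
    //     env.process_iteration();
    // }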

Caleb
  • Haha, of course I know the "don't bother before you measure" rule. The thing is, I am not very familiar with measuring such overheads. Of course that is a matter for another question. I was just wondering if there are rules that make the decision process simpler. I wanted to make the question more about the design than about optimizing already-written code, and I can't measure code I have not written yet! I am asking experts who wrote and measured whether they have some general conclusions. Nonetheless, I understand the write-virtual-then-measure approach. – luk32 Jul 08 '13 at 17:35
3

First, of course, any difference will depend on the compiler, the architecture, etc. On some machines the difference between a virtual call and a non-virtual one will hardly be measurable; on at least one other it will (or would; my experience with that machine is rather ancient) completely purge the pipeline, since there is no branch prediction for indirect jumps.

On most processors, the real cost of virtual functions is the loss of the ability to inline, with the resulting loss of other optimization possibilities. In other words, the cost will in fact depend on the context in which the function is called.
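To illustrate the point with a sketch (what any particular compiler actually does will, of course, vary):

    #include <cmath>
    #include <cstddef>
    #include <vector>

    struct Point {
        double x, y;
        double get_r() const { return std::sqrt(x*x + y*y); }   // non-virtual
    };

    struct PointBase {
        virtual ~PointBase() {}
        virtual double get_r() const = 0;                        // virtual
    };

    // Non-virtual: the call is usually inlined away, leaving just loads, multiplies
    // and a sqrt, and the loop around it can then be optimized further.
    double sum_r(const std::vector<Point>& pts) {
        double s = 0.0;
        for (std::size_t i = 0; i < pts.size(); ++i)
            s += pts[i].get_r();
        return s;
    }

    // Virtual through a base pointer: usually an indirect call the compiler cannot
    // inline (unless it can prove the dynamic type), which also blocks the
    // optimizations around it.
    double sum_r(const std::vector<const PointBase*>& pts) {
        double s = 0.0;
        for (std::size_t i = 0; i < pts.size(); ++i)
            s += pts[i]->get_r();
        return s;
    }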

More to the point, however: virtual functions and non-virtual functions have different semantics. So you can't choose: if you need virtual semantics, you have to use virtual; if you don't need virtual semantics, you can't use virtual. So the question really doesn't come up.

James Kanze
  • Can you please elaborate on "the difference between a virtual call and a non-virtual one will hardly be measurable" with more technical details, without rephrasing what has already been said in the first answer? – Sergey K. Jul 08 '13 at 17:28
  • Well I saw interfaces implemented with ABCs as well as templates, just to avoid virtuals. I believe you can design your code to choose whether to use or to avoid virtuals, at least in some places. – luk32 Jul 08 '13 at 17:28
  • For non-x86 this can be an issue. Profiling is required anyway. – Sergey K. Jul 08 '13 at 17:30
  • What other details do you mean? Conceptually, there could even be hardware where there was no difference. On a Sun Sparc, depending on the case, there would often be about one clock difference: less than the time needed to set up the local stack frame. Given the cost of passing parameters and setting up the local stack frame on an Intel, I doubt that you could measure the difference either, even if the function was empty. – James Kanze Jul 08 '13 at 17:39
1

The most basic recommendation, which again, as others have stated, you should verify by profiling in your specific application and environment, is to avoid virtual calls in tight loops.

Note that if you actually need polymorphic behavior, virtual member functions will probably behave better than most alternatives. The exception can be when you have a collection of polymorphic but homogeneous types (the collection could hold any of the polymorphic types, but all its elements are the same type, whichever type that happens to be). You're then better off moving the polymorphic behaviour outside of the loop.

Taking the classic dumb bad-OO example using shapes, you're better off with:

// "fast" way
struct Shape {
  virtual void DrawAll(Collection) = 0;
};

struct Rectangle : public Shape {
  virtual void DrawAll(Collection collection) {
    for (const auto& rect : collection)
      do_rectangle_draw();
  }
};

struct Circle : public Shape {
  virtual void DrawAll(Collection collection) {
    for (const auto& circle : collection)
      do_circle_draw();
  }
};

than the more naive version, which might look something like this:

// "slow" way
struct Shape {
  virtual void DrawSelf() = 0;

  void DrawAll(Collection collection) {
    for (const auto& shape : collection)
      shape.DrawSelf(); // virtual invocation for every item in the collection!
  }
};

Again, this only works with homogeneous types in collections. If your Collection could contain both Rectangles and Circles, then you're going to need to discriminate on each instance during iteration as to which drawing method to use. A virtual function will probably be faster than a function pointer or a switch statement (but profile to be sure).

The goal with the above code was to move the polymorphic behaviour out of the loop. This is not always possible, but when it is, it will usually amount to some level of performance win. For a large number of objects (say, a particle simulator) the performance difference could be quite noticeable.

If a function is not called many thousands of times inside a loop, you probably won't notice any measurable difference between virtual and non-virtual functions. But profile it to test and be sure if you think it matters.

Sean Middleditch