Why am I observing multiple inheritance to be faster than single?

Question

I have the following two files :-

single.cpp :-

#include <iostream>
#include <stdlib.h>

using namespace std;

unsigned long a=0;

class A {
  public:
    virtual int f() __attribute__ ((noinline)) { return a; } 
};

class B : public A {                                                                              
  public:                                                                                                                                                                        
    virtual int f() __attribute__ ((noinline)) { return a; }                                      
    void g() __attribute__ ((noinline)) { return; }                                               
};                                                                                                

int main() {                                                                                      
  cin>>a;                                                                                         
  A* obj;                                                                                         
  if (a>3)                                                                                        
    obj = new B();
  else
    obj = new A();                                                                                

  unsigned long result=0;                                                                         

  for (int i=0; i<65535; i++) {                                                                   
    for (int j=0; j<65535; j++) {                                                                 
      result+=obj->f();                                                                           
    }                                                                                             
  }                                                                                               

  cout<<result<<"\n";                                                                             
}

And

multiple.cpp :-

#include <iostream>
#include <stdlib.h>

using namespace std;

unsigned long a=0;

class A {
  public:
    virtual int f() __attribute__ ((noinline)) { return a; }
};

class dummy {
  public:
    virtual void g() __attribute__ ((noinline)) { return; }
};

class B : public A, public dummy {
  public:
    virtual int f() __attribute__ ((noinline)) { return a; }
    virtual void g() __attribute__ ((noinline)) { return; }
};


int main() {
  cin>>a;
  A* obj;
  if (a>3)
    obj = new B();
  else
    obj = new A();

  unsigned long result=0;

  for (int i=0; i<65535; i++) {
    for (int j=0; j<65535; j++) {
      result+=obj->f();
    }
  }

  cout<<result<<"\n";
}

I am using gcc version 3.4.6 with flags -O2

And this is the timings results I get :-

multiple :-

real    0m8.635s
user    0m8.608s
sys 0m0.003s

single :-

real    0m10.072s
user    0m10.045s
sys 0m0.001s

On the other hand, if in multiple.cpp I invert the order of class derivation thus :-

class B : public dummy, public A {

Then I get the following timings (which is slightly slower than that for single inheritance as one might expect thanks to 'thunk' adjustments to the this pointer that the code would need to do) :-

real    0m11.516s
user    0m11.479s
sys 0m0.002s

Any idea why this may be happening? There doesn't seem to be any difference in the assembly generated for all three cases as far as the loop is concerned. Is there some other place that I need to look at?

Also, I have bound the process to a specific cpu core and I am running it on a real-time priority with SCHED_RR.

EDIT:- This was noticed by Mysticial and reproduced by me. Doing a

cout << "vtable: " << *(void**)obj << endl;

just before the loop in single.cpp leads to single also being as fast as multiple clocking in at 8.4 s just like public A, public dummy.

I wouldn't expect speed of integer arithmetic to depend on the values (floating-point certainly does), but set `obj->a` to a consistent value just to make sure. — Ben Voigt, May 03 '12 at 20:50
Setting it to 5. Taking that as input but yes, it's 5 for all the run-cases. But as you yourself point out, it shouldn't matter. — owagh, May 03 '12 at 20:52
When do you create an instance of object A? I only see a new instance of B not A. — andre, May 03 '12 at 20:54
@owagh: I'm not talking about the value of the global, which is read from the console. I'm talking about `obj->a`, which is never assigned and therefore is indeterminate. — Ben Voigt, May 03 '12 at 20:55
There is no a in any local scope. So it should (and does) return the global. — owagh, May 03 '12 at 20:56
I'm not able to reproduce these numbers on a Core i7 with x86 Ubuntu VM with GCC 4.4.3. What CPU are you using? x86 or x64? — Mysticial, May 03 '12 at 20:58
@owagh: Or wait, you're using *value initialization*, aren't you. So members should be zeroed. But `B::f()` certainly does not return the global. — Ben Voigt, May 03 '12 at 21:01
@Mysticial I'll try it on another machine with a different compiler when I get home from work and post the results. — owagh, May 03 '12 at 21:03
@owagh I have a whole bunch of machines with different OS I can try this on. I'll see what I find. — Mysticial, May 03 '12 at 21:04
Alright. Reproduced on Core i7 920 Windows 7 - VS2010 x64. 7.4 vs. 8.7 seconds. — Mysticial, May 03 '12 at 21:05
@Mysticial I'm guessing you mean 7.4 for multiple vs 8.7 for single? — owagh, May 03 '12 at 21:08
@ Ben Voigt Sorry about that. I was just testing something. The code I first noticed this with didn't have int a,b,c; in the classes. Anyways, I'll update the code but the timings are still the same. — owagh, May 03 '12 at 21:18
reproduced on g++4.2.1, osx 10.7.3, i5; however, the single takes 7sec, multi takes 9sec and after changing ordering of base classes -- 16sec! — tomasz, May 03 '12 at 21:34
@tomasz I'm not sure that you've reproduced it precisely. I'm observing single to be *slower* than multi. And yes your results are exactly what I'd expect. Single is fastest, then one version of multi with class A coming first and then the other one with class dummy coming first being the slowest — owagh, May 03 '12 at 21:44
@owagh not sure why I should expect twice the running time while putting `dummy` in front of `A` (I don't know much about implementing vtable though). Anyway, I have a difference in assembly! My `f` has some kind of jmp into another function (with a comment TAILCALL) and it makes it twice as long. I'm scared. — tomasz, May 03 '12 at 21:53
@tomasz Not twice. Maybe 7,9,9 would be expected... Hence I said that it's very interesting. — owagh, May 03 '12 at 21:55
There's definitely something very funny going on with GCC. Printing out the vtable causes Single to run as fast as Multiple... uh... — Mysticial, May 03 '12 at 22:19
`cout << "vtable: " << *(void**)obj << endl;` If you put it before the loops, it makes Single run as fast as Multiple. It looks like it's messing with the register allocator in a way that eliminates that slow down... wtf... But the slowdown has to be data/address-dependent because the assembly for the loops is identical in both Single and Mulitiple. — Mysticial, May 03 '12 at 22:26
@Mysticial Yes! I see exactly the same thing. Maybe the cpu doesn't explicity load the v-table into it's instruction cache for the single case unless you do the cout? (Just throwing whatever random ideas come to mind) — owagh, May 03 '12 at 22:33

score 27 · Answer 1 · edited May 23 '17 at 10:30

Note, this answer is highly speculative.

Unlike some of my other answers to questions of the type "Why is X slower than Y", I've been unable to provide solid evidence to backup this answer.

After tinkering with this for about an hour now, I think it's due to the address alignment of three things:

The address of obj
The address of the Virtual Method Table of A
The address of function f()

(owagh's answer also hints at the possibility of instruction alignment.)

The reason why multiple inheritance is slower than the single inheritance is not because it is "magically" fast, but because the single inheritance case is running into either a compiler or a hardware "hiccup".

If you dump out the assembly for the single and multiple inheritance cases, they are identical (register names and everything) within the nested loop.

Here's the code I compiled:

#include <iostream>
#include <stdlib.h>
#include <time.h>
using namespace std;
unsigned long a=0;


#ifdef SINGLE
class A {
  public:
    virtual int f() { return a; } 
};

class B : public A {
  public:
    virtual int f() { return a; }                                      
    void g() { return; }                                               
};       
#endif

#ifdef MULTIPLE
class A {
  public:
    virtual int f() { return a; }
};

class dummy {
  public:
    virtual void g() { return; }
};

class B : public A, public dummy {
  public:
    virtual int f() { return a; }
    virtual void g() { return; }
};
#endif

int main() {
    cin >> a;
    A* obj;
    if (a > 3)
        obj = new B();
    else
        obj = new A();

    unsigned long result = 0;


    clock_t time0 = clock();

    for (int i=0; i<65535; i++) {
        for (int j=0; j<65535; j++) {
            result += obj->f();
        }
    }      

    clock_t time1 = clock();   
    cout << (double)(time1 - time0) / CLOCKS_PER_SEC << endl;    

    cout << result << "\n";
    system("pause");  //  This is useless in Linux, but I left it here for a reason.
}

The assembly for the nested loop is identical in both single and multiple inheritance cases:

.L5:
    call    clock
    movl    $65535, %r13d
    movq    %rax, %r14
    xorl    %r12d, %r12d
    .p2align 4,,10
    .p2align 3
.L6:
    movl    $65535, %ebx
    .p2align 4,,10
    .p2align 3
.L7:
    movq    0(%rbp), %rax
    movq    %rbp, %rdi
    call    *(%rax)
    cltq
    addq    %rax, %r12
    subl    $1, %ebx
    jne .L7
    subl    $1, %r13d
    jne .L6
    call    clock

Yet the performance difference I see is:

Single: 9.4 seconds
Multiple: 8.06 seconds

Xeon X5482, Ubuntu, GCC 4.6.1 x64.

This leads me to the conclusion that the difference must be data dependent.

If you look at that assembly, you'll notice that the only instructions that could have variable latency are the loads:

    ; %rbp = vtable

movq    0(%rbp), %rax   ; Dereference function pointer from vtable
movq    %rbp, %rdi
call    *(%rax)         ; Call function pointer - f()

followed by a few more memory accesses inside the call the f().

It just happens to be that in the single inheritance example, the offsets of the aforementioned values are not favorable to the processor. I have no idea why. ~~But I had to suspect something, it'd be cache-bank conflicts in a similar manner to region 2 in the diagram of this question.~~

By rearranging the code and adding dummy functions, I can change these offsets - which in a lot of cases will eliminate this slow down and make the single inheritance as fast as the multiple inheritance case.

For example, removing the system("pause") inverts the times:

#ifdef SINGLE
class A {
  public:
    virtual int f() { return a; } 
};

class B : public A {
  public:
    virtual int f() { return a; }                                      
    void g() { return; }                                               
};       
#endif

#ifdef MULTIPLE
class A {
  public:
    virtual int f() { return a; }
};

class dummy {
  public:
    virtual void g() { return; }
};

class B : public A, public dummy {
  public:
    virtual int f() { return a; }
    virtual void g() { return; }
};
#endif

int main() {
    cin >> a;
    A* obj;
    if (a > 3)
        obj = new B();
    else
        obj = new A();

    unsigned long result = 0;


    clock_t time0 = clock();

    for (int i=0; i<65535; i++) {
        for (int j=0; j<65535; j++) {
            result += obj->f();
        }
    }      

    clock_t time1 = clock();   
    cout << (double)(time1 - time0) / CLOCKS_PER_SEC << endl;    

    cout << result << "\n";
//    system("pause");
}

Single: 8.06 seconds
Multiple: 9.4 seconds

I don't think cache bank conflicts are to blame, the instruction cache is fundamentally different than the data L1 (predecoding, instruction boundaries etc.) I would rather suspect something strange going on with branch prediction or the alignment of functions, but this is even more speculative — Gunther Piez, May 04 '12 at 00:12
Yeah, it was obviously just a wild guess. It's hard to test anything because any modification to the code will change all the offsets. — Mysticial, May 04 '12 at 00:14
I guess if you could pause the execution and inspect running code in memory (something like a game trainer) that may help you? — owagh, May 04 '12 at 00:40
Yeah, I figure a cycle accurate emulator will work. Not sure if they exist for current Intel machines outside of Intel itself. — Mysticial, May 04 '12 at 00:42
Well this seems to be the most sensible answer along with the explanation in your previous question. I'll upvote this but hope for a more definitive answer. — owagh, May 07 '12 at 13:31
Yeah, normally I can provide a test case that definitively shows that X is causing the difference. But here I can't because any change in the code will randomly flip the times between `8.06` and `9.4` seconds on my machine. Even printing out those addresses will do that... so. :( — Mysticial, May 07 '12 at 15:07
@Mysticial Hey check out my answer below. I think I may have figured it out. — owagh, May 07 '12 at 15:33

score 16 · Accepted Answer · answered May 07 '12 at 15:30

I think I got at least some further lead on why this may be happening. The assembly for the loops is exactly identical but the object files are not!

For the loop with the cout at first (i.e.)

cout << "vtable: " << *(void**)obj << endl;

for (int i=0; i<65535; i++) {
  for (int j=0; j<65535; j++) {
    result+=obj->f();
  }
}

I get the following in the object file :-

40092d:       bb fe ff 00 00          mov    $0xfffe,%ebx                                       
400932:       48 8b 45 00             mov    0x0(%rbp),%rax                                     
400936:       48 89 ef                mov    %rbp,%rdi                                          
400939:       ff 10                   callq  *(%rax)                                            
40093b:       48 98                   cltq                                                      
40093d:       49 01 c4                add    %rax,%r12                                          
400940:       ff cb                   dec    %ebx                                               
400942:       79 ee                   jns    400932 <main+0x42>                                 
400944:       41 ff c5                inc    %r13d                                              
400947:       41 81 fd fe ff 00 00    cmp    $0xfffe,%r13d                                      
40094e:       7e dd                   jle    40092d <main+0x3d>

However, without the cout, the loops become :- (.cpp first)

for (int i=0; i<65535; i++) {
  for (int j=0; j<65535; j++) {
    result+=obj->f();
  }
}

Now, .obj :-

400a54:       bb fe ff 00 00          mov    $0xfffe,%ebx
400a59:       66                      data16                                                    
400a5a:       66                      data16 
400a5b:       66                      data16                                                    
400a5c:       90                      nop                                                       
400a5d:       66                      data16                                                    
400a5e:       66                      data16                                                    
400a5f:       90                      nop                                                       
400a60:       48 8b 45 00             mov    0x0(%rbp),%rax                                     
400a64:       48 89 ef                mov    %rbp,%rdi                                          
400a67:       ff 10                   callq  *(%rax)
400a69:       48 98                   cltq   
400a6b:       49 01 c4                add    %rax,%r12                                          
400a6e:       ff cb                   dec    %ebx                                               
400a70:       79 ee                   jns    400a60 <main+0x70>                                 
400a72:       41 ff c5                inc    %r13d                                              
400a75:       41 81 fd fe ff 00 00    cmp    $0xfffe,%r13d
400a7c:       7e d6                   jle    400a54 <main+0x64>

So I'd have to say it's not really due to false aliasing as Mysticial points out but simply due to these NOPs that the compiler/linker is emitting.

The assembly in both cases is :-

.L30:
        movl    $65534, %ebx
        .p2align 4,,7                   
.L29:
        movq    (%rbp), %rax            
        movq    %rbp, %rdi
        call    *(%rax)
        cltq    
        addq    %rax, %r12                                                                        
        decl    %ebx
        jns     .L29
        incl    %r13d 
        cmpl    $65534, %r13d
        jle     .L30

Now, .p2align 4,,7 will insert data/NOPs until the instruction counter for the next instruction has the last four bits 0's for a maximum of 7 NOPs. Now the address of the instruction just after p2align in the case without cout and before padding would be

0x400a59 = 0b101001011001

And since it takes <=7 NOPs to align the next instruction, it will in fact do so in the object file.

On the other hand, for the case with the cout, the instruction just after .p2align lands up at

0x400932 = 0b100100110010

and it would take > 7 NOPs to pad it to a divisible by 16 boundary. Hence, it doesn't do that.

So the extra time taken is simply due to the NOPs that the compiler pads the code with (for better cache alignment) when compiling with the -O2 flag and not really due to false aliasing.

I think this resolves the issue. I am using http://sourceware.org/binutils/docs/as/P2align.html as my reference for what .p2align actually does.

+1 I like this. I also considered the possibility of instruction alignment. But I was never able to test it. — Mysticial, May 07 '12 at 15:39

score 5 · Answer 3 · answered May 04 '12 at 00:22

5

This answer is even more speculative.

After tinkering with this for 5 minutes and reading Mysticials answer, the conclusion is that it is a hardware issue: The code generated in the hot loop is basically the same, so it isn't an issue with the compiler, that leaves the hardware as the only suspect.

Some random thoughts:

Branch prediction
Alignment or partial aliasing of branch (=function) target addresses
L1 cache running hot after reading the same address all the time
Cosmic rays

answered May 04 '12 at 00:22

Gunther Piez

29,760
6
71
103

Can you elaborate on what you mean by alignment or partial aliasing and how that might impact things? L1 cache running hot should actually make it faster rather than slower? – owagh May 04 '12 at 01:43
@owagh [This question that I linked to](http://stackoverflow.com/questions/8547778/why-is-one-loop-so-much-slower-than-two-loops) from my answer is probably the most notorious example on StackOverflow of where alignment and partial aliasing can decimate performance. So it might be a good read. How it applies to your question is unclear. Any attempt to test a hypothesis requires modifying the code which will shift the alignment of everything. So it's a moving target that I cannot hit. (as I've shown by commenting out an irrelevant line of code to invert the performance numbers...) – Mysticial May 04 '12 at 02:27
1

@owagh While writing a program to test random access time of the caches and the memory, I noticed that when testing an access pattern of exactly 2 Mib on a Core2 with 6 MiB 2nd level cache, the temperature jumped close to the throttle frequency. This happened only while running on one of the four cores and with exactly 2 MiB, not 4, not 1. This would almost make a good question here :-) – Gunther Piez May 04 '12 at 09:08
So I suppose we can assume it has something to do with the alignment but we're not exactly sure *what* – owagh May 04 '12 at 15:20

Ben Voigt · Answer 4 · 2012-05-03T20:59:22.773

1

With your current code, the compiler is free to devirtualize the calls to obj->f(), since obj cannot have any dynamic type other than class B.

I'd suggest

if (a>3) {
    B* objb = new B();
    objb->a = 5;
    obj = objb;
}
else
    obj = new A();

edited May 03 '12 at 20:59

answered May 03 '12 at 20:52

Ben Voigt

277,958
43
419
720

Because there's no else statement? – owagh May 03 '12 at 20:55
Oh I see. Smart compiler - wasn't aware it could do that. – David May 03 '12 at 20:55
@Dave: Because in the original code, if `obj = new B()` isn't executed, it's using an indeterminate pointer, causing undefined behavior. Undefined behavior gives the compiler all kinds of freedom to do optimizations. Not saying the compiler is that smart, just that the C++ language standard allows this optimization. – Ben Voigt May 03 '12 at 20:57
Nopes, still the same though... Updating my question to reflect that. – owagh May 03 '12 at 20:58

score 0 · Answer 5 · answered May 03 '12 at 20:51

0

My guess is class B : public dummy, public A has unfavorable alignment as far as A is concerned. Pad dummy to 16 bytes and see if there is a difference.

answered May 03 '12 at 20:51

Anycorn

50,217
42
167
261

1

I think that part was expected. The issue is the timing difference between `class B : public A` and `class B : public A, public dummy` in favor of multiple inheritance. – Ben Voigt May 03 '12 at 20:53
Yes, this was understood: "as one might expect thanks to 'thunk' adjustments". – Joe May 03 '12 at 20:54
@ Ben Voigt That is precisely what I'd expect in both situations. – owagh May 03 '12 at 20:55
@Anycorn [No Problem](http://stackoverflow.com/questions/2641489/what-is-a-thunk). – Joe May 03 '12 at 21:22

Why am I observing multiple inheritance to be faster than single?

5 Answers5