2

Which would be a more efficient dispatch method for making my fetch-decode-execute times that little bit quicker?

For simplicity, I've kept this to a minimum, like operations operate on 1 byte operands and there are only two, for example.

The method I'm using at the moment (simplified) is:

typedef unsigned char byte;

vector<byte> _program = { INST::PUSH, 32, INST::POP};

enum INST {
    PUSH =0, /*index=0*/
    POP =1, /*index=1*/
}

//DISPATCHING METHOD #1
switch (curr_instruction) {
    case INST::PUSH: {
        /*declared inline*/ _push_to_stack(_program[instr_ptr+1]);
    }
    case INST::POP: {
        /*declared inline*/ _pop_stack();
    }
}

OR using a function pointer table to execute each instruction in the 'program' (vector of bytes/ vector _program), like so:

typedef void (*voidptr)();

void hndl_push(){
    /*declared inline*/ _push_to_stack(_program[instr_ptr+1]);
}
void hndl_push(){
    /*declared inline*/ _pop_stack();
}

funcptr handlers[2] = {&hndl_push /*index=0*/, & hdnl_pop /*index=1*/}'
vector<byte> _program = { INST::PUSH, 32, INST::POP};

size_t instr_ptr=0;

//DISPATCHING METHOD #2
while (instr_ptr != _program.size()){
    instr_ptr++;
    _handlers[instr_ptr]();
}

I am using the VC++ (Visual Studio) compiler, the 2015 version.

Which of these is converted into more efficient assembler with the least overhead, or are they the same?

Thank you in advance!

Seki
  • 11,135
  • 7
  • 46
  • 70
SolaGratia
  • 309
  • 4
  • 18

3 Answers3

5

The only way to know which would be faster is to measure.

The optimizer may be able to do quite a bit with either technique. Dense switch statements are often reduced to a jump table, and the function calls may be inlined, so that could be the fastest approach.

But if, for whatever reason, the optimizer cannot inline or if the switch statement becomes a cascaded of if-else statements, then the function pointer calls may be faster.

Wikipedia has a decent article on threaded code, which describes various techniques to dispatch opcodes in a virtual machine or interpreter.

Adrian McCarthy
  • 45,555
  • 16
  • 123
  • 175
  • A jump table apparently being the fastest possible way to implement dispatching (well, conceptually/not considering architecture implementations), is the latter example, then, as fast as a jump table? Specific to the VC++ compiler which was the question. Thanks. – SolaGratia Mar 18 '16 at 17:30
  • 1
    Function calls have overhead: pushing parameters, a prolog, epilog, exception handling, frame pointers. In mainline code, the optimizer may be able to eliminate much of that overhead in ideal circumstances. But calling functions through pointers limits what the optimizer can do. So it'll never be as lean as a jump table. C++ doesn't give a direct way to create a jump table, but a simple switch statement may compile into one. So my guess is #1 is faster, but the only way to know is to measure because myriad factors affect the optimizer. – Adrian McCarthy Mar 18 '16 at 18:36
0

How could the second solution possibly be quicker than the first, but best the compiler could convert the second into the first anyway.

As a side note you need to change the program pointer dependant on the opcode.

  • I don't know, that's why I asked the question :] Also, this example is criminally simplified but the gist is the same: switch-case, vs function pointer array dispatch. My example doesn't exclude an op-code handler handling the program counter. – SolaGratia Mar 18 '16 at 17:28
0

Indirect branch prediction is hard, but at least there's only the one unconditional branch. The branch predictor needs to correctly predict the branch target address to be able to keep the pipeline fed.

However, an unpredictable conditional branch is bad, too. With enough cases, a single indirect branch will do better than multiple conditional branches, so that's what compilers do under the hood. With just two cases, you will almost certainly get better results from letting the compiler choose how to implement the switch.

Conditional branch predictors in some CPUs might be better at recognizing simple (but non-trivial) patterns than indirect branch predictors, but Intel SnB-family CPUs at least can recognize some target-address patterns for indirect branches. (Agner Fog's microarch pdf has a bit of info on branch predictors.)


See Lookup table vs switch in C embedded software, including my answer on that question where I posted some x86 compiler output.

Note the fact that clang will use a jump table to branch to a call instruction, rather than putting the function pointers themselves into an array. So when a jump table is the right choice, implementing it yourself with an array of function pointers will make better code than current clang.

Your code is exactly the same case: dispatching to a lot of functions that all have the same prototype and (lack of) args, so they can go into an array of function pointers.

Community
  • 1
  • 1
Peter Cordes
  • 328,167
  • 45
  • 605
  • 847