30

We are writing a byte-code for a high-level compiled language, and after a bit of profiling and optimization, it became clear that the current largest performance overhead is the switch statement we're using to jump to the byte-code cases.

We investigated pulling out the address of each case label and storing it in the stream of byte-code itself, rather than the instruction ID that we usually switch on. If we do that, we can skip the jump table, and directly jump to the location of code of the currently executing instruction. This works fantastically in GCC, however, MSVC doesn't seem to support a feature like this.

We attempted to use inline assembly to grab the address of the labels (and to jump to them), and it works, however, using inline assembly causes the entire function to be avoided by the MSVC optimizer.

Is there a way to allow the optimizer to still run over the code? Unfortunately, we can't extract the inline assembly into another function other than the one that the labels were made in, since there's no way to reference a label for another function even in inline assembly. Any thoughts or ideas? Your input is much appreciated, thanks!

Deduplicator
  • 44,692
  • 7
  • 66
  • 118
Trevor Sundberg
  • 1,584
  • 1
  • 14
  • 25
  • 3
    Have you tried function pointers? – Oliver Charlesworth Jun 21 '11 at 07:31
  • How about putting addresses of functions instead of addresses of labels in the bytecode? Then you have one function for each instruction ID. Unless your fetch-execute loop is in your big function-with-labels. – Cheers and hth. - Alf Jun 21 '11 at 07:33
  • 1
    If I used functions for each of the cases and used function pointers instead of label addresses, it would work. However, I feel like the function call overhead would be so great that it would nullify any performance gain, even if the function was trivial (no arguments, no return). I'll try it out though, and thanks for posting. – Trevor Sundberg Jun 21 '11 at 16:30
  • If I'm guessing correctly about the hardware you're working on, then the problem with function call overhead is the pipeline bubble caused by the indirect branch, which can't be predicted on certain in-order PowerPC processors. But you have that same bubble with a computed goto anyway -- in fact, a function pointer *is* a computed goto, at the ABI level. The function parameters will be passed in registers if you use the FASTCALL convention, so they won't contribute to overall runtime as the movs fit into the indirect bubble anyway. – Crashworks Jul 15 '11 at 06:20
  • 1
    @Rovert while this is too late for you but for future readers: when it comes to low-level performance you ought to assume _nothing_ and benchmark _everything_. You are likely right, but its something that you should check and recheck every new _architecture_ and compiler _version_ since it's likely to change. – Sled Aug 10 '15 at 16:21
  • BTW: This idea of replacing the switch-on-instructions with direct jumps-to-addresses is called _threaded code_, and Wikipedia has a decent article discussing a few approaches. – Adrian McCarthy Jan 29 '16 at 19:01
  • Jump threading in interpreters is often not needed on newer CPUs (Haswell and later), since IT-TAGE branch prediction can learn the more complex patterns. [Branch Prediction and the Performance of Interpreters - Don’t Trust Folklore](https://inria.hal.science/hal-01100647/document) from 2015. See also [How does Branch Prediction affect performance in R?](https://stackoverflow.com/q/58399395) for some other discussion and links. – Peter Cordes May 06 '23 at 17:19

4 Answers4

18

The only way of doing this in MSVC is by using inline assembly (which basically buggers you for x64):

int _tmain(int argc, _TCHAR* argv[])
{
case_1:
    void* p;
    __asm{ mov [p],offset case_1 }
    printf("0x%p\n",p);
    return 0;
}

If you plan on doing something like this, then the best way would be to write the whole interpreter in assembly then link that in to the main binary via the linker (this is what LuaJIT did, and it is the main reason the VM is so blindingly fast, when its not running JIT'ed code that is).

LuaJIT is open-source, so you might pick up some tips from it if you go that route. Alternatively you might want to look into the source of forth (whose creator developed the principle you're trying to use), if there is an MSVC build you can see how they accomplished it, else you're stuck with GCC (which isn't a bad thing, it works on all major platforms).

Crashworks
  • 40,496
  • 12
  • 101
  • 170
Necrolis
  • 25,836
  • 3
  • 63
  • 101
10

Take a look at what Erlang does for building on Windows. They use MSVC for most of the build, and then GCC for one file to make use of the labels-as-values extension. The resulting object code is then hacked to be made compatible with the MSVC linker.

http://www.erlang.org/doc/installation_guide/INSTALL-WIN32.html

goertzenator
  • 1,960
  • 18
  • 28
  • This was what we did for two decades with watcom compiler. It was perfect for i386 code to specify in/out registers and could do a lot of internal tweaking. It was compiled on watcom, disassembled, the assembly extraced and postprocessed and filled with other code. It's a good idea unless your clients start asking if they can extend the language on their own .... you don't want to expose the need of a C compiler than (unless it's tinycc). – Lothar Feb 12 '21 at 10:39
4

It seems you could just move the actual code to functions, instead of case labels. The byte code can then be trivially transformed into direct calls. I.e. byte code 1 would translate to CALL BC1. Since you're generating direct calls, you don't have the overhead of function pointers. The pipelines of most CPU's can follow such unconditional direct branches.

As a result, the actual implementations of each byte code are optimized, and the conversion from byte code to machince code is a trivial 1:1 conversion. You get a bit of code expansion since each CALL is 5 bytes (assuming x86-32) but that's unlikely to be a major problem.

MSalters
  • 173,980
  • 10
  • 155
  • 350
0

The best approach I found for this is to use a switch clause and then for each element of the switch clause you write the goto call.

lablcmd1:
// code
goto dispach;
. . .

dispatch:
op = NextOp();
switch(op){
    case CMD1:
      goto lablcmd1;
    // default: __assume(false); // maybe necessary, verify
}

Manual work but doesn’t use assembly and looks like compatibility is good with MSVC.

eri0o
  • 2,285
  • 4
  • 27
  • 43