
I'm running my Z80 emulator both in Chrome and in Node. I get about 10x the performance in Chrome that I do in Node. (100k Z80 instructions take 6 ms in Chrome and 60 ms in Node.) I've run the profiler:

% node --prof index.js
% node --prof-process isolate-0x108000000-25550-v8.log

and it says that 95% of the time is spent in C++:

[Summary]:
  ticks  total  nonlib   name
   103    3.8%    3.8%  JavaScript
  2604   95.2%   95.8%  C++
     6    0.2%    0.2%  GC
    17    0.6%          Shared libraries
    12    0.4%          Unaccounted

The C++ breakdown is:

[C++ entry points]:
  ticks    cpp   total   name
  2127   98.3%   77.7%  T __ZN2v88internal40Builtin_CallSitePrototypeGetPromiseIndexEiPmPNS0_7IsolateE
    32    1.5%    1.2%  T __ZN2v88internal21Builtin_HandleApiCallEiPmPNS0_7IsolateE

I've tracked down CallSitePrototypeGetPromiseIndex to this source file. I'm not using promises, async, or await in my code. My test is just a tight loop of 100k emulated Z80 instructions, no I/O or anything.

I've found others online using the --prof flag and none are finding this in their results. Is it a side-effect of profiling? Am I triggering promises somehow inside the loop? Any reason Node should be this much slower than Chrome?

Details: Node v12.13.1, Chrome 79.0.3945.88.
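For what it's worth, a measurement like the one quoted above (100k instructions in a tight loop, no I/O) can be reproduced with a harness along these lines. This is a hypothetical sketch, not the actual benchmark; `step()` stands in for executing one emulated instruction:

```javascript
// Hypothetical micro-benchmark: time a tight loop of N emulated
// instructions. step() is a stand-in for the emulator's single-step.
function step() { /* execute one emulated instruction */ }

const N = 100000;
const start = process.hrtime.bigint();
for (let i = 0; i < N; i++) step();
const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
console.log(`${N} instructions in ${elapsedMs.toFixed(2)} ms`);
```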

Lawrence Kesteloot
  • There are things that create `Promise`s that are not immediately obvious. `setTimeout`, for example, returns one. – Holli Jan 02 '20 at 01:23
  • @Holli I indeed was using `setTimeout`, but turning that off doesn't change the profiling or timing results. – Lawrence Kesteloot Jan 02 '20 at 02:06
  • Everything that isn't pure compute (e.g. disk access, network access) will use promises under the hood. The source file in your link looks like it's used for generating stack traces – is it possible that your code generates exceptions and Chrome is somehow generating them more efficiently? – root Jan 02 '20 at 05:25
  • @root everything is compute-only with no exceptions. But I found the problem; I'll write up a proper answer below. – Lawrence Kesteloot Jan 02 '20 at 21:43

1 Answer


Okay, this surprisingly similar question had a great answer by Esailija pointing me to this line in the V8 source code. It limits optimization of switch statements to those under a certain number of cases. The first thing my emulator does is dispatch on the opcode through a 256-entry switch. In my test I'm only feeding it 0 (NOP), so it was safe to comment out huge chunks of the cases. It turns out that if I comment out 13 of the cases, performance jumps by a factor of 25! If I comment out only 12, I get the slow performance.
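For context, the dispatch looked roughly like this. This is a hypothetical sketch of the shape of the code, not the actual emulator; the opcode implementations and cycle counting are illustrative only:

```javascript
// Hypothetical sketch of a one-big-switch opcode dispatch: all 256
// cases live in a single function body, which is what tripped V8's
// switch-size heuristic described above.
function step(state, opcode) {
  switch (opcode) {
    case 0x00: // NOP
      break;
    case 0x01: // LD BC,nn (illustrative: low byte first)
      state.c = state.mem[state.pc++];
      state.b = state.mem[state.pc++];
      break;
    // ... ~250 more cases ...
    default:
      throw new Error("unimplemented opcode " + opcode);
  }
  state.cycles += 4; // simplified: real opcodes have varying cycle counts
  return state;
}
```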

The link into the V8 source code above is pretty old (2013), so I tried to find the modern equivalent. I didn't find a hard limit, but I did find several heuristics that choose between table-based and tree-based (binary search) lookups (ia32, x86). When I plug in my numbers I don't quite land on a borderline case at the point where I found it, so I'm not sure whether this is the actual cause or whether some other optimization isn't being triggered elsewhere.

As for the difference from Chrome: Node v12 and Chrome 79 ship different V8 versions, so there's probably some subtle difference in when and how they decide to optimize their switches.

I'm not sure what the best solution is here, but clearly I need to avoid large switch statements. I'll either use a sequence of smaller switch statements or replace the whole thing with an array of functions.

Update: I used an array of functions and my entire program sped up by a factor of 25.
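The array-of-functions approach looks roughly like this. Again a hypothetical sketch, not the actual emulator code; the idea is that each handler is a small function V8 can optimize individually, instead of one giant switch body:

```javascript
// Hypothetical sketch of opcode dispatch via an array of functions,
// indexed by the opcode byte.
const handlers = new Array(256).fill((state) => {
  throw new Error("unimplemented opcode");
});
handlers[0x00] = (state) => { /* NOP */ };
handlers[0x01] = (state) => { // LD BC,nn (illustrative: low byte first)
  state.c = state.mem[state.pc++];
  state.b = state.mem[state.pc++];
};

function step(state) {
  const opcode = state.mem[state.pc++];
  handlers[opcode](state);
}
```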

Lawrence Kesteloot