Does it make sense to implement branch prediction optimizations in my own VM interpreter, or is it enough to run the VM on hardware that already supports branch prediction?
- Are you talking about an emulator or interpreter that itself runs on hardware? Your question doesn't make sense for hardware virtualization (VM exits aren't branch-predicted, and the CPU usually executes the guest machine code directly). But if you do mean an interpreting emulator, like CPython or BOCHS, then you do potentially need to care about branch prediction in the CPU that will run your code. Recent hardware (like Intel since Haswell) is finally not bad for that: [Branch Prediction and the Performance of Interpreters - Don’t Trust Folklore](https://hal.inria.fr/hal-01100647/document) – Peter Cordes Jun 20 '20 at 23:49
- @PeterCordes I am talking about the interpreter. For example, would it make sense to implement branch prediction in EVM (Ethereum VM)? – k06a Jun 21 '20 at 10:44
2 Answers
It could make sense, but only in a limited way.
For example, in a JIT compiler, when generating assembly you may decide to lay out code based on the observed branch probabilities. This only needs a very simple type of predictor that knows the overall probability but doesn't need to recognize any patterns. If you did recognize patterns you could do more sophisticated optimizations: e.g. a loop with an embedded branch that alternates every iteration could be unrolled 2x, with each copy of the body specialized for its known direction.
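As a minimal C sketch of that unrolling idea (function and variable names here are purely illustrative, not from any real JIT), assume the JIT has observed that the loop's branch alternates every iteration:

```c
#include <stddef.h>

/* Original shape: the branch direction flips every iteration,
 * which is hard for simple predictors. */
void sum_alternating(const int *a, size_t n, long *even_sum, long *odd_sum) {
    for (size_t i = 0; i < n; i++) {
        if (i & 1)
            *odd_sum += a[i];
        else
            *even_sum += a[i];
    }
}

/* Layout a JIT might emit after observing the alternating pattern:
 * unroll 2x and specialize each copy of the body for its known
 * direction, so the unpredictable branch disappears entirely. */
void sum_alternating_unrolled(const int *a, size_t n, long *even_sum, long *odd_sum) {
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        *even_sum += a[i];       /* even iteration: "else" path, no test */
        *odd_sum  += a[i + 1];   /* odd iteration: "then" path, no test */
    }
    if (i < n)                   /* leftover iteration when n is odd */
        *even_sum += a[i];
}
```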
For an interpreter it seems a bit less useful, but one can imagine a sophisticated design that fuses adjacent instructions into a single operation for efficiency, and that fusion might be guided by something like branch prediction. Similarly, an interpreter might benefit from recognizing loops.
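A rough sketch of the fusion idea, with hypothetical opcode names (not from any real VM): a peephole pass could rewrite a PUSH immediately followed by an ADD into a single fused "superinstruction", so the interpreter dispatches once instead of twice.

```c
#include <stdint.h>
#include <stddef.h>

enum { OP_PUSH, OP_ADD, OP_HALT, OP_ADD_CONST /* fused superinstruction */ };

/* Copy `code` into `out`, fusing PUSH+ADD pairs; returns the new length.
 * OP_PUSH takes one immediate operand byte. */
size_t fuse_push_add(const uint8_t *code, size_t len, uint8_t *out) {
    size_t i = 0, o = 0;
    while (i < len) {
        if (code[i] == OP_PUSH && i + 2 < len && code[i + 2] == OP_ADD) {
            out[o++] = OP_ADD_CONST;   /* one dispatch instead of two */
            out[o++] = code[i + 1];    /* keep the immediate operand */
            i += 3;
        } else if (code[i] == OP_PUSH && i + 1 < len) {
            out[o++] = code[i++];      /* copy opcode + operand */
            out[o++] = code[i++];
        } else {
            out[o++] = code[i++];
        }
    }
    return o;
}
```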

Apparently you're talking about a VM that interprets bytecode, not hardware virtualization of a CPU.
Implement how? Branch prediction in CPUs is only needed because they're pipelined, and for speculative out-of-order execution.
None of those things makes sense for interpreter software if implementing it would just create more work. Software pipelining can be worth it for loops over arrays to hide load and ALU latency, especially on older in-order CPUs, but that doesn't increase the total number of instructions to be run. If you don't know for sure what needs to be done next, leave the speculation to hardware OoO exec.
Note that for a pure non-JITing interpreter, control dependencies in the guest code become data dependencies in the interpreter, while a sequence of different instructions in the guest creates a control dependency in the interpreter (to dispatch to handler functions). See *How exactly R is affected by Branch Prediction?*
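A minimal switch-dispatch interpreter (with hypothetical opcodes) makes that concrete: the guest's "what runs next" arrives as data (the next opcode byte), and the interpreter's own control dependency is the dispatch on that byte. A guest branch like the `OP_JNZ` below is resolved purely from loaded data.

```c
#include <stdint.h>
#include <stdio.h>

enum { OP_PUSH, OP_ADD, OP_JNZ, OP_HALT };

long run(const uint8_t *code) {
    long stack[64];
    int sp = 0;
    size_t pc = 0;
    for (;;) {
        uint8_t op = code[pc++];          /* data dependency on guest code */
        switch (op) {                     /* interpreter's control dependency */
        case OP_PUSH: stack[sp++] = (int8_t)code[pc++]; break;
        case OP_ADD:  sp--; stack[sp - 1] += stack[sp]; break;
        case OP_JNZ: {                    /* guest branch, resolved from data */
            int8_t off = (int8_t)code[pc++];
            if (stack[--sp] != 0) pc += off;
            break;
        }
        case OP_HALT: return stack[sp - 1];
        }
    }
}

int main(void) {
    /* push 2, push 3, add, halt  ->  prints 5 */
    const uint8_t prog[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_HALT };
    printf("%ld\n", run(prog));
    return 0;
}
```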
You do potentially need to care about branch prediction in the CPU that will run your code. Recent CPUs (Intel since Haswell, for example) are finally not bad at that, using IT-TAGE predictors: *Branch Prediction and the Performance of Interpreters - Don’t Trust Folklore*.
You don't implement branch prediction in software, but for older CPUs it was worth tuning interpreters with hardware branch prediction in mind. *X86 prefetching optimizations: "computed goto" threaded code* has some links, especially an article by Darek Mihocka discussing how badly it sucks on older CPUs (which were current when the article was written) to have one "grand central" dispatch branch, like a single switch that every instruction-handler function returns to. That means the entire pattern of which instruction tends to follow which other instruction has to be predicted for that single branch, and without something like IT-TAGE the prediction state for a single branch is very limited.
Tuning for older CPUs can involve putting dispatch to the next instruction at the end of each handler function, instead of returning to a single dispatch loop. But again, that's not implementing branch prediction, that's tuning for it.
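As a sketch of that tuning, using the GCC/Clang "labels as values" (computed goto) extension and hypothetical opcodes: each handler ends with its own indirect dispatch, so the hardware predictor gets separate history per handler instead of a single grand-central branch.

```c
#include <stdint.h>
#include <stdio.h>

enum { OP_PUSH, OP_ADD, OP_HALT };

long run_threaded(const uint8_t *code) {
    /* GCC/Clang extension: table of label addresses, one per opcode. */
    static void *handlers[] = { &&do_push, &&do_add, &&do_halt };
    long stack[64];
    int sp = 0;
    size_t pc = 0;

/* Each use of DISPATCH() is a separate indirect branch in the binary,
 * so each handler's "what usually comes next" is predicted separately. */
#define DISPATCH() goto *handlers[code[pc++]]

    DISPATCH();

do_push:
    stack[sp++] = (int8_t)code[pc++];
    DISPATCH();
do_add:
    sp--;
    stack[sp - 1] += stack[sp];
    DISPATCH();
do_halt:
    return stack[sp - 1];
}

int main(void) {
    const uint8_t prog[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_HALT };
    printf("%ld\n", run_threaded(prog));   /* prints 5 */
    return 0;
}
```

With a plain `switch` in a loop, all handlers funnel back through one indirect branch; the threaded version above is the "dispatch at the end of each handler" layout the paragraph describes.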
