
I recently read the question here Why is it faster to process a sorted array than an unsorted array? and found the answer absolutely fascinating; it has completely changed my outlook on programming when dealing with branches that depend on data.

I currently have a fairly basic, but fully functioning, interpreted Intel 8080 emulator written in C. The heart of the operation is a 256-case switch table for handling each opcode. My initial thought was that this would obviously be the fastest method of working, as opcode encoding isn't consistent throughout the 8080 instruction set and decoding would add a lot of complexity, inconsistency and one-off cases. A switch-case table full of pre-processor macros is very neat and easy to maintain.
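For reference, the dispatch at the heart of it looks roughly like this (a trimmed-down sketch; the struct layout and handler bodies are illustrative, not my actual code):

```c
#include <stdint.h>

/* Illustrative 8080-style state - these names are placeholders,
 * not the actual emulator's definitions. */
typedef struct {
    uint8_t  a;          /* accumulator */
    uint16_t pc;         /* program counter */
    uint8_t  mem[65536];
} Cpu;

/* Execute one instruction: fetch the opcode byte and dispatch through
 * a single 256-case switch. Only a few opcodes are sketched here; the
 * real table covers all 256, mostly via pre-processor macros. */
static void step(Cpu *cpu) {
    uint8_t opcode = cpu->mem[cpu->pc++];
    switch (opcode) {
    case 0x00: /* NOP   */            break;
    case 0x3C: /* INR A */ cpu->a++;  break;
    case 0x3D: /* DCR A */ cpu->a--;  break;
    /* ... 253 more cases ... */
    default:   /* unimplemented */    break;
    }
}
```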

Unfortunately, after reading the aforementioned post it occurred to me that there's absolutely no way the branch predictor in my computer can predict the jumping for the switch case. Thus every time the switch-case is navigated the pipeline would have to be completely wiped, resulting in a several-cycle delay in what should otherwise be an incredibly quick program (there's not even so much as a multiplication in my code).

I'm sure most of you are thinking "Oh, the solution here is simple, move to dynamic recompilation". Yes, this does seem like it would cut out the majority of the switch-case and increase speed considerably. Unfortunately, my primary interest is emulating older 8-bit and 16-bit era consoles (the Intel 8080 here is only an example, as it's my simplest piece of emulated code) where keeping cycles and timing exact down to the individual instruction is important, as the video and sound must be processed based on these exact timings.

When dealing with this level of accuracy, performance becomes an issue even for older consoles (look at bsnes, for example). Is there any recourse, or is this simply a matter of fact when dealing with processors with long pipelines?

Fascia
  • FYI: I've found that using a computed goto in gcc is significantly faster than a large switch. – Vaughn Cato Jul 26 '12 at 11:46
  • Your question doesn't make completely clear to me whether or not you actually did a test to measure performance at all. The post you refer to is truly a beauty, but such information leads people to 'overreact' and solve performance issues that only caused 1% of the performance loss (or even make it worse than it was). Premature optimization is the root of all evil. – Bart Jul 26 '12 at 13:39
  • See also [Branch Prediction and the Performance of Interpreters - Don’t Trust Folklore](https://hal.inria.fr/hal-01100647/document) re: modern IT-TAGE predictors being able to handle a single "grand central dispatch" `switch` statement decently. See also [X86 prefetching optimizations: "computed goto" threaded code](https://stackoverflow.com/q/46321531) and [this Q&A about interpreters](https://stackoverflow.com/a/58400070/224132) - just as relevant for interpreting CPU emulators as for R or Python, as opposed to a JITing emulator. – Peter Cordes Jan 31 '22 at 10:44
  • Also: *the pipeline would have to be completely wiped* - not quite, [modern OoO exec CPUs with "fast recovery"](https://stackoverflow.com/questions/50984007/what-exactly-happens-when-a-skylake-cpu-mispredicts-a-branch) only need to flush instructions that came *after* the mispredicted branch. They can detect a mispredict and start recovering while some instructions from before the branch aren't complete, so mispredict penalties can overlap with cache misses if you're lucky (and if the branch isn't dependent on data from that cache miss load!). But yeah branch misses still suck in tight loops – Peter Cordes Jan 31 '22 at 10:51

4 Answers


On the contrary, switch statements are likely to be converted to jump tables, which means they perform possibly a few ifs (for range checking) and a single indirect jump. The ifs shouldn't cause a problem with branch prediction, because it is unlikely you will have a bad opcode. The jump is not so friendly with the pipeline, but in the end it's only one jump for the whole switch statement.

I don't believe you can convert a long switch statement of opcodes into any other form that would result in better performance. This is, of course, assuming your compiler is smart enough to convert it to a jump table. If not, you can do so manually.
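As a sketch, the manual version can be an array of handler function pointers indexed by the opcode byte (the handler names and the sparse initialization here are illustrative, not a complete table):

```c
#include <stdint.h>

/* A hand-built "jump table": 256 function pointers indexed by opcode.
 * This guarantees table dispatch even if the compiler wouldn't convert
 * the switch itself. State struct and handlers are illustrative. */
typedef struct { uint8_t a; uint16_t pc; uint8_t mem[65536]; } Cpu;
typedef void (*Handler)(Cpu *);

static void op_nop(Cpu *cpu)   { (void)cpu; }
static void op_inr_a(Cpu *cpu) { cpu->a++; }

static Handler handlers[256] = {
    [0x00] = op_nop,
    [0x3C] = op_inr_a,
    /* a real table fills in all 256 entries */
};

static void step(Cpu *cpu) {
    uint8_t opcode = cpu->mem[cpu->pc++];
    if (handlers[opcode])  /* guard needed only because this sketch is sparse */
        handlers[opcode](cpu);
}
```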

If in doubt, implement other methods and measure performance.

Edit

First of all, make sure you don't confuse branch prediction and branch target prediction.

Branch prediction works only on conditional branch statements: it decides whether a branch condition will fail or succeed. It has nothing to do with jump statements.

Branch target prediction, on the other hand, tries to guess where the jump will land.

So, your statement "there's no way the branch predictor can predict the jump" should be "there's no way the branch target predictor can predict the jump".

In your particular case, I don't think you can actually avoid this. If you had a very small set of operations, perhaps you could come up with a formula that covers all your operations, like those made in logic circuits. However, with an instruction set as big as a CPU's, even if it were RISC, the cost of that computation is much higher than the penalty of a single jump.

Shahbaz
  • Not on the contrary at all; if you read again, you'll see my issue is with the fact that there's no way the branch predictor can predict the jump, and thus the pipeline is empty for (I believe, on the latest Intel processors) 14 cycles. When executing millions of emulated instructions per second, this adds up; in fact, I believe it could be one of the biggest bottlenecks for the emulated CPU (as the instruction execution is fairly trivial). My question is: what, if any, options are there to get around this downtime? – Fascia Jul 26 '12 at 11:39
  • Thank you for your edit, I didn't realise there was a distinction between the mechanism behind whether it jumps and where it jumps; that's good to know. I have a feeling you're probably right that there are no options here, which is a shame because the downtime is a considerable percentage of the overall CPU time taken to execute a single emulated instruction. – Fascia Jul 26 '12 at 11:57
  • @fascia, unfortunately, decoding instructions _is_ a time-consuming operation. I can't figure out a way to search for an image, but even in the CPU, the opcode decoder usually takes a lot of space. That is, most of your CPU "volume" is actually decoding and only a small portion of it doing any computation. – Shahbaz Jul 26 '12 at 12:06
  • What happens if you have 3 cases: 0, 1000, 500000. How might the cpu handle this? – Mathew Kurian Oct 17 '14 at 06:19
  • @bluejamesbond, it's not the CPU handling those cases, it's the compiler. You can see discussions in [this question](http://stackoverflow.com/q/1837656/912144) or [here](http://www.megasolutions.net/c/switch-case-to-a-jump-table-63885.aspx). If the compiler cannot convert the switch case to a jump table, it may skip it, or it may partially do it. In your case, a particularly smart compiler may use `value % 3` as index to a jump table, but making sure no other value is accepted is still a problem. You could try searching how `gcc` for example does it, but I doubt it would be easy to find out. – Shahbaz Oct 17 '14 at 08:26
  • @Shahbaz Thanks! I did some examples and compiled it with gcc. It seems to use normal compare and jumps when I did some of these cases with random/large differences between cases. – Mathew Kurian Oct 17 '14 at 08:43
  • @bluejamesbond, it makes sense. There's always a tradeoff and gcc decides which one would be more beneficial (I assume you tested with `-O2` or `-O3`?). It would be interesting to test with 1, 2, 3, 4, 5 first to make sure conversion to jump table is actually done, and then test with 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1000 to see what happens. It _may_ be that gcc would have an `if` for 1000 and in the `else` part use a jump table, which would be interesting to know. – Shahbaz Oct 17 '14 at 08:50
  • I don't see anyone (not even myself) mentioning that modern fairly high-end CPUs have indirect branch predictors, often with path history. Such a branch predictor is very much like a BTB, but may predict multiple different targets for the same branch depending on the path through the code. If your code chunks are small enough, the path history may include several different passes through the instruction decode loop, and thus it may be able to pick up on multi-instruction sequences (e.g. CMP instructions are usually followed by a branch, and not immediately by an instruction that overwrites the flags set by the CMP). – Krazy Glew Dec 06 '16 at 00:25
  • Simple CPUs may not have path-oriented indirect branch predictors. – Krazy Glew Dec 06 '16 at 00:26

As the branches on your 256-way switch statement are densely packed the compiler will implement this as a jump table, so you're correct in that you'll trigger a single branch mispredict every time you pass through this code (as the indirect jump won't display any kind of predictable behaviour). The penalty associated with this will be around 15 clock cycles on a modern CPU (Sandy Bridge), or maybe up to 25 on older microarchitectures that lack a micro-op cache. A good reference for this sort of thing is "Software optimisation resources" on agner.org. Page 43 in "Optimizing software in C++" is a good place to start.

http://www.agner.org/optimize/?e=0,34

The only way you could avoid this penalty is by ensuring that the same instructions are executed regardless of the value of the opcode. This can often be done by using conditional moves (which add a data dependency, so are slower than a predictable branch) or otherwise looking for symmetry in your code paths. Considering what you're trying to do, this is probably not going to be possible, and if it were then it would almost certainly add an overhead greater than the 15-25 clock cycles for the mispredict.
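As a small illustration of the conditional-move idea (not emulator code, just the general shape that compilers typically turn into a cmov on x86):

```c
#include <stdint.h>

/* Branchless select: both inputs are computed unconditionally and the
 * result is chosen with arithmetic, so there is no branch to mispredict -
 * only a data dependency on cond. */
static int32_t select_i32(int cond, int32_t if_true, int32_t if_false) {
    int32_t mask = -(int32_t)(cond != 0);       /* all-ones or all-zeros */
    return (if_true & mask) | (if_false & ~mask);
}
```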

In summary, on a modern architecture there's not much you can do that'll be more efficient than a switch/case, and the cost of mispredicting a branch isn't as much as you might expect.

jleahy
  • Unfortunately, when dealing with emulation you can be (trying) to execute tens or even hundreds of millions of instructions per second. And if for each one there are 15 cycles of downtime for the pipeline, that really adds up to a considerable performance impact. – Fascia Jul 26 '12 at 11:59
  • There's no free lunch here. If you want to do one of several things, and it's completely unpredictable, you must either execute the code for every (likely) possibility or have a pipeline flush. The only alternative is to JIT-compile what you're trying to emulate into native code (which is how VMWare and other x86 emulators used to work before virtualisation). You cannot expect the processor to speculate execution of your op-code before it's read the op-code from memory. – jleahy Jul 26 '12 at 13:36

The indirect jump is probably the best thing to do for instruction decoding.

On older machines, like say the Intel P6 from 1997, the indirect jump would probably get a branch misprediction.

On modern machines, like say Intel Core i7, there is an indirect jump predictor that does a fairly good job of avoiding the branch misprediction.

But even on the older machines that do not have an indirect branch predictor, you can play a trick. This trick is (was), by the way, documented in the Intel Code Optimization Guide from way back in the Intel P6 days:

Instead of generating something that looks like

    loop:
       load reg := next_instruction_bits // or byte or word
       load reg2 := instruction_table[reg]
       jmp reg2                          // jump through the loaded handler address
    label_instruction_00h_ADD: ...
       jmp loop
    label_instruction_01h_SUB: ...
       jmp loop
    ...

generate the code as

    loop:
       load reg := next_instruction_bits // or byte or word
       load reg2 := instruction_table[reg]
       jmp reg2
    label_instruction_00h_ADD: ...
       load reg := next_instruction_bits // or byte or word
       load reg2 := instruction_table[reg]
       jmp reg2
    label_instruction_01h_SUB: ...
       load reg := next_instruction_bits // or byte or word
       load reg2 := instruction_table[reg]
       jmp reg2
    ...

i.e. replace the jump back to the top of the instruction fetch/decode/execute loop with a copy of that fetch/dispatch code at the end of each instruction handler.

It turns out that this has much better branch prediction, even in the absence of an indirect predictor. More precisely, a conditional, single-target, PC-indexed BTB will do quite a lot better on this latter, threaded, code than on the original with only a single copy of the indirect jump.
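In GNU C (gcc/clang) this threaded pattern can be written with the "labels as values" extension, so that each handler ends in its own copy of the dispatch; a toy sketch with a made-up two-instruction encoding, not real emulator code:

```c
#include <stdint.h>

/* GNU C only: &&label takes a label's address and goto* jumps to it.
 * Each handler ends with its own indirect jump, giving the predictor a
 * separate BTB entry per handler instead of one shared dispatch jump.
 * Toy encoding: 0x00 = NOP, 0x01 = INC A, anything else halts. */
#define DISPATCH() goto *table[*pc++]

static uint8_t run_threaded(const uint8_t *pc) {
    static void *table[256];
    for (int i = 0; i < 256; i++) table[i] = &&op_halt;
    table[0x00] = &&op_nop;
    table[0x01] = &&op_inc;

    uint8_t a = 0;
    DISPATCH();                   /* initial fetch/dispatch */
op_nop:          DISPATCH();      /* each handler re-dispatches itself */
op_inc:  a++;    DISPATCH();
op_halt: return a;
}
```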

Most instruction sets have special patterns - e.g. on Intel x86, a compare instruction is nearly always followed by a branch.

Good luck and have fun!

(In case you care, the instruction decoders used by instruction set simulators in industry nearly always do a tree of N-way jumps, or its data-driven dual: navigate a tree of N-way tables, with each entry in the tree pointing to other nodes or to a function to evaluate.

Oh, and perhaps I should mention: these tables, these switch statements or data structures, are generated by special purpose tools.

A tree of N-way jumps, because there are problems when the number of cases in the jump table gets very large - in the tool, mkIrecog (make instruction recognizer) that I wrote in the 1980s, I usually did jump tables up to 64K entries in size, i.e. jumping on 16 bits. The compilers of the time broke when the jump tables exceeded 16M in size (24 bits).

Data-driven, i.e. a tree of nodes pointing to other nodes, because (a) on older machines indirect jumps may not be predicted well, and (b) it turns out that much of the time there is common code between instructions - instead of taking a branch misprediction when jumping to the case per instruction, then executing common code, then switching again and getting a second mispredict, you do the common code with slightly different parameters (like how many bits of the instruction stream you consume, and where the next set of bits to branch on is).

I was very aggressive in mkIrecog, as I say allowing up to 32 bits to be used in a switch, although practical limitations nearly always stopped me at 16-24 bits. I remember that I often saw the first decode as a 16 or 18 bit switch (64K-256K entries), and all other decodes were much smaller, no bigger than 10 bits.

Hmm: I posted mkIrecog to Usenet back circa 1990. ftp://ftp.lf.net/pub/unix/programming/misc/mkIrecog.tar.gz You may be able to see the tables used, if you care. (Be kind: I was young then. I can't remember if this was Pascal or C. I have since rewritten it many times - although I have not yet rewritten it to use C++ bit vectors.)

Most of the other guys I know who do this sort of thing do things a byte at a time - i.e. an 8 bit, 256 way, branch or table lookup.)

Krazy Glew
  • For anyone interested, this technique is commonly known as "labels as values" and is supported in gcc and clang. – Chuu Dec 02 '19 at 17:33
  • FYI, jump threading isn't really necessary for modern IT-TAGE predictors (Intel since Haswell (2013), AMD since Zen 2), since they can distribute the history for a single branch across the whole predictor. See [Branch Prediction and the Performance of Interpreters - Don’t Trust Folklore](https://hal.inria.fr/hal-01100647/document). Doing it this way instead of a grand-central-dispatch branch doesn't *hurt*, though, and IDK if all relevant modern CPUs (like ARM server cores) have IT-TAGE or similar predictors. See also [this Q&A about interpreters](https://stackoverflow.com/a/58400070/224132) – Peter Cordes Jan 31 '22 at 10:37
  • Also Darek Mihocka [discusses this pattern in a 2008 article](http://www.emulators.com/docs/nx25_nostradamus.htm) in the context of an interpreting CPU emulator (like Bochs, which he worked on) and how it performed on various CPUs relevant in 2008. [X86 prefetching optimizations: "computed goto" threaded code](https://stackoverflow.com/q/46321531) summarizes that. – Peter Cordes Jan 31 '22 at 10:41

I thought I'd add something since no one mentioned it.

Granted, the indirect jump is likely to be the best option.

However, should you go with the N-compare way, there are two things that come to my mind:

First, instead of doing N equality compares, you could do log(N) inequality compares, testing your instructions based on their numerical opcode by dichotomy (or testing the number bit by bit if the value space is nearly full). This is a bit like a hash table: you implement a static tree to find the final element.
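A sketch of the dichotomy idea, with made-up opcode ranges (two inequality tests select among four handler groups instead of four equality tests):

```c
#include <stdint.h>

/* Binary-search dispatch: roughly log2(N) inequality compares instead
 * of N equality compares. The ranges and group numbers here are
 * illustrative, not taken from any real instruction set. */
static int classify(uint8_t op) {
    if (op < 0x40)
        return (op < 0x20) ? 0 : 1;   /* group 0: 0x00-0x1F, 1: 0x20-0x3F */
    else
        return (op < 0x80) ? 2 : 3;   /* group 2: 0x40-0x7F, 3: 0x80-0xFF */
}
```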

Second, you could run an analysis on the binary code you want to execute. You could even do that per binary, before execution, and runtime-patch your emulator. This analysis would build a histogram representing the frequency of instructions, and then you would organize your tests so that the most frequent instructions get predicted correctly.
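The analysis pass could be as simple as a frequency count over the binary (a naive sketch: it treats every byte as an opcode and ignores multi-byte instructions and embedded data):

```c
#include <stddef.h>
#include <stdint.h>

/* Count how often each opcode byte occurs in a ROM image, so the
 * comparison chain can test the most frequent opcodes first. */
static void opcode_histogram(const uint8_t *rom, size_t len,
                             uint32_t counts[256]) {
    for (int i = 0; i < 256; i++)
        counts[i] = 0;
    for (size_t i = 0; i < len; i++)
        counts[rom[i]]++;
}
```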

But I can't see this being faster than an average 15-cycle penalty, unless you have 99% MOVs and put an equality test for the MOV opcode before the other tests.