
So I have recently been studying pipelined processor architecture, mainly in the context of Y86-64. I have just read about branch prediction and how, in the case of a mispredicted branch, the Fetch, Decode, and Execute pipeline registers have to be flushed and the instructions from the correct branch target have to be processed instead.

I was wondering if it is possible to design hardware with, say, 2 sets of pipeline registers, so that when it fetches a conditional instruction it starts processing both outcomes in parallel: updating one set of registers as if the branch will not be taken, and the other set as if it will be taken.

Admittedly, a problem arises if one or both of the paths in turn lead to instructions that are themselves branches; then 2 sets are not sufficient. But by the time the first branch condition reaches the Execute stage, we know which path to actually take, so we can eliminate the wrong path and all of its sub-paths. And since it takes 3 clock cycles for the first branch instruction to get from the Fetch to the Execute stage, I would think that in the worst case we would only need 2^3 = 8 sets of pipeline registers.
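To check my worst-case count, here is a tiny sketch of the bookkeeping (the function name and the 3-cycle fetch-to-execute latency are my assumptions about a Y86-64-style 5-stage pipeline, not anything from a real design):

```python
# Worst case: every in-flight instruction between Fetch and Execute is a
# conditional branch, so the number of live paths doubles each cycle
# until the oldest branch resolves in the Execute stage.
FETCH_TO_EXECUTE_CYCLES = 3  # Fetch -> Decode -> Execute

def max_live_paths(unresolved_cycles: int) -> int:
    """Pipeline-register sets needed if every fetched instruction is a
    branch and none of them has resolved yet."""
    paths = 1
    for _ in range(unresolved_cycles):
        paths *= 2  # each live path may fork at its own branch
    return paths

print(max_live_paths(FETCH_TO_EXECUTE_CYCLES))  # 8
```

Once the oldest branch resolves, half of the live paths (and all their sub-paths) are discarded, so the count never grows past this bound.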

Besides this being a little difficult to implement in hardware, is there anything wrong with my assumption that this approach would work? Or is this already done in more sophisticated architectures like x86-64?

Thanks.

  • I thought about this too. I suppose it's difficult since the decoder is a significant chunk of the CPU logic, and duplicating it would take a lot of die space. – fuz Apr 23 '20 at 18:51
  • @fuz as you said, this causes an obvious space issue, but is there anything other than space constraint stopping this from working out? – dkapur17 Apr 23 '20 at 19:01
  • 2
    @dkapur17: How much of the CPU's resources would be wasted when there's no branch? Possible answer are "it can do both outcomes at full speed, so half the CPU's resources are wasted when there's no branch" (where multi-core would be better for performance/utilization of CPU's resources); "it can do both outcomes at reduced speed, so less than half of CPU's resources are wasted when there's no branch" (where SMT would be better for performance) and "it can do both outcomes at half speed, so none of CPU's resources are wasted" (where there's no benefit at all). – Brendan Apr 23 '20 at 19:50
  • @Brendan, yes... That seems to be a valid point! – dkapur17 Apr 23 '20 at 19:56
  • Related: [Why not just predict both branches?](https://stackoverflow.com/a/49637475). But really the thing to keep in mind is what *else* could you have spent that die-area and power on. e.g. 4-wide superscalar / out-of-order exec, and a good branch predictor. See [Modern Microprocessors A 90-Minute Guide!](http://www.lighterra.com/papers/modernmicroprocessors/) You basically have 8 pipelines, most of an 8-core CPU (minus interconnects and data cache coherency... and with 8 slow scalar cores). If they're truly independent, instruction-fetch / I-cache read ports become an even bigger problem. – Peter Cordes Apr 23 '20 at 20:22

1 Answer


As far as RISC vs. CISC architectures go, the latter tried techniques roughly like what you suggest around the late 1980s/early 1990s, as I recall. Wikipedia has no article on branch prediction analysis; the search redirects to a section of the RSA (encryption) article, which describes a side-channel technique that exploits the branch predictor to help recover a private encryption key. It also mentions simultaneous multithreading as a way to speed up branch prediction.

To more directly address your question, see the details section of the Wikipedia article on simultaneous multithreading. Generally, this seems to be an area of ongoing research and disagreement.

wallyk
    That seems interesting. I'll be sure to give it a read. Thanks! – dkapur17 Apr 23 '20 at 19:08
  • The branch prediction side-channel is a timing attack against predictors that *do* pick one way, and thus are slower when they picked wrong. The OP's proposed design would defeat that, but so would ordinary branchless code. (Avoiding *data dependent* branches, I mean. You still need loops dependent on the key size and so on.) – Peter Cordes Apr 23 '20 at 20:26
  • @dkapur17: SMT (e.g. hyperthreading) mitigates the throughput cost of all stalls (by keeping the pipeline supplied with other work to do from another thread). It's somewhat related to this hardware branching idea in that you're running code from 2 program counters with replicated register files, but SMT lets them be truly independent: the core looks like two CPUs to the OS. – Peter Cordes Apr 23 '20 at 20:28
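
To illustrate the "ordinary branchless code" mentioned in the comment above: the idea is to replace a data-dependent branch with arithmetic, so execution takes the same path regardless of the secret value. A minimal sketch (the function names are mine, and a real constant-time implementation would be written in assembly or carefully-checked C, not Python):

```python
def select_branchy(secret_bit: int, a: int, b: int) -> int:
    # A data-dependent branch: which path executes depends on the
    # secret, which is exactly what a branch-predictor side channel
    # can observe through timing.
    if secret_bit:
        return a
    return b

def select_branchless(secret_bit: int, a: int, b: int) -> int:
    # Branchless selection: build an all-ones or all-zeros mask from
    # the bit and blend the two values, with no data-dependent branch.
    mask = -(secret_bit & 1)          # bit 1 -> ...1111, bit 0 -> 0
    return (a & mask) | (b & ~mask)
```

Both functions compute the same result, but the second one executes the same instructions either way, so there is nothing for the predictor to mispredict.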