I have recently been studying pipelined processor architecture, mainly in the context of Y86-64. I have just read about branch prediction and how, when a branch is mispredicted, the Fetch, Decode and Execute pipeline registers have to be flushed and the instruction at the correct branch target has to be fetched and processed instead.
I was wondering whether it is possible to design hardware with, say, 2 sets of pipeline registers, so that when it fetches a conditional branch it starts processing both outcomes in parallel: one set of registers is updated as if the branch will not be taken, and the other set as if it will be taken.
Admittedly, a problem arises if one or both of the paths lead to instructions that are themselves conditional branches, because then 2 sets are not sufficient. However, by the time the first branch reaches the Execute stage we will know which path is actually taken, so we can eliminate the wrong path and all of its sub-paths. Since it takes 3 clock cycles for the first branch instruction to get from the Fetch stage to the Execute stage, I would think that in the worst case (every fetched instruction is a branch, so the number of paths doubles each cycle) we would only need 2^3 = 8 sets of pipeline registers.
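To sanity-check that count, here is a small Python sketch of how I picture it. This is only my own toy model, not anything from the Y86-64 material: it assumes every fetched instruction is a conditional branch, that every live path forks in two each cycle, and that a branch is resolved `resolve_latency` cycles after it was fetched. The function name, its parameters and the alternating outcome pattern are all made up for illustration.

```python
def peak_speculative_paths(resolve_latency, cycles=20):
    """Peak number of live paths in a toy 'follow both sides' fetch model.

    Assumptions (hypothetical, for illustration only):
      - every fetched instruction is a conditional branch,
      - each live path forks into taken / not-taken every cycle,
      - a branch resolves resolve_latency cycles after it was fetched.
    """
    if resolve_latency < 1:
        raise ValueError("resolve_latency must be at least 1")

    # Each path is the tuple of guesses made for still-unresolved branches.
    paths = [()]
    peak = 1
    resolved = 0

    for _ in range(cycles):
        # Fetch: every live path hits a branch and splits in two.
        paths = [p + (guess,) for p in paths for guess in (0, 1)]
        peak = max(peak, len(paths))

        # Resolve: the oldest unresolved branch is now resolve_latency deep.
        if len(paths[0]) == resolve_latency:
            actual = resolved % 2  # arbitrary alternating outcome
            # Drop the wrong side and all of its sub-paths.
            paths = [p[1:] for p in paths if p[0] == actual]
            resolved += 1

    return peak


if __name__ == "__main__":
    print(peak_speculative_paths(3))  # 8
```

With a latency of 3 cycles this gives 8, which matches 2^3. In this model the count keeps returning to that peak, because each resolution halves the live paths right before the next fetch doubles them again.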
Apart from this being somewhat difficult to implement in hardware, is there anything wrong with my assumption that this approach would work? Or is this already being done in more sophisticated architectures, such as x86-64?
Thanks.