
I was writing code that looked like the following...

if(denominator == 0){
    return false;
}
int result = value / denominator;

... when I thought about branching behavior in the CPU.

[This answer](https://stackoverflow.com/a/11227902/620863) says that the CPU will try to guess which way a branch will go and head down that branch, only to stop if it discovers it guessed the branch incorrectly.

But if the CPU predicted the branch above incorrectly, it would divide by zero in the following instructions. This doesn't happen, though, and I was wondering why. Does the CPU actually execute the division by zero and wait to see if the branch is correct before doing anything, or can it tell that it shouldn't continue in these situations? What's going on?

Anne Quinn
  • Related: [Can branch prediction crash my program?](http://stackoverflow.com/q/24611405/464709) – Frédéric Hamidi Aug 03 '15 at 08:38
  • "only to stop if it discovers does it" does not mean it will execute it, it will stop and go back to the right branch without executing the wrong one , branch optimization can not(I have not read it anywhere but its pretty obvious) change the control flow of a program. – rahul tyagi Aug 03 '15 at 08:40
  • Hmmm... isn't the link provided by Frédéric more than just "related" but a duplicate? I'm hesitant to close-vote, since the answer is not actually that detailed, though. – DevSolar Aug 03 '15 at 08:44
  • @DevSolar, that's why I did not vote to close either :) – Frédéric Hamidi Aug 03 '15 at 08:46
  • @FrédéricHamidi: Well, *you* could and rely on peer review. If *I* did it, the question would be closed without a second opinion. Bane of the tag badge. :-D – DevSolar Aug 03 '15 at 08:47
  • Actually not a duplicate, because the link is to an IA-32 question. C++ ignores the underlying architecture and specifies high-level behavior. So even on an architecture where faults are visible on predicted branches, the compiler has to mask them. – MSalters Aug 03 '15 at 09:20

4 Answers


The CPU is free to do whatever it wants, when speculatively executing a branch based on a prediction. But it needs to do so in a way that's transparent to the user. So it may stage a "division by zero" fault, but this should be invisible if the branch prediction turns out wrong. By the same logic, it may stage writes to memory, but it may not actually commit them.

As a CPU designer, I wouldn't bother predicting past such a fault. That's probably not worth it. The fault probably means a bad prediction, and that will resolve itself soon enough.

This freedom is a good thing. Consider a simple std::accumulate loop. The branch predictor will correctly predict a lot of jumps (for (auto current = begin; current != end; ++current), which usually jumps back to the beginning of the loop), and there are a lot of memory reads which may potentially fault (sum += *current). But a CPU that refused to read a memory value until the previous branch had been resolved would be a lot slower. And yet a mispredicted jump at the end of the loop might very well cause a harmless memory fault, as the predicted branch tries to read past the buffer. This needs to be resolved without a visible fault.
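
As a minimal sketch of that loop (hand-written rather than the real std::accumulate implementation, and the sum_all name is made up for illustration):

#include <vector>

// On the last iteration the branch predictor typically guesses "jump back
// to the top" one more time, so the CPU may speculatively issue the read
// *current with current == end, an access past the buffer. Any staged
// fault must be discarded invisibly once the misprediction is detected.
int sum_all(const std::vector<int>& v) {
    int sum = 0;
    for (auto current = v.begin(); current != v.end(); ++current) {
        sum += *current;   // memory read that may be issued speculatively
    }
    return sum;
}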

MSalters
  • True, for a given definition of "visible". A speculative load may for example knock some lines out of the cache, or cause other micro-architectural effects that are not strictly visible, but still possible to detect if you know how. Some security attacks are actually based on that. – Leeor Aug 03 '15 at 15:32
  • "Visible" here specifically means "visible to C++", or "observable side effects" to use the Standard's term. I can imagine a CPU which allows turning such speculative execution on and off, so that highly security-conscious programs would turn it off and C++ compilers would leave it on. – MSalters Aug 04 '15 at 15:49

Not exactly. The system is not allowed to execute the instructions in the wrong branch even if it makes a bad guess; or, more exactly, if it does execute them, that must not be visible. The basics are:

  • there is a test somewhere in the machine code.
  • the processor loads its pipeline with instructions from one of the possible paths and may execute them internally - according to MSalters, some processors could even execute both paths (*)
  • if it made a good guess, fine: the following instructions have been preloaded into the processor cache or already executed, and all goes as fast as possible
  • if it made a wrong guess, it just has to discard everything and restart on the correct branch (see the annotated sketch below).

For the analogy with the referenced post: the train has to stop immediately at the junction if the switch was not in the correct position; it cannot go on to the next station on the wrong path. Or, if it cannot stop before that, no passengers may get on or off the train.

(*) Itanium processors were able to process many paths in parallel. Intel's logic was that they could build wide processors (which do a lot in parallel), but they were struggling with the effective instruction rate. By speculatively executing both branches, they used a lot of hardware (I think they could do it several levels deep, running 2^N branches), but it did help the apparent single-core speed, as in effect one HW unit always predicted the correct branch. Credit for this detail goes to MSalters.
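
To tie the steps above back to the question's code, here is the same pattern annotated with what a typical speculative, out-of-order CPU does at each point (a sketch only; the exact micro-architectural behavior varies by processor, and safe_divide is just an illustrative name):

bool safe_divide(int value, int denominator, int& result) {
    if (denominator == 0) {   // the "test somewhere in the machine code"
        return false;
    }
    // The divide below may be issued before the branch above is resolved.
    // Its result - and any staged divide-by-zero fault - is committed only
    // if the prediction turns out to be correct; on a misprediction the
    // pipeline is flushed and execution restarts on the correct path.
    result = value / denominator;
    return true;
}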

Serge Ballesta
  • @MSalters: Of course you are right! I've used your answer to fix mine. If you don't like that, just tell me and I'll delete it. – Serge Ballesta Aug 03 '15 at 09:30
  • As stated, that is not correct. IIRC the ARM manual explicitly states that instructions can actually execute and their result is later invalidated (i.e. ignored) if the instruction should not be taken. As for your analogy: once, when I was taking the train from work, it passed the junction on the incorrect track, then backed up to the junction and continued on the correct track (so the analogy holds even if your assumptions about trains are incorrect too). – skyking Aug 03 '15 at 13:45
  • @skyking ... and people who may have left the train in the meantime are forced to reboard the train. – John Dvorak Aug 03 '15 at 18:44
  • If I want to nitpick, the second bullet probably should state that one _or both_. I think Itanium did that. Intel's logic was that they can build wide processors (which do a lot in parallel) but they were struggling with the effective instruction rate. By speculatively executing both branches, they used a _lot_ of hardware (I think they could do it several levels deep, running 2^N branches) but it did help the apparent single core speed as it in effect always predicted the correct branch in one HW unit. – MSalters Aug 04 '15 at 15:53
  • @MSalters I've included your comment in the answer, saying that credits should go to you – Serge Ballesta Aug 04 '15 at 16:03

But if the CPU predicts the branch above incorrectly, it would divide by zero in the following instructions. This doesn't happen though, and I was wondering why?

It may well happen; the question, however, is: is it observable? Obviously, this speculative division by zero does not and should not "crash" the CPU, but then neither does a non-speculative division by zero. There is a long causal chain between the division by zero and your process exiting with an error message. It goes somewhat like this (on POSIX, x86):

  • The ALU or the microcode responsible for the division flags the division by zero as an error.
  • The interrupt descriptor #0 is loaded (int 0 signifies a division by zero error on x86).
  • A set of registers (including the current program counter) is pushed onto the stack. The corresponding cache lines may need to be fetched first from RAM.
  • The interrupt handler is executed (a piece of kernel code). It raises a SIGFPE signal in the current process.
  • Eventually, signal handling decides that the default action shall be taken (assuming you didn't install a handler), which is to display an error message and terminate the process.
  • This takes many additional steps (e.g. use of device drivers) until eventually there is a change observable by the user, namely some graphics output by memory-mapped I/O.

This is a lot of work compared to a simple, error-free division, and a lot of it could be executed speculatively: basically anything up to the actual mmap'ed I/O, or until the finite set of resources for speculative execution (e.g. shadow registers and temporary cache lines) is exhausted. The latter is likely to happen much, much sooner. In this case, the speculative branch needs to be suspended until it is clear whether it is actually taken and the changes should be committed (once the changes are written, the speculative execution resources can be released), or whether the changes should be discarded.

The important bit is: as long as none of the speculative execution state becomes visible to other threads, other speculative branches on the same thread, or other hardware (such as graphics), anything goes for optimization. However, realistically, MSalters is absolutely right that a CPU designer would not care to optimize for this use case. So it is equally my opinion that a real CPU will probably just suspend the speculative branch once the error flag is set. This costs at most a few cycles if the error is even legitimate, and even that is unlikely, because the pattern you described is common. Speculative execution past this point would only divert precious optimization resources from more important cases.

(In fact, the only processor exception I would want to make reasonably fast, were I a CPU designer, is a specific type of page fault, where the page is known and accessible, but the "present" flag is cleared, just because this happens commonly when virtual memory is used, and is not a true error. Even this case is not terribly important, though, because the disk access on swapping, or even just memory decompression, is typically much more expensive.)
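
As an aside, the user-visible half of this chain can be demonstrated directly. A minimal sketch, assuming POSIX on x86 (note that integer division by zero is undefined behavior in C++; on x86 Linux it reliably delivers SIGFPE, which is what makes this demo work):

#include <csignal>
#include <cstdlib>
#include <unistd.h>

// Step 4 of the chain above: the kernel raises SIGFPE in the process.
void on_fpe(int) {
    // Only async-signal-safe calls are legal here; write(2) is one of them.
    const char msg[] = "caught SIGFPE\n";
    write(STDOUT_FILENO, msg, sizeof msg - 1);
    _exit(EXIT_FAILURE);  // returning would re-execute the faulting divide
}

int main() {
    std::signal(SIGFPE, on_fpe);
    volatile int zero = 0;           // volatile stops the compiler from folding the divide
    volatile int result = 1 / zero;  // triggers the chain described above
    (void)result;
}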

Arne Vogel
  • I don't think the last paragraph is realistic. Even with a good SSD, swapping 4 KB costs you at least 50 us (20,000 random IOPS), whereas the branch prediction savings are typically several clock cycles (<1 ns). That's more than 4 orders of magnitude. IOW, speeding up swapping by 5 ns just isn't worth it. – MSalters Aug 04 '15 at 16:05

Division by zero is nothing really special. It is a condition that is handled by the ALU to yield some effect, such as assigning a special value to the quotient. It can also raise an exception if this exception type has been enabled.

Compare to the snippet

if (denominator == 0) {
    return false;
}
int result = value * denominator;

The multiply can be executed speculatively, then canceled without you knowing. Same for a division. No worries.
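
To make the "special value" case concrete: with IEEE 754 floating point, division by zero does not trap by default; it yields infinity and sets a sticky flag that can be polled through C++'s <cfenv> interface. A small sketch (strictly speaking, #pragma STDC FENV_ACCESS ON is required for well-defined flag access, which not all compilers honor):

#include <cfenv>
#include <cstdio>

int main() {
    std::feclearexcept(FE_ALL_EXCEPT);
    volatile double zero = 0.0;  // volatile keeps the compiler from folding the divide
    double q = 1.0 / zero;       // q becomes +inf; no signal is raised
    std::printf("q = %f, FE_DIVBYZERO set: %d\n",
                q, std::fetestexcept(FE_DIVBYZERO) != 0);
}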

  • Good point. On IEEE754, division by zero just gives you infinity, and that won't break the branch predictor at all. Subsequent operations on that value might hit microcode, though, and that might complicate things even further. – MSalters Aug 04 '15 at 16:14