4

Consider a function that calls another function and checks for error. Assume that the function CheckError() returns 0 on failure, and other numbers indicates success.

First version: taken branch for success or fall through to the error processing code (which is in the middle of the function).

    CALL   CheckError
    TEST   EAX,EAX    ;check if return value is 0
    JNZ    Normal
ErrorProcessing:
    ...    ;some error processing code here
Normal:
    ...    ;some usual code here

Second version taken branch on error, or fall through to the normal path. The error processing code is at the end of the function.

    CALL   CheckError
    TEST   EAX,EAX
    JZ     ErrorProcessing
Normal:
    ...    ;some usual code here
ErrorProcessing:
    ...    ;some error processing code here

Which one of these two methods is better? Why?

Personally, I think first code has better code structure (more readable and programmable), because the code is compact. However, I also think that the second code has better speed normally (in the no-error case), because a not-taken conditional jump takes 2-3 clock cycles (maybe I'm too picky here) less than a taken one.

Anyway, I found that all compilers I tested use the first model when it compilers an if statement. For instance:

if (GetActiveWindow() == NULL)
{
    printf("Error: can't get window's handle.\n");
    return -1;
}
printf("Succeed.\n");
return 0;

This should compile to (without any exe entry routine):

    CALL [GetActiveWindow]    ;if (GetActiveWindow() == NULL)
    TEST EAX,EAX
    JNZ CodeSucceed
                             ;printf("Error.......\n"); return -1
    PUSH OFFSET "Error.........\n"
    CALL [Printf]
    ADD ESP,4
    OR EAX,0FFFFFFFFH
    JMP Exit

CodeSucceed:                 ;printf("Succeed.\n"); return 0
    PUSH OFFSET "Succeed.\n"
    CALL [Printf]
    ADD ESP,4
    XOR EAX,EAX
Exit:
    RETN
Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
J.Smith
  • 111
  • 8

1 Answers1

10

In terms of cycle counting on the conditional jump itself, which way you structure the code makes absolutely no difference. The only thing that matters anymore is whether the branch is predicted correctly. If it is, the branch costs zero cycles. If it is not, the branch costs tens or maybe even hundreds of cycles. The prediction logic in the hardware doesn't depend on which way the code is structured, and you have basically no control over it (CPU designers have experimented with "hints" but they turn out to be a net lose) (but see "Why is it faster to process a sorted array than an unsorted array?" for how high-level algorithmic decisions can make a huge difference).

However, there's another factor to consider: "hotness". If the "error processing" code will almost never actually get used, it is better to move it out of line — way out of line, to its own subsection of the executable image — so that it does not waste space in the I-cache. Making accurate decisions about when to do that is one of the most valuable benefits of profile-guided optimization — I'd guess second only to deciding on a per-function or even per-basic-block basis whether to optimize for space or speed.

Readability should be a primary concern when writing assembly by hand only if you are doing it as a learning exercise, or to implement something that can't be implemented in a higher level language (e.g. the guts of a context switch). If you are doing it because you need to squeeze cycles out of a critical inner loop, and it doesn't come out unreadable, you've probably got more cycle-squeezing to do.

zwol
  • 135,547
  • 38
  • 252
  • 361
  • 5
    Not-take branches are actually faster, even with perfect prediction. A taken branch can reduce front-end throughput if it's not the last instruction decoded in a group of 4 (because the instructions after it aren't useful), and there's also the I-cache spatial locality issue you mentioned. Also, Intel Haswell-family can execute predicted-not-taken conditional branches on either of two ports, but predicted-taken branches only on port 6. See [Agner Fog's microarch pdf and insn tables](http://agner.org/optimize/), and other stuff in the [x86 tag wiki](http://stackoverflow.com/tags/x86/info) – Peter Cordes Aug 23 '16 at 04:27
  • 2
    **TL:DR**: prefer not-taken branches on the fast path in asm, or compile with profile-guided optimization. Or use [`likely` / `unlikely` macros for GNU C builtins](http://stackoverflow.com/questions/109710/likely-unlikely-macros-in-the-linux-kernel-how-do-they-work-whats-their) to tell the compiler that a condition is almost always one way. Without that, your compiler doesn't know that 0-return means error, so you shouldn't infer anything from how it lays out the branches. – Peter Cordes Aug 23 '16 at 04:30
  • Thank you @zwol, Peter Cordes those really give me lots help on this. I will consider second version more often now when I use assembly. – J.Smith Aug 23 '16 at 16:12
  • 1
    @J.Smith: You could have notified both of us by writing `zwol and @Peter`... zwol already gets notified because you're commenting under their post. I just happened to look at this again and notice your comment. Anyway, welcome to Stack Overflow, thanks for asking an interesting question (unlike many new users). – Peter Cordes Aug 23 '16 at 18:47
  • Actually, "Agner Fog's optimization guide also recommend Second method for speed optimize. For spec, see Page.68 on [link](http://www.agner.org/optimize/optimizing_assembly.pdf) – J.Smith Aug 23 '16 at 20:52
  • @PeterCordes: Hello, I actually got one more question to ask you: What if it is normal codes(just normal condition codes, assume that it sometimes get run) instead of error processing code in my question above? Will this make a different in how to optimize the code? – J.Smith Aug 27 '16 at 16:33
  • 1
    @J.Smith: All that matters is making the common case fast. It doesn't matter why the uncommon case is uncommon, just that it's uncommon. This is exactly what Agner Fog says on page 68 of his Optimizing Assembly guide. If both cases are equally common and don't follow a pattern, it might be worth looking at a branchless option if that can be done efficiently. (see page 70). – Peter Cordes Aug 27 '16 at 20:51
  • Thanks, I got it.@PeterCordes – J.Smith Aug 30 '16 at 20:56