75

I've found that != and == are not the fastest ways to test for zero or non-zero.

bool nonZero1 = integer != 0;
xor eax, eax
test ecx, ecx
setne al

bool nonZero2 = integer < 0 || integer > 0;
test ecx, ecx
setne al

bool zero1 = integer == 0;
xor eax, eax
test ecx, ecx
sete al

bool zero2 = !(integer < 0 || integer > 0);
test ecx, ecx
sete al

Compiler: VC++ 11 Optimization flags: /O2 /GL /LTCG

This is the assembly output for x86-32. The second version of each comparison was ~12% faster on both x86-32 and x86-64. However, on x86-64 the generated instructions were identical (the first versions compiled to exactly the same code as the second versions), yet the second versions were still faster.

  1. Why doesn't the compiler generate the faster version on x86-32?
  2. Why are the second versions still faster on x86-64 when the assembly output is identical?

EDIT: I've added benchmarking code. ZERO: 1544 ms, 1358 ms; NON_ZERO: 1544 ms, 1358 ms. http://pastebin.com/m7ZSUrcP or http://anonymouse.org/cgi-bin/anon-www.cgi/http://pastebin.com/m7ZSUrcP

Note: It's probably inconvenient to locate these functions when everything is compiled in a single source file, because main.asm gets quite big. I had zero1, zero2, nonZero1, nonZero2 in a separate source file.

EDIT2: Could someone with both VC++11 and VC++2010 installed run the benchmarking code and post the timings? It might indeed be a bug in VC++11.

NFRCR
  • Would you provide the complete program you are using to benchmark the performance? – James McNellis May 31 '12 at 17:56
  • So how does it guarantee the rest of eax is zero if it's just skipping the xor? – harold May 31 '12 at 17:57
  • Where are the `xor` instructions coming from? They don't look relevant to the test, so it should be part of the surrounding code. – Mark Ransom May 31 '12 at 17:58
  • What happens if you change the order? The compiler is smart enough to know that it has `xor`:ed `eax` *before* the first test and that that remains valid for the next... – Andreas Magnusson May 31 '12 at 18:02
  • NFRCR, did you really benchmark that as linear code? I assumed you just pasted them together to keep the size of the post down. – harold May 31 '12 at 18:08
  • I've added benchmarking code now. – NFRCR May 31 '12 at 18:18
  • "I had zero1, zero2, nonZero1, nonZero2 in a separate compilation unit."... the most important optimization by far is inlining, and you prevented it? No one cares about the performance of a tiny function that compares a number to zero. They care how they should write that comparison inside a larger function (call inline function, use `!=`, use `> || <`). Or else at least introduce a virtual call, which is the only realistic situation where this could happen as a standalone function. – Ben Voigt May 31 '12 at 18:25
  • Plus, cross-TU inlining is very common and the compiler can perfectly well do it for simple functions. – Puppy May 31 '12 at 18:27
  • @Ben Voigt. Take a look at the pastebin link. It's all in a single source file now. And performs the same way. – NFRCR May 31 '12 at 18:27
  • @Ben Voigt. And it did inline at first as well. I actually turned off inlining and watched the timings go up 3x. So inlining for sure did work. Why would a separate source file prevent inlining? I had /LTCG. – NFRCR May 31 '12 at 18:37
  • @NFRCR: Showing all the optimization flags is appreciated ;) – Ben Voigt May 31 '12 at 18:38
  • @NFRCR: My bloody ISP has blocked pastebin ATM for some unknown reason. Could you post the code or let me know what vc11 is doing with the sources I posted? I don't have a vc11 compiler handy to compare stuff :/ – dirkgently May 31 '12 at 18:54
  • @dirkgently Try: http://anonymouse.org/cgi-bin/anon-www.cgi/http://pastebin.com/m7ZSUrcP – NFRCR May 31 '12 at 18:57
  • Ok, I've taken a look at this with the Visual C++ 2012 RC. I've compiled your code with /O2 and /GL and I've gotten the same x86 assembly dump for each function. Benchmark1 is _consistently_ 20% faster than Benchmark2 on my Core i5 540M (~1433 clocks vs. ~1800 clocks). I get near identical results for x86 and x64. So, I wonder if the generated code is better for some CPUs and worse for others? (I am not a performance guru, so I don't know, but I figured I'd at least take a look and report my findings.) – James McNellis Jun 01 '12 at 06:23

2 Answers

123

This is a great question, but I think you've fallen victim to the compiler's dependency analysis.

The compiler only has to clear the high bits of eax once, and they remain clear for the second version. The second version would have to pay the price of the xor eax, eax, except that the compiler's analysis proved it had been left cleared by the first version.

The second version is able to "cheat" by taking advantage of work the compiler did in the first version.

How are you measuring times? Is it "(version one, followed by version two) in a loop", or "(version one in a loop) followed by (version two in a loop)"?

Don't do both tests in the same program (instead, recompile for each version), or if you do, test both orders ("version A first" and "version B first") and see whether whichever comes first pays a penalty.


Illustration of the cheating:

timer1.start();
double x1 = 2 * sqrt(n + 37 * y + exp(z));
timer1.stop();
timer2.start();
double x2 = 31 * sqrt(n + 37 * y + exp(z));
timer2.stop();

If timer2's duration is less than timer1's duration, we don't conclude that multiplying by 31 is faster than multiplying by 2. Instead, we realize that the compiler performed common subexpression elimination, and the code became:

timer1.start();
double common = sqrt(n + 37 * y + exp(z));
double x1 = 2 * common;
timer1.stop();
timer2.start();
double x2 = 31 * common;
timer2.stop();

And the only thing proved is that multiplying by 31 is faster than computing common. Which is hardly surprising: multiplication is far, far faster than sqrt and exp.

Ben Voigt
  • Benchmarking code added. I ran benchmark1 and benchmark2 separately, same results. The only difference is the first benchmark that runs, then it "warms up" and is a bit slower. – NFRCR May 31 '12 at 18:20
  • This is somewhat unrelated, but wouldn't a compiler optimize a multiplication by 31 to a `(common << 5) - common`? – Matt May 31 '12 at 18:26
  • @Matt: Not for floating-point multiplication it wouldn't ;) For integer multiplication, yes I think most compilers know that trick, but depending on architecture it may or may not be faster. IMUL by two is almost certainly converted to a left shift. – Ben Voigt May 31 '12 at 18:30
19

EDIT: Saw the OP's assembly listing for my code. I doubt this is even a general bug in VS2011 now; it may simply be a special-case bug triggered by the OP's code. I ran the OP's code as-is with clang 3.2, gcc 4.6.2 and VS2010, and in all cases the maximum difference was ~1%.

I compiled the sources, with suitable modifications, as my ne.c file with the /O2 and /GL flags. Here's the source:

int ne1(int n) {
    return n != 0;
}

int ne2(int n) {
    return n < 0 || n > 0;
}

int ne3(int n) {
    return !(n == 0);
}

int main() { int p = ne1(rand()), q = ne2(rand()), r = ne3(rand()); }

and the corresponding assembly:

    ; Listing generated by Microsoft (R) Optimizing Compiler Version 16.00.30319.01 

    TITLE   D:\llvm_workspace\tests\ne.c
    .686P
    .XMM
    include listing.inc
    .model  flat

INCLUDELIB OLDNAMES

EXTRN   @__security_check_cookie@4:PROC
EXTRN   _rand:PROC
PUBLIC  _ne3
; Function compile flags: /Ogtpy
;   COMDAT _ne3
_TEXT   SEGMENT
_n$ = 8                         ; size = 4
_ne3    PROC                        ; COMDAT
; File d:\llvm_workspace\tests\ne.c
; Line 11
    xor eax, eax
    cmp DWORD PTR _n$[esp-4], eax
    setne   al
; Line 12
    ret 0
_ne3    ENDP
_TEXT   ENDS
PUBLIC  _ne2
; Function compile flags: /Ogtpy
;   COMDAT _ne2
_TEXT   SEGMENT
_n$ = 8                         ; size = 4
_ne2    PROC                        ; COMDAT
; Line 7
    xor eax, eax
    cmp eax, DWORD PTR _n$[esp-4]
    sbb eax, eax
    neg eax
; Line 8
    ret 0
_ne2    ENDP
_TEXT   ENDS
PUBLIC  _ne1
; Function compile flags: /Ogtpy
;   COMDAT _ne1
_TEXT   SEGMENT
_n$ = 8                         ; size = 4
_ne1    PROC                        ; COMDAT
; Line 3
    xor eax, eax
    cmp DWORD PTR _n$[esp-4], eax
    setne   al
; Line 4
    ret 0
_ne1    ENDP
_TEXT   ENDS
PUBLIC  _main
; Function compile flags: /Ogtpy
;   COMDAT _main
_TEXT   SEGMENT
_main   PROC                        ; COMDAT
; Line 14
    call    _rand
    call    _rand
    call    _rand
    xor eax, eax
    ret 0
_main   ENDP
_TEXT   ENDS
END

ne2(), which uses the <, > and || operators, is clearly more expensive. ne1() and ne3(), which use the != and == operators respectively, are terser and equivalent.

Visual Studio 2011 is in beta, so I would consider this a bug. My tests with two other compilers, namely gcc 4.6.2 and clang 3.2, with the O2 optimization switch yielded the exact same assembly for all three tests on my Windows 7 box. Here's a summary:

$ cat ne.c

#include <stdbool.h>
bool ne1(int n) {
    return n != 0;
}

bool ne2(int n) {
    return n < 0 || n > 0;
}

bool ne3(int n) {
    return !(n != 0);
}

int main() {}

yields with gcc:

_ne1:
LFB0:
    .cfi_startproc
    movl    4(%esp), %eax
    testl   %eax, %eax
    setne   %al
    ret
    .cfi_endproc
LFE0:
    .p2align 2,,3
    .globl  _ne2
    .def    _ne2;   .scl    2;  .type   32; .endef
_ne2:
LFB1:
    .cfi_startproc
    movl    4(%esp), %edx
    testl   %edx, %edx
    setne   %al
    ret
    .cfi_endproc
LFE1:
    .p2align 2,,3
    .globl  _ne3
    .def    _ne3;   .scl    2;  .type   32; .endef
_ne3:
LFB2:
    .cfi_startproc
    movl    4(%esp), %ecx
    testl   %ecx, %ecx
    sete    %al
    ret
    .cfi_endproc
LFE2:
    .def    ___main;    .scl    2;  .type   32; .endef
    .section    .text.startup,"x"
    .p2align 2,,3
    .globl  _main
    .def    _main;  .scl    2;  .type   32; .endef
_main:
LFB3:
    .cfi_startproc
    pushl   %ebp
    .cfi_def_cfa_offset 8
    .cfi_offset 5, -8
    movl    %esp, %ebp
    .cfi_def_cfa_register 5
    andl    $-16, %esp
    call    ___main
    xorl    %eax, %eax
    leave
    .cfi_restore 5
    .cfi_def_cfa 4, 4
    ret
    .cfi_endproc
LFE3:

and with clang:

    .def     _ne1;
    .scl    2;
    .type   32;
    .endef
    .text
    .globl  _ne1
    .align  16, 0x90
_ne1:
    cmpl    $0, 4(%esp)
    setne   %al
    movzbl  %al, %eax
    ret

    .def     _ne2;
    .scl    2;
    .type   32;
    .endef
    .globl  _ne2
    .align  16, 0x90
_ne2:
    cmpl    $0, 4(%esp)
    setne   %al
    movzbl  %al, %eax
    ret

    .def     _ne3;
    .scl    2;
    .type   32;
    .endef
    .globl  _ne3
    .align  16, 0x90
_ne3:
    cmpl    $0, 4(%esp)
    sete    %al
    movzbl  %al, %eax
    ret

    .def     _main;
    .scl    2;
    .type   32;
    .endef
    .globl  _main
    .align  16, 0x90
_main:
    pushl   %ebp
    movl    %esp, %ebp
    calll   ___main
    xorl    %eax, %eax
    popl    %ebp
    ret

My suggestion would be to file this as a bug with Microsoft Connect.

Note: I compiled these as C source since I don't think using the corresponding C++ compiler would make any significant difference here.

dirkgently
  • Good test, but you've put the comparisons inside functions. The OP didn't provide any context and would do well to compile your version. – phkahler May 31 '12 at 18:22
  • Compiling short snippets as independent functions isn't interesting. The optimization that happens after they get inlined at a call site is more important. – Ben Voigt May 31 '12 at 18:23
  • @BenVoigt: I think my tests with VS2010 (a compiler that has been released) seals the issue. Also, my (updated) source (that I used for VS2010) should answer whatever questions you have regarding independent functions. – dirkgently May 31 '12 at 18:27
  • Your new test is messed up, the compiler performed constant propagation, since it determined `n = 10` always. And then, on top of that, it eliminated the function calls entirely, since the result wasn't used and there are no side effects. – Ben Voigt May 31 '12 at 18:28
  • shouldn't ne3 be `!(n == 0)`? (You have `!(n != 0)`) – Matt May 31 '12 at 18:30
  • @BenVoigt: Updated. This should suffice. I stand by my point that this is a bug than a rule. – dirkgently May 31 '12 at 18:37
  • @BenVoigt: Also, refrain from saying stuff like *Compiling short snippets as independent functions isn't interesting.* IMHO, that is the de-facto rule of posting on boards/working through issues. Also, that is how you test compilers. – dirkgently May 31 '12 at 18:52
  • @dirkgently: When it comes to optimizer questions, context is everything. – Ben Voigt May 31 '12 at 19:24
  • @dirkgently Assembly output: http://anonymouse.org/cgi-bin/anon-www.cgi/http://pastebin.com/DVH8ex2P If I use a constant instead of rand() these instructions get optimized into something different. However this shouldn't matter since I also used std::generate() with rand() in my benchmarking code and it used the same exact instructions. So what should I think of this? An optimization that becomes a bug? – NFRCR May 31 '12 at 19:25
  • @NFRCR: I just ran your code with VS2010 `cl` on a x86. With both `NON_ZERO` and `ZERO` macros (separately of course!). The difference is ~1%. And I say difference because the numbers swing either way. The reason I used C was I didn't want to wade through the mangled names and the horror of iostreams in assembly. Also, note that a particular bit pattern may generate strange results (which is another reason for using `rand()` -- though `rand()` is not free from its failings). – dirkgently May 31 '12 at 19:34
  • @dirkgently If you have more time you could move zero1, zero2, nonZero1, nonZero2 to a separate compilation unit (Test.h + Test.cpp perhaps). It would then be easier to locate the assembly output for these functions in Test.asm. – NFRCR May 31 '12 at 19:37
  • @NFRCR: I think the code is weighed down by quite a few other things. Did you try the code I posted? I could move your code around but I don't really see the point. Logically speaking, when the compiler has nothing else to work on, I see direct equality checks to be more efficient than two potential ordering checks (and an optional `OR`). My suggestion will be to ask the VS team directly. – dirkgently May 31 '12 at 19:40
  • @dirkgently Yeah, I tried your code. The results are a few comments up. – NFRCR May 31 '12 at 19:42
  • IT'S NOT A BUG! How can it be a bug if the compiled code behaves as it should? It shows that there is room for improvement in the optimiser, but there's room for improvement in every optimiser. (That's a theorem, by the way.) – TonyK May 31 '12 at 19:45
  • @TonyK: That is exactly what I call a bug. An obvious (one instruction compared to three -- *as written by the programmer*) one at that too. Also, try this with a few compilers. Then you start thinking about whether this is a bug or not. – dirkgently May 31 '12 at 19:49
  • @NFRCR: The assembly, I believe? That then sort of settles it for me. It's not even a bug. It is behaving exactly as it should. You may have run into a special scenario. But then, as I keep saying, talk to the VS team. They are great, prompt and I'm sure you'd have a far more reliable answer from them. – dirkgently May 31 '12 at 19:51
  • Visual C++ bugs may be reported on [Microsoft Connect](http://connect.microsoft.com/visualstudio). – James McNellis May 31 '12 at 20:04
  • Also, it would be worth testing this with the [Visual C++ 2012 RC](http://www.microsoft.com/visualstudio/11/en-us/downloads) that was just released today. – James McNellis May 31 '12 at 20:12
  • FYI: I tested the OP's code with clang and gcc and again the differences were ~1% (or lower). – dirkgently May 31 '12 at 20:15
  • dirkgently: So a program that has room for improvement is *exactly what you call a bug*? We are not speaking the same language! – TonyK May 31 '12 at 20:36
  • @TonyK: IIUC, you were talking about the optimizer in your previous comment. That is what I was referring to in my earlier comment. – dirkgently May 31 '12 at 20:37
  • dirkgently: Yes. An optimiser has a bug if it changes the behaviour of the program it is optimising. That is not the case here. All optimisers have room for improvement; as I think I already said, this is a theorem. Of mathematics. – TonyK May 31 '12 at 20:41
  • @TonyK: It would also be a bug to degrade the performance of one instruction compared to that of three. – dirkgently May 31 '12 at 20:43
  • dirkgently: You're just not listening to me, are you? Oh well. – TonyK May 31 '12 at 20:44
  • I am not sure that this is really a bug. See my comment in response to the original question for conflicting benchmark data. – James McNellis Jun 01 '12 at 06:26
  • @JamesMcNellis: I'd updated my answer a while back, once I had a look at the assembly posted by the OP and my tests with the OP's code using three different compilers. Check out the edit at the top. I too don't think that one version is better than the other. – dirkgently Jun 01 '12 at 09:03