Clang giving very different performance for expressions which intuitively should be equivalent

Question

Could someone explain me these considerable performance differences between these expressions which I would expect to give similar performance. I'm compiling with Apple LLVM version 5.1 (clang-503.0.38) (based on LLVM 3.4svn) in release mode.

Here's my test code (just change CASE to 1, 2, 3 or 4 to test yourself):

#include <iostream>
#include <chrono>

#define CASE 1

inline int foo(int n) {
    return
#if CASE == 1
    (n % 2) ? 9 : 6

#elif CASE == 2
    (n % 2) == true ? 9 : 6

#elif CASE == 3
    6 + (n % 2) * 3

#elif CASE == 4
    6 + bool(n % 2) * 3

#endif
    ;
}

int main(int argc, const char* argv[])
{
    std::chrono::time_point<std::chrono::system_clock> start, end;
    start = std::chrono::system_clock::now();

    int n = argc;
    for (int i = 0; i < 100000000; ++i) {
        n += foo(n);
    }

    end = std::chrono::system_clock::now();
    std::chrono::duration<double> elapsed_seconds = end-start;

    std::cout << "elapsed time: " << elapsed_seconds.count() << "\n";
    std::cout << "value: " << n << "\n";

    return 0;
}

And here are the timings I get:

CASE   EXPRESSION                TIME
1      (n % 2) ? 9 : 6           0.1585
2      (n % 2) == true ? 9 : 6   0.3491
3      6 + (n % 2) * 3           0.2559
4      6 + bool(n % 2) * 3       0.1906

Here's the difference in assembly between CASE 1 and CASE 2:

CASE 1:

Ltmp12:
LBB0_1:                                 ## =>This Inner Loop Header: Depth=1
    ##DEBUG_VALUE: main:argv <- RSI
    ##DEBUG_VALUE: i <- 0
    .loc    1 24 0                  ## /Test/main.cpp:24:0
    movl    %ebx, %ecx
    andl    $1, %ecx
    leal    (%rcx,%rcx,2), %ecx
Ltmp13:
    .loc    1 48 14                 ## /Test/main.cpp:48:14
    leal    6(%rbx,%rcx), %ebx

CASE 2:

Ltmp12:
LBB0_1:                                 ## =>This Inner Loop Header: Depth=1
    ##DEBUG_VALUE: main:argv <- RSI
    ##DEBUG_VALUE: i <- 0
    .loc    1 24 0                  ## /Test/main.cpp:24:0
    movl    %ebx, %ecx
    shrl    $31, %ecx
    addl    %ebx, %ecx
    andl    $-2, %ecx
    movl    %ebx, %edx
    subl    %ecx, %edx
    cmpl    $1, %edx
    sete    %cl
    movzbl  %cl, %ecx
    leal    (%rcx,%rcx,2), %ecx
Ltmp13:
    .loc    1 48 14                 ## /Test/main.cpp:48:14
    leal    6(%rbx,%rcx), %ebx

And here's the difference in assembly between CASE 3 and CASE 4:

CASE 3:

Ltmp12:
LBB0_1:                                 ## =>This Inner Loop Header: Depth=1
    ##DEBUG_VALUE: main:argv <- RSI
    ##DEBUG_VALUE: i <- 0
    .loc    1 24 0                  ## /Test/main.cpp:24:0
    movl    %ebx, %ecx
    shrl    $31, %ecx
    addl    %ebx, %ecx
    andl    $-2, %ecx
    movl    %ebx, %edx
    subl    %ecx, %edx
    leal    (%rdx,%rdx,2), %ecx
Ltmp13:
    .loc    1 48 14                 ## /Test/main.cpp:48:14
    leal    6(%rbx,%rcx), %ebx

CASE 4:

Ltmp12:
LBB0_1:                                 ## =>This Inner Loop Header: Depth=1
    ##DEBUG_VALUE: main:argv <- RSI
    ##DEBUG_VALUE: i <- 0
    .loc    1 24 0                  ## /Test/main.cpp:24:0
    movl    %ebx, %ecx
    andl    $1, %ecx
    negl    %ecx
    andl    $3, %ecx
Ltmp13:
    .loc    1 48 14                 ## /Test/main.cpp:48:14
    leal    6(%rbx,%rcx), %ebx

And your compilation flags are?? It's an absolutely pointless comparison without knowing which optimizations were applied. So far - different number of assembly instructions give different execution time, that much should be obvious. — berkus, Apr 02 '14 at 17:54
I'm using Xcode, so I don't easily see all the flags, but one important flag I can see is -Os — , Apr 02 '14 at 18:02
What does `-O2` look like? Otherwise, it looks like optimizer fail. — Mysticial, Apr 02 '14 at 18:03
For both, clang++ and g++ at coliru (`-O3`), `(n % 2) != false ? 9 : 6` is significantly faster than `(n % 2) == true ? 9 : 6` — dyp, Apr 02 '14 at 18:11
@dyp That's very interesting! Looks like an optimizer corner case. — berkus, Apr 02 '14 at 18:13
@Berkus `bool(n % 2) != false ? 9 : 6` is as fast as both `bool(n % 2) == true ? 9 : 6` and `(n % 2) ? 9 : 6` — dyp, Apr 02 '14 at 18:16
@dyp: I shared it (with O2) for easier review [here](http://coliru.stacked-crooked.com/a/0e166cc6d2b1ff33) They all get the same IR (`%1 = and i32 %n, 1` + `%2 = icmp ne i32 %1, 0`) except `(n % 2) == true ? 9 : 6` which gets `%1 = srem i32 %n, 2` + `%2 = icmp eq i32 %1, 1` for some reason. — Matthieu M., Apr 02 '14 at 18:24
@berkus Those snippets are different. `bool(n % 2) == true ? 9 : 6` contains an explicit conversion to `bool`, whereas `(n % 2) == true ? 9 : 6` does not. The first one is significantly faster than the second. — dyp, Apr 02 '14 at 19:09
@MatthieuM. Hmm.. Note this problem is not clang-specific, coliru's g++4.8.2 behaves similarly. — dyp, Apr 02 '14 at 19:10
@dyp: Yes, I am just better at reading LLVM IR so I prefer Clang here; I cannot understand the issue though :/ — Matthieu M., Apr 03 '14 at 06:58
@dyp So in second case it's actually doing an integer compare to true? — berkus, Apr 03 '14 at 14:01
I'm curious: what is the time (and the assembly) you obtain for `6+(n&1)+((n&1)<<1)`? and for `3*(2+(n&1))`? :D — Massa, Apr 03 '14 at 16:20

score 4 · Accepted Answer · edited May 23 '17 at 12:20

This answer currently only covers the difference between the first two cases.

What are the possible values of (n % 2)? Surely, it's 0 and 1, right?

Wrong. It's 0, 1 and -1. Because n is a signed integer, and the result of % can be negative.

(n % 2) ? 6 : 9 implicitly converts the expression n % 2 to bool. The result of this conversion is true IFF the value nonzero. The conversion therefore is equivalent to (n % 2) != 0.

In, (n % 2) == true ? 6 : 9, for the comparison (n % 2) == true, the usual arithmetic conversions are applied to both sides (note that bool is an arithmetic type). true is promoted to an int of value 1. So the conversion is equivalent to (n % 2) == 1.

The two conversions (n % 2) != 0 and (n % 2) == 1 yield different results for a negative n: Let n = -1. Then n % 2 == -1, and -1 != 0 is true, but -1 == 1 is false.

Therefore, the compiler has to introduce some additional complexity to cope with the sign.

The difference in runtime disappears if you make n an unsigned integer, or remove the sign issue any other way (e.g. by comparing n % 2 != false).

I got this idea by looking at the assembly output, especially the following line:

shrl    $31, %eax

Using the highest bit made no sense to me first, until I realized the highest bit is used as the sign.

There's also no difference between cases 3 and 4 when using an `unsigned int`. Both are slower than cases 1 and 2, though. — dyp, Apr 03 '14 at 16:19

Clang giving very different performance for expressions which intuitively should be equivalent

1 Answers1