Why is Clang optimizing this code out?

Question

The purpose of the code is to find the total number of 32-bit floating point bit patterns which represent values between 0 and 1. It seems to me this should work, but for some reason the assembly output from Clang is basically the equivalent of return 0;.

I compiled this with Clang 3.3 and Clang 3.4.1, using -std=c++1y -Wall -Wextra -pedantic -O2 and -std=c++1y -Wall -Wextra -pedantic -O3

Clang 3.4 optimizes everything away with -O2 and -O3.

Clang 3.3 only optimizes everything away with -O3.

By "optimizes everything away" I mean that this is the assembly output of the program:

main:                                   # @main
    xorl    %eax, %eax
    ret

#include <limits>
#include <cstring>
#include <cstdint>

template <class TO, class FROM>
inline TO punning_cast(const FROM &input)
{
    TO out;
    std::memcpy(&out, &input, sizeof(TO));
    return out;
}

int main()
{
    uint32_t i = std::numeric_limits<uint32_t>::min();
    uint32_t count = 0;

    while (1)
    {
        float n = punning_cast<float>(i);
        if(n >= 0.0f && n <= 1.0f)
            count++;
        if (i == std::numeric_limits<uint32_t>::max())
            break;
        i++;
    }

    return count;
}

Why so involved, and not just using [`std::next_after`](http://en.cppreference.com/w/cpp/numeric/math/nextafter)? (But +1 for staing your goal before jumping into code.) — Kerrek SB, May 23 '14 at 21:38
@KerrekSB I suppose I could try that method, but I still think the better question is why does Clang think this code does nothing useful. — Chris_F, May 23 '14 at 21:40
It would be helpful if your code compiled and you included your compilation options. — Andrew Marshall, May 23 '14 at 21:43
my best guess is that you have UB somewhere. If the compiler asserts you got UB on a path, he has absolutely no obligations on that path (not even to evaluate the correct code up until the UB). — bolov, May 23 '14 at 21:48
It compiles for me, but it's not reproducible for any optimization level. (EDIT: I see now they are using Clang 3.3 and Clang 4.4. I'm on Clang 3.2) — BonzaiThePenguin, May 23 '14 at 21:49
I think that your program violates the C++ strict aliasing rule, thus exhibiting UB. I’m not that versed in that matter to know for sure. — bolov, May 23 '14 at 21:52
@bolov That was my first guess too, but I don't see anything invalid in this program, and clang doesn't report anything with `-fsanitize=undefined`. Using `memcpy` like this is not a violation of the aliasing rules. — , May 23 '14 at 21:52
@Chris_F I'm getting [no output] for clang 3.2, when the same version of clang on my system compiles it correctly? Are you sure their system is fully correct? — BonzaiThePenguin, May 23 '14 at 21:54
@bolov Using memcpy to copy the contents of a float to an int does not violate the strict aliasing rule. — Chris_F, May 23 '14 at 21:55
Looks like changing the condition to `i == std::numeric_limits::max()+2` returns 2. So it looks like for some reason, the compiler thinks `std::numeric_limits::max()` is 0, or that the condition `n >= 0.0f && n <= 1.0f` is always true — Danra, May 23 '14 at 21:56
@PaulMcKenzie Then the last possible value gets skipped. Granted, that value won't be between 0 and 1 when reinterpreted as a float. — Brilliand, May 23 '14 at 22:02
@Brilliand - It isn't strict correctness I was after. It is now that code is being generated instead of having the loop optimized away. Just maybe this is a clue as to what is actually occurring. — PaulMcKenzie, May 23 '14 at 22:03
@PaulMcKenzie same effect if you just change the condition to `i == std::numeric_limits::max()-1` — Danra, May 23 '14 at 22:04
Well that's interesting. Moving the check `!= max` before the conversion: http://coliru.stacked-crooked.com/a/293b1d25c74cbac7 — dyp, May 23 '14 at 22:04
I think you've had enough validation to go post it on llvm's bugzilla. — Marc Glisse, May 23 '14 at 22:07
@dyp That's exactly the same thing as moving it after the `i++` like PaulMcKenzie commented. It's an infinite loop, so the only difference would be whether the condition is evaluated on the first iteration, but for the first iteration, the condition is known to be false (in a way the compiler really should be able to detect) — , May 23 '14 at 22:07
since you are generating all bit patterns one of it will be NaN. is comparing NaN with a non-NaN float (i.e. 0.0f and 1.0f) UB? — bolov, May 23 '14 at 22:08
@bolov only sNaN would be an issue, but I would be surprised if that was the problem here. — Marc Glisse, May 23 '14 at 22:09
@bolov - I think you have something. Remove the call to the template, and just replace it with `float n = i;`. — PaulMcKenzie, May 23 '14 at 22:11
What platform are you compiling this for and what platform are you compiling it on? — George, May 23 '14 at 22:12
@PaulMcKenzie That doesn't do the same thing. They are checking every possible bit pattern for floats to see how many are between 0.0 and 1.0. — BonzaiThePenguin, May 23 '14 at 22:13
As far as I know, the punning has to use an intermediary `char` array. But this doesn't seem to influence this issue. — dyp, May 23 '14 at 22:13
@BonzaiThePenguin Again, I'm looking for the compiler to generate something. I am not after an exact duplication of the code that is in question. — PaulMcKenzie, May 23 '14 at 22:14
@dyp the OP is punning with `memcpy` which is perfectly legal. — Mark B, May 23 '14 at 22:15
@PaulMcKenzie But precalculating the final value of count is a trivial operation in that situation (return 2), so I'm not sure how it would help? — BonzaiThePenguin, May 23 '14 at 22:15
@BonzaiThePenguin No, not the count. Keep the count where the OP had placed it. Just remove the call to the template and replace with a simple assignment. The issue seems to be the template. — PaulMcKenzie, May 23 '14 at 22:16
Is `reinterpret_cast` available? (I'm wondering if it behaves differently from `punning_cast`.) — Brilliand, May 23 '14 at 22:16
@PaulMcKenzie What I'm trying to say is that by replacing the call to the template with a simple assignment, the compiler will be able to precalculate the final value for count. — BonzaiThePenguin, May 23 '14 at 22:17
The problem is not the template (or at least not this specific template...), doing `float n = *reinterpret_cast(&i);` instead of the punning cast gives the same result — Danra, May 23 '14 at 22:20
Maybe the problem isn't the template, but it seems to have something to do with casting/assigning/copying from the `uint32_t` to a `float`. — PaulMcKenzie, May 23 '14 at 22:22
It looks like a compiler bug. There's nothing UB about the code... — Kerrek SB, May 23 '14 at 22:23
Isn't NaN represented by all bits on? If so, then that's UB that's skipped by moving the final check. — Brilliand, May 23 '14 at 22:24
It's quite interesting that on my x86_64, and with a slightly modified condition to not trigger the bad behaviour, Clang produces vastly better code that GCC 4.8: Clang's code executes under 5 Bn cycles, while GCC's code needs about 15 Bn cycles (13 Bn with `-funroll-loops`); both with `-O3 -march=native`. — Kerrek SB, May 23 '14 at 22:25
@Brilliand: What's UB about NaN? (There are many representations for NaN in IEE754.) — Kerrek SB, May 23 '14 at 22:26
I have a *strong* suspicion that it's related to http://stackoverflow.com/questions/3592557/optimizing-away-a-while1-in-c0x but I can't see quite how. — Mark B, May 23 '14 at 22:28
@KerrekSB Just basing it on an earlier comment in this thread... but I just looked it up, and all bits set isn't sNaN, so that's not it. — Brilliand, May 23 '14 at 22:30
@Brilliand NaN is defined as 0x7F for the high byte, and non-zero for the lower three bytes. — BonzaiThePenguin, May 23 '14 at 22:30
@MarkB I see the same behaviour if I change it to a regular `i = 0; do { ... } while (++i);` loop. — , May 23 '14 at 22:31
Did you try expanding the pruning_cast body in place of the call to pruning_cast? — cppguy, May 23 '14 at 22:32
Can OP use `volatile` here? I read it prevents certain optimisations on Wiki.. — Brandon, May 23 '14 at 22:32
@Brilliand And then +INF and -INF are defined as 0x7F and 0xFF for the high byte, respectively, and zero for the lower three bytes. All ones is valid, so it would not be fully correct to skip the last iteration. — BonzaiThePenguin, May 23 '14 at 22:40
I'm pretty sure this has nothing to do with NaNs, if it did, then I think this version would exhibit the same behavior. http://goo.gl/qOlF4T — Chris_F, May 23 '14 at 22:43
Even though this is looking like a clang bug, you might want to further classify the float bit pattern with [`std::isfinite`](http://en.cppreference.com/w/cpp/numeric/math/isfinite) or a similar function. Not sure there aren't subtle issues in the range check when encountering NaN patterns, for example. Or just go with @KerrekSB's suggestion, and use `std::nextafter`. — Brett Hale, May 23 '14 at 23:59
Actually, I get `(1065353218)` with or without the additional `std::isfinite` check - of which `(8388609)` are subnormal. There are `(16777214)` NaN patterns, which return false in all comparisons anyway (which I'd overlooked). And `(2)` INF patterns. — Brett Hale, May 24 '14 at 00:15
@KerrekSB it could generate a trap representation of float. Also, uninitialized bits may be involved if `sizeof(float) > sizeof(uint32)` — M.M, May 24 '14 at 01:54
@MattMcNabb: possibly, though I ran it through ubsan, too, with no diagnostics to that effect... — Kerrek SB, May 24 '14 at 10:49
Possibly related: http://llvm.org/bugs/show_bug.cgi?id=17288 — Kerrek SB, May 24 '14 at 22:12
@KerrekSB looks very likely due to the bug's description, and since using -fno-vectorize eliminates the issue. Also, the generated code up to the vectorization looks good. — Danra, May 24 '14 at 23:20

mukunda · Accepted Answer · 2015-11-26T15:09:20.923

62

Here's a simpler test case which points out that it's a compiler bug:

http://coliru.stacked-crooked.com/a/58b3f9b4edd8e373

#include <cstdint>

int main()
{
    uint32_t i = 0;
    uint32_t count = 1;

    while (1)
    {
        if( i < 5 )
            count+=1;

        if (i == 0xFFFFFFFF)
            break;
        i++;
    }

    return count; // should return 6
}

The assembly shows that it outputs 1, instead of 6. It doesn't think it's an infinite loop either, in which case the assembly doesn't return from main.

edited Nov 26 '15 at 15:09

answered May 23 '14 at 23:08

mukunda

2,908
15
21

Great example. When I turned this into a `for` then again it seems to do the right thing. So it looks like clang's optimizer makes mistakes on infinite while loops. – Adam May 23 '14 at 23:13
1

Thanks for the simplified test case. I'll go ahead and mark this as the answer since at this point there is probably nothing more useful that could be contributed here. Only thing left to do is file a bug report with this example. – Chris_F May 23 '14 at 23:17
@Adam: The transformation you made to turn it into a `for` loop would be interesting to know. – R.. GitHub STOP HELPING ICE May 24 '14 at 18:21
@Chris_F: Could you post a link to the bug report here once you file it? – R.. GitHub STOP HELPING ICE May 24 '14 at 18:22
2

@R.. `for (uint32_t i = 0; i != 0xFFFFFFFF; i++)` – Adam May 24 '14 at 19:00
@Adam, that for loop won't run for the value 0xFFFFFFFF whereas the while loop will – noggin182 May 27 '14 at 23:32

score 4 · Answer 2 · answered May 23 '14 at 23:05

This is not an answer but a datapoint that's too big for a comment.

Interestingly, if you print count right before the return then clang will still optimize everything out and print 0 with -O3 and 1065353218 with -O0. (Note that echo $? reports that the return value is always 2, no matter what the actual return was). To me, this makes it look like a compiler bug.

If you turn your while into a for:

for (uint32_t i = std::numeric_limits<uint32_t>::min(); i != std::numeric_limits<uint32_t>::max(); ++i)
{
    float n = punning_cast<float>(i);
    if(n >= 0.0f && n <= 1.0f)
        count++;
}

Then the same answer comes out for both optimization levels. Definitely true if you print, and though I haven't looked at the assembly it's likely also true for the unprinted case because it does take time before it finishes. (clang 3.4)

I've found bugs in LLVM before (funny template business that made clang segfault) and they've been responsive in fixing it if you give a nice and clear example of the fault. I suggest you submit this as a bug report.

Danra · Answer 3 · 2014-05-27T09:23:09.163

Using mukunda's example above, in clang 3.4 with -O2 the issue seems to be in the vectorization phase. The vectorized code jumps on entry to past the vectorized code:

br i1 true, label %middle.block, label %vector.ph

so count's value remains unchanged from its initialization.

*** IR Dump Before Combine redundant instructions ***
; Function Attrs: nounwind readnone ssp uwtable
define i32 @main() #0 {
entry:
  br i1 true, label %middle.block, label %vector.ph

vector.ph:                                        ; preds = %entry
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %vec.phi = phi <4 x i32> [ <i32 1, i32 0, i32 0, i32 0>, %vector.ph ], [ %4, %vector.body ]
  %vec.phi8 = phi <4 x i32> [ zeroinitializer, %vector.ph ], [ %5, %vector.body ]
  %broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
  %broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
  %induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
  %induction7 = add <4 x i32> %broadcast.splat, <i32 4, i32 5, i32 6, i32 7>
  %0 = icmp ult <4 x i32> %induction, <i32 5, i32 5, i32 5, i32 5>
  %1 = icmp ult <4 x i32> %induction7, <i32 5, i32 5, i32 5, i32 5>
  %2 = zext <4 x i1> %0 to <4 x i32>
  %3 = zext <4 x i1> %1 to <4 x i32>
  %4 = add <4 x i32> %2, %vec.phi
  %5 = add <4 x i32> %3, %vec.phi8
  %6 = icmp eq <4 x i32> %induction, <i32 -1, i32 -1, i32 -1, i32 -1>
  %7 = icmp eq <4 x i32> %induction7, <i32 -1, i32 -1, i32 -1, i32 -1>
  %8 = add <4 x i32> %induction, <i32 1, i32 1, i32 1, i32 1>
  %9 = add <4 x i32> %induction7, <i32 1, i32 1, i32 1, i32 1>
  %index.next = add i32 %index, 8
  %10 = icmp eq i32 %index.next, 0
  br i1 %10, label %middle.block, label %vector.body, !llvm.loop !1

middle.block:                                     ; preds = %vector.body, %entry
  %resume.val = phi i32 [ 0, %entry ], [ 0, %vector.body ]
  %trunc.resume.val = phi i32 [ 0, %entry ], [ 0, %vector.body ]
  %rdx.vec.exit.phi = phi <4 x i32> [ <i32 1, i32 0, i32 0, i32 0>, %entry ], [ %4, %vector.body ]
  %rdx.vec.exit.phi9 = phi <4 x i32> [ zeroinitializer, %entry ], [ %5, %vector.body ]
  %bin.rdx = add <4 x i32> %rdx.vec.exit.phi9, %rdx.vec.exit.phi
  %rdx.shuf = shufflevector <4 x i32> %bin.rdx, <4 x i32> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
  %bin.rdx10 = add <4 x i32> %bin.rdx, %rdx.shuf
  %rdx.shuf11 = shufflevector <4 x i32> %bin.rdx10, <4 x i32> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
  %bin.rdx12 = add <4 x i32> %bin.rdx10, %rdx.shuf11
  %11 = extractelement <4 x i32> %bin.rdx12, i32 0
  %cmp.n = icmp eq i32 0, %resume.val
  br i1 %cmp.n, label %while.end, label %scalar.ph

scalar.ph:                                        ; preds = %middle.block
  br label %while.body
while.body:                                       ; preds = %while.body, %scalar.ph
  %i.0 = phi i32 [ %trunc.resume.val, %scalar.ph ], [ %inc, %while.body ]
  %count.0 = phi i32 [ %11, %scalar.ph ], [ %add.count.0, %while.body ]
  %cmp = icmp ult i32 %i.0, 5
  %add = zext i1 %cmp to i32
  %add.count.0 = add i32 %add, %count.0
  %cmp1 = icmp eq i32 %i.0, -1
  %inc = add i32 %i.0, 1
  br i1 %cmp1, label %while.end, label %while.body, !llvm.loop !4

while.end:                                        ; preds = %middle.block, %while.body
  %add.count.0.lcssa = phi i32 [ %add.count.0, %while.body ], [ %11, %middle.block ]
  ret i32 %add.count.0.lcssa
}

The optimizer later erases unreachable and non-effective code - which is almost the entire function body.

Why is Clang optimizing this code out?

3 Answers3

Linked