51

The purpose of the code is to find the total number of 32-bit floating point bit patterns which represent values between 0 and 1. It seems to me this should work, but for some reason the assembly output from Clang is basically the equivalent of return 0;.

I compiled this with Clang 3.3 and Clang 3.4.1, using -std=c++1y -Wall -Wextra -pedantic -O2 and -std=c++1y -Wall -Wextra -pedantic -O3

Clang 3.4 optimizes everything away with -O2 and -O3.

Clang 3.3 only optimizes everything away with -O3.

By "optimizes everything away" I mean that this is the assembly output of the program:

main:                                   # @main
    xorl    %eax, %eax
    ret
#include <limits>
#include <cstring>
#include <cstdint>

template <class TO, class FROM>
inline TO punning_cast(const FROM &input)
{
    TO out;
    std::memcpy(&out, &input, sizeof(TO));
    return out;
}

int main()
{
    uint32_t i = std::numeric_limits<uint32_t>::min();
    uint32_t count = 0;

    while (1)
    {
        float n = punning_cast<float>(i);
        if(n >= 0.0f && n <= 1.0f)
            count++;
        if (i == std::numeric_limits<uint32_t>::max())
            break;
        i++;
    }

    return count;
}
Mysticial
  • 464,885
  • 45
  • 335
  • 332
Chris_F
  • 4,991
  • 5
  • 33
  • 63
  • 9
    Why so involved, and not just using [`std::next_after`](http://en.cppreference.com/w/cpp/numeric/math/nextafter)? (But +1 for staing your goal before jumping into code.) – Kerrek SB May 23 '14 at 21:38
  • 1
    @KerrekSB I suppose I could try that method, but I still think the better question is why does Clang think this code does nothing useful. – Chris_F May 23 '14 at 21:40
  • 1
    Is it the punning_cast that gets optimized out? – harmic May 23 '14 at 21:42
  • 1
    It would be helpful if your code compiled and you included your compilation options. – Andrew Marshall May 23 '14 at 21:43
  • Can you run it unoptimized? (If so, does it work then?) – Brilliand May 23 '14 at 21:43
  • 1
    my best guess is that you have UB somewhere. If the compiler asserts you got UB on a path, he has absolutely no obligations on that path (not even to evaluate the correct code up until the UB). – bolov May 23 '14 at 21:48
  • It compiles for me, but it's not reproducible for any optimization level. (EDIT: I see now they are using Clang 3.3 and Clang 4.4. I'm on Clang 3.2) – BonzaiThePenguin May 23 '14 at 21:49
  • @BonzaiThePenguin See here: http://goo.gl/RTdHN3 – Chris_F May 23 '14 at 21:50
  • I think that your program violates the C++ strict aliasing rule, thus exhibiting UB. I’m not that versed in that matter to know for sure. – bolov May 23 '14 at 21:52
  • 3
    @bolov That was my first guess too, but I don't see anything invalid in this program, and clang doesn't report anything with `-fsanitize=undefined`. Using `memcpy` like this is not a violation of the aliasing rules. –  May 23 '14 at 21:52
  • @Chris_F I'm getting [no output] for clang 3.2, when the same version of clang on my system compiles it correctly? Are you sure their system is fully correct? – BonzaiThePenguin May 23 '14 at 21:54
  • 1
    @bolov Using memcpy to copy the contents of a float to an int does not violate the strict aliasing rule. – Chris_F May 23 '14 at 21:55
  • Looks like changing the condition to `i == std::numeric_limits::max()+2` returns 2. So it looks like for some reason, the compiler thinks `std::numeric_limits::max()` is 0, or that the condition `n >= 0.0f && n <= 1.0f` is always true – Danra May 23 '14 at 21:56
  • Move the increment `i++` before the max() test. – PaulMcKenzie May 23 '14 at 22:01
  • @PaulMcKenzie Then the last possible value gets skipped. Granted, that value won't be between 0 and 1 when reinterpreted as a float. – Brilliand May 23 '14 at 22:02
  • @Brilliand - It isn't strict correctness I was after. It is now that code is being generated instead of having the loop optimized away. Just maybe this is a clue as to what is actually occurring. – PaulMcKenzie May 23 '14 at 22:03
  • @PaulMcKenzie same effect if you just change the condition to `i == std::numeric_limits::max()-1` – Danra May 23 '14 at 22:04
  • Well that's interesting. Moving the check `!= max` before the conversion: http://coliru.stacked-crooked.com/a/293b1d25c74cbac7 – dyp May 23 '14 at 22:04
  • @dyp Also, same effect – Danra May 23 '14 at 22:06
  • 1
    I think you've had enough validation to go post it on llvm's bugzilla. – Marc Glisse May 23 '14 at 22:07
  • @dyp That's exactly the same thing as moving it after the `i++` like PaulMcKenzie commented. It's an infinite loop, so the only difference would be whether the condition is evaluated on the first iteration, but for the first iteration, the condition is known to be false (in a way the compiler really should be able to detect) –  May 23 '14 at 22:07
  • 1
    since you are generating all bit patterns one of it will be NaN. is comparing NaN with a non-NaN float (i.e. 0.0f and 1.0f) UB? – bolov May 23 '14 at 22:08
  • @bolov only sNaN would be an issue, but I would be surprised if that was the problem here. – Marc Glisse May 23 '14 at 22:09
  • @hvd Wow I got lost in the comments. – dyp May 23 '14 at 22:10
  • @bolov - I think you have something. Remove the call to the template, and just replace it with `float n = i;`. – PaulMcKenzie May 23 '14 at 22:11
  • What platform are you compiling this for and what platform are you compiling it on? – George May 23 '14 at 22:12
  • @PaulMcKenzie That doesn't do the same thing. They are checking every possible bit pattern for floats to see how many are between 0.0 and 1.0. – BonzaiThePenguin May 23 '14 at 22:13
  • As far as I know, the punning has to use an intermediary `char` array. But this doesn't seem to influence this issue. – dyp May 23 '14 at 22:13
  • @BonzaiThePenguin Again, I'm looking for the compiler to generate something. I am not after an exact duplication of the code that is in question. – PaulMcKenzie May 23 '14 at 22:14
  • 1
    @dyp the OP is punning with `memcpy` which is perfectly legal. – Mark B May 23 '14 at 22:15
  • @PaulMcKenzie But precalculating the final value of count is a trivial operation in that situation (return 2), so I'm not sure how it would help? – BonzaiThePenguin May 23 '14 at 22:15
  • @BonzaiThePenguin No, not the count. Keep the count where the OP had placed it. Just remove the call to the template and replace with a simple assignment. The issue seems to be the template. – PaulMcKenzie May 23 '14 at 22:16
  • Is `reinterpret_cast` available? (I'm wondering if it behaves differently from `punning_cast`.) – Brilliand May 23 '14 at 22:16
  • @PaulMcKenzie What I'm trying to say is that by replacing the call to the template with a simple assignment, the compiler will be able to precalculate the final value for count. – BonzaiThePenguin May 23 '14 at 22:17
  • The problem is not the template (or at least not this specific template...), doing `float n = *reinterpret_cast(&i);` instead of the punning cast gives the same result – Danra May 23 '14 at 22:20
  • Maybe the problem isn't the template, but it seems to have something to do with casting/assigning/copying from the `uint32_t` to a `float`. – PaulMcKenzie May 23 '14 at 22:22
  • 6
    It looks like a compiler bug. There's nothing UB about the code... – Kerrek SB May 23 '14 at 22:23
  • 1
    Isn't NaN represented by all bits on? If so, then that's UB that's skipped by moving the final check. – Brilliand May 23 '14 at 22:24
  • 1
    It's quite interesting that on my x86_64, and with a slightly modified condition to not trigger the bad behaviour, Clang produces vastly better code that GCC 4.8: Clang's code executes under 5 Bn cycles, while GCC's code needs about 15 Bn cycles (13 Bn with `-funroll-loops`); both with `-O3 -march=native`. – Kerrek SB May 23 '14 at 22:25
  • @Brilliand: What's UB about NaN? (There are many representations for NaN in IEE754.) – Kerrek SB May 23 '14 at 22:26
  • I have a *strong* suspicion that it's related to http://stackoverflow.com/questions/3592557/optimizing-away-a-while1-in-c0x but I can't see quite how. – Mark B May 23 '14 at 22:28
  • @KerrekSB Just basing it on an earlier comment in this thread... but I just looked it up, and all bits set isn't sNaN, so that's not it. – Brilliand May 23 '14 at 22:30
  • 1
    @Brilliand NaN is defined as 0x7F for the high byte, and non-zero for the lower three bytes. – BonzaiThePenguin May 23 '14 at 22:30
  • @MarkB I see the same behaviour if I change it to a regular `i = 0; do { ... } while (++i);` loop. –  May 23 '14 at 22:31
  • Did you try expanding the pruning_cast body in place of the call to pruning_cast? – cppguy May 23 '14 at 22:32
  • Can OP use `volatile` here? I read it prevents certain optimisations on Wiki.. – Brandon May 23 '14 at 22:32
  • @Brilliand And then +INF and -INF are defined as 0x7F and 0xFF for the high byte, respectively, and zero for the lower three bytes. All ones is valid, so it would not be fully correct to skip the last iteration. – BonzaiThePenguin May 23 '14 at 22:40
  • I'm pretty sure this has nothing to do with NaNs, if it did, then I think this version would exhibit the same behavior. http://goo.gl/qOlF4T – Chris_F May 23 '14 at 22:43
  • Even though this is looking like a clang bug, you might want to further classify the float bit pattern with [`std::isfinite`](http://en.cppreference.com/w/cpp/numeric/math/isfinite) or a similar function. Not sure there aren't subtle issues in the range check when encountering NaN patterns, for example. Or just go with @KerrekSB's suggestion, and use `std::nextafter`. – Brett Hale May 23 '14 at 23:59
  • Actually, I get `(1065353218)` with or without the additional `std::isfinite` check - of which `(8388609)` are subnormal. There are `(16777214)` NaN patterns, which return false in all comparisons anyway (which I'd overlooked). And `(2)` INF patterns. – Brett Hale May 24 '14 at 00:15
  • @KerrekSB it could generate a trap representation of float. Also, uninitialized bits may be involved if `sizeof(float) > sizeof(uint32)` – M.M May 24 '14 at 01:54
  • 1
    @MattMcNabb: possibly, though I ran it through ubsan, too, with no diagnostics to that effect... – Kerrek SB May 24 '14 at 10:49
  • 2
    Possibly related: http://llvm.org/bugs/show_bug.cgi?id=17288 – Kerrek SB May 24 '14 at 22:12
  • 1
    @KerrekSB looks very likely due to the bug's description, and since using -fno-vectorize eliminates the issue. Also, the generated code up to the vectorization looks good. – Danra May 24 '14 at 23:20

3 Answers3

62

Here's a simpler test case which points out that it's a compiler bug:

http://coliru.stacked-crooked.com/a/58b3f9b4edd8e373

#include <cstdint>

int main()
{
    uint32_t i = 0;
    uint32_t count = 1;

    while (1)
    {
        if( i < 5 )
            count+=1;

        if (i == 0xFFFFFFFF)
            break;
        i++;
    }

    return count; // should return 6
}

The assembly shows that it outputs 1, instead of 6. It doesn't think it's an infinite loop either, in which case the assembly doesn't return from main.

mukunda
  • 2,908
  • 15
  • 21
4

This is not an answer but a datapoint that's too big for a comment.

Interestingly, if you print count right before the return then clang will still optimize everything out and print 0 with -O3 and 1065353218 with -O0. (Note that echo $? reports that the return value is always 2, no matter what the actual return was). To me, this makes it look like a compiler bug.

If you turn your while into a for:

for (uint32_t i = std::numeric_limits<uint32_t>::min(); i != std::numeric_limits<uint32_t>::max(); ++i)
{
    float n = punning_cast<float>(i);
    if(n >= 0.0f && n <= 1.0f)
        count++;
}

Then the same answer comes out for both optimization levels. Definitely true if you print, and though I haven't looked at the assembly it's likely also true for the unprinted case because it does take time before it finishes. (clang 3.4)

I've found bugs in LLVM before (funny template business that made clang segfault) and they've been responsive in fixing it if you give a nice and clear example of the fault. I suggest you submit this as a bug report.

Adam
  • 16,808
  • 7
  • 52
  • 98
3

Using mukunda's example above, in clang 3.4 with -O2 the issue seems to be in the vectorization phase. The vectorized code jumps on entry to past the vectorized code:

br i1 true, label %middle.block, label %vector.ph

so count's value remains unchanged from its initialization.

*** IR Dump Before Combine redundant instructions ***
; Function Attrs: nounwind readnone ssp uwtable
define i32 @main() #0 {
entry:
  br i1 true, label %middle.block, label %vector.ph

vector.ph:                                        ; preds = %entry
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %vec.phi = phi <4 x i32> [ <i32 1, i32 0, i32 0, i32 0>, %vector.ph ], [ %4, %vector.body ]
  %vec.phi8 = phi <4 x i32> [ zeroinitializer, %vector.ph ], [ %5, %vector.body ]
  %broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
  %broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
  %induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
  %induction7 = add <4 x i32> %broadcast.splat, <i32 4, i32 5, i32 6, i32 7>
  %0 = icmp ult <4 x i32> %induction, <i32 5, i32 5, i32 5, i32 5>
  %1 = icmp ult <4 x i32> %induction7, <i32 5, i32 5, i32 5, i32 5>
  %2 = zext <4 x i1> %0 to <4 x i32>
  %3 = zext <4 x i1> %1 to <4 x i32>
  %4 = add <4 x i32> %2, %vec.phi
  %5 = add <4 x i32> %3, %vec.phi8
  %6 = icmp eq <4 x i32> %induction, <i32 -1, i32 -1, i32 -1, i32 -1>
  %7 = icmp eq <4 x i32> %induction7, <i32 -1, i32 -1, i32 -1, i32 -1>
  %8 = add <4 x i32> %induction, <i32 1, i32 1, i32 1, i32 1>
  %9 = add <4 x i32> %induction7, <i32 1, i32 1, i32 1, i32 1>
  %index.next = add i32 %index, 8
  %10 = icmp eq i32 %index.next, 0
  br i1 %10, label %middle.block, label %vector.body, !llvm.loop !1

middle.block:                                     ; preds = %vector.body, %entry
  %resume.val = phi i32 [ 0, %entry ], [ 0, %vector.body ]
  %trunc.resume.val = phi i32 [ 0, %entry ], [ 0, %vector.body ]
  %rdx.vec.exit.phi = phi <4 x i32> [ <i32 1, i32 0, i32 0, i32 0>, %entry ], [ %4, %vector.body ]
  %rdx.vec.exit.phi9 = phi <4 x i32> [ zeroinitializer, %entry ], [ %5, %vector.body ]
  %bin.rdx = add <4 x i32> %rdx.vec.exit.phi9, %rdx.vec.exit.phi
  %rdx.shuf = shufflevector <4 x i32> %bin.rdx, <4 x i32> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
  %bin.rdx10 = add <4 x i32> %bin.rdx, %rdx.shuf
  %rdx.shuf11 = shufflevector <4 x i32> %bin.rdx10, <4 x i32> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
  %bin.rdx12 = add <4 x i32> %bin.rdx10, %rdx.shuf11
  %11 = extractelement <4 x i32> %bin.rdx12, i32 0
  %cmp.n = icmp eq i32 0, %resume.val
  br i1 %cmp.n, label %while.end, label %scalar.ph

scalar.ph:                                        ; preds = %middle.block
  br label %while.body
while.body:                                       ; preds = %while.body, %scalar.ph
  %i.0 = phi i32 [ %trunc.resume.val, %scalar.ph ], [ %inc, %while.body ]
  %count.0 = phi i32 [ %11, %scalar.ph ], [ %add.count.0, %while.body ]
  %cmp = icmp ult i32 %i.0, 5
  %add = zext i1 %cmp to i32
  %add.count.0 = add i32 %add, %count.0
  %cmp1 = icmp eq i32 %i.0, -1
  %inc = add i32 %i.0, 1
  br i1 %cmp1, label %while.end, label %while.body, !llvm.loop !4

while.end:                                        ; preds = %middle.block, %while.body
  %add.count.0.lcssa = phi i32 [ %add.count.0, %while.body ], [ %11, %middle.block ]
  ret i32 %add.count.0.lcssa
}

The optimizer later erases unreachable and non-effective code - which is almost the entire function body.

Danra
  • 9,546
  • 5
  • 59
  • 117