If I compile the following code with Clang 3.3 using -O3 -fno-vectorize, I get the same assembly output even if I remove the commented line. The code type-puns every possible 32-bit integer to a float and counts the ones that fall in the range [0, 1]. Is Clang's optimizer actually smart enough to realize that 0xFFFFFFFF, when punned to float, is not in the range [0, 1], and so drop the second call to fn entirely? GCC produces different code when the second call is removed.

#include <limits>
#include <cstring>
#include <cstdint>

template <class TO, class FROM>
inline TO punning_cast(const FROM &input)
{
    TO out;
    std::memcpy(&out, &input, sizeof(TO));
    return out;
}

int main()
{
    uint32_t count = 0;

    auto fn = [&count] (uint32_t x) {
        float f = punning_cast<float>(x);
        if (f >= 0.0f && f <= 1.0f)
            count++;
    };

    for(uint32_t i = 0; i < std::numeric_limits<uint32_t>::max(); ++i)
    {
        fn(i);
    }
    fn(std::numeric_limits<uint32_t>::max()); //removing this changes nothing

    return count;
}

See here: http://goo.gl/YZPw5i

Chris_F
  • 1
    Is `count` correct in both cases? – ildjarn May 29 '14 at 05:29
  • 1
    http://stackoverflow.com/questions/23838661/why-is-clang-optimizing-this-code-out – rici May 29 '14 at 05:33
  • Clang has a habit of massively optimizing out constant-only functions (effectively doing a quite sophisticated constant folding on them). [Figure 1.](http://stackoverflow.com/questions/15114140/writing-binary-number-system-in-c-code) – The Paramagnetic Croissant May 29 '14 at 05:40
  • somewhat relevant http://blog.regehr.org/archives/959 – 9dan May 29 '14 at 05:43
  • I think the main point is how deeply the compiler understands the internal operations of memcpy. – 9dan May 29 '14 at 05:45
  • 1
    @9dan Nowadays in modern C libraries and compilers, `memcpy` is almost always a compiler intrinsic function. – The Paramagnetic Croissant May 29 '14 at 05:46
  • 2
    @9dan: It's less "understanding the internals of memcpy" (which might require the compiler to understand the hand-optimized library implementation), and more "understanding the intended function of memcpy". C/C++ allows you to essentially perform any optimization you like, provided the result is unchanged with respect to the specification. Since `memcpy` is specified by C/C++, it can be in principle optimized in any way provided the result is the same. – nneonneo May 29 '14 at 05:58

1 Answer


Yes, it looks like Clang really is this smart.

Test:

#include <limits>
#include <cstring>
#include <cstdint>

template <class TO, class FROM>
inline TO punning_cast(const FROM &input)
{
    TO out;
    std::memcpy(&out, &input, sizeof(TO));
    return out;
}

int main()
{
    uint32_t count = 0;

    auto fn = [&count] (uint32_t x) {
        float f = punning_cast<float>(x);
        if (f >= 0.0f && f <= 1.0f)
            count++;
    };

    for(uint32_t i = 0; i < std::numeric_limits<uint32_t>::max(); ++i)
    {
        fn(i);
    }
#ifdef X
    fn(0x3f800000); /* 1.0f */
#endif

    return count;
}

Result:

$ c++ -S -DX -O3 foo.cpp -std=c++11 -o foo.s
$ c++ -S -O3 foo.cpp -std=c++11 -o foo2.s
$ diff foo.s foo2.s
100d99
<   incl    %eax

Observe that Clang has converted the call to fn(0x3f800000) into simply an increment instruction, since the value decodes to 1.0. This is correct.

My guess is that Clang is tracing the function calls because they only involve constants, and that Clang is capable of tracing memcpy through type-punning (probably by simply emulating its effect on the constant value).

nneonneo
  • 2
    In that case I am almost surprised that Clang doesn't compile the whole thing to `movl $1065353217, %eax` – Chris_F May 29 '14 at 05:40
  • 6
    @Chris_F: I suspect that would take excessively long. Clang likely has some heuristic limits on the amount of tracing it is willing to do (otherwise compile times could easily go through the roof for no clear benefit). – nneonneo May 29 '14 at 05:45
  • 1
    Ah, that makes a lot of sense now that I try thinking about it. It takes several seconds to run on a fast processor, so if it did this kind of thing everywhere, nothing would ever finish compiling. – Chris_F May 29 '14 at 05:47