
Which of the ways in the tests below is most preferred in terms of dealing with undefined behavior, auto-vectorization (for struct-of-arrays layouts), and portability (clang, gcc, msvc, icc)?

Is there another way of doing the same operation?

#include <iostream>
#include <cstring>

union trick1
{
  float fvar;
  int ivar;
};

struct trick2
{
  float fvar;
  int ivar()
  {
      int result;
      std::memcpy(&result,&fvar,sizeof(float));
      return result;
  }
};

struct trick3
{
    float fvar;
    int ivar()
    {
        int result=0;
        asm ("mov %0,%0"
         : "=r" (result)
         : "0" (fvar));
        return result;
    }
};

struct trick4
{
    float fvar;
    int ivar()
    {
        int result;
        result = *reinterpret_cast<int*>(&fvar);
        return result;
    }
};

int main()
{
    trick1 test1;
    test1.fvar = 3.14f;
    // 1078523331
    std::cout<<test1.ivar<<std::endl;

    trick2 test2;
    test2.fvar = 3.14f;
    // 1078523331
    std::cout<<test2.ivar()<<std::endl;
    
    trick3 test3;
    test3.fvar = 3.14f;
    // 1078523331
    std::cout<<test3.ivar()<<std::endl;  
    
    trick4 test4;
    test4.fvar = 3.14f;
    // 1078523331
    std::cout<<test4.ivar()<<std::endl;  
    return 0;
}

For example, is memcpy OK for converting an array of floats to an array of integers bitwise?
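For instance, something like this loop over a struct-of-arrays (the function and names here are just for illustration, not from a real codebase):

```cpp
#include <cstring>
#include <cstddef>

// Reinterpret n floats as their integer bit patterns, one memcpy per
// element. The hope is that the compiler recognizes the fixed-size
// memcpy and turns the whole loop into a plain (vectorized) copy.
void bitwise_convert(const float* src, int* dst, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        std::memcpy(&dst[i], &src[i], sizeof(float));
}
```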

Peter Cordes
huseyin tugrul buyukisik
  • 1
    If you have C++20, use `std::bit_cast` (it is also `constexpr`); otherwise the only portable safe way is `memcpy` (there must be a duplicate of this question ...) – chtz May 02 '22 at 09:36
  • So bit_cast is both auto-vectorizable and non-UB, and has support from icc, gcc, clang and msvc? – huseyin tugrul buyukisik May 02 '22 at 09:37
  • 3
    I would just add that compilers are usually clever enough to optimize the `memcpy` case such that no unnecessary memory operations are involved: https://godbolt.org/z/8rGn33qG4. – Daniel Langr May 02 '22 at 09:39
  • For a simple function like that they look identical, register-based rather than memory-copying. But real programs are often much more complex, the compiler adds code bloat, and I'm afraid of unnecessary register moves being replaced by memory moves. – huseyin tugrul buyukisik May 02 '22 at 09:45
  • 2
    Demo comparing `memcpy` and `bit_cast` approaches: https://godbolt.org/z/c5sMzMeKh. Note that clang created the same machine code that calls `memcpy` internally. GCC used explicit vectorization with `bit_cast`. Hard to say without benchmarking what would be faster. – Daniel Langr May 02 '22 at 09:49
  • C++20 has a lot of good things like std::assume_aligned. – huseyin tugrul buyukisik May 02 '22 at 09:51

1 Answer

  • trick1 (union): Undefined behaviour in ISO C++, unlike ISO C99.
    The C++ compilers you mentioned support it as an extension in C++.

  • trick2 (std::memcpy) is your best choice before C++20: Well defined with the precondition that sizeof(int) == sizeof(float), but not as simple as std::bit_cast. Mainstream compilers handle it efficiently, not actually doing an extra copy of anything (effectively optimizing it away), as long as the copy size is a single primitive type and writes all the bytes of the destination object.

  • trick3 (inline asm): Non-standard; not portable (neither CPU arch nor compiler). Seriously hinders optimisation, including auto-vectorization.

  • trick4 (deref a reinterpret_cast pointer): Undefined behaviour in ISO C++, and in practice on many real compilers (notably GCC and Clang), unless you compile with gcc -fno-strict-aliasing.


I recommend C++20 std::bit_cast when applicable. It's as efficient as memcpy, with cleaner syntax:

return std::bit_cast<int>(fvar);
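As a sketch of how the pre-C++20 fallback can live alongside it (the helper name is made up for illustration; the `__cpp_lib_bit_cast` feature-test macro is the standard way to detect library support):

```cpp
#include <cstring>

#if defined(__cpp_lib_bit_cast)
#include <bit>
#endif

// Bitwise reinterpret a float as an int without undefined behaviour.
inline int float_bits(float f)
{
#if defined(__cpp_lib_bit_cast)
    return std::bit_cast<int>(f);             // C++20: also constexpr
#else
    int result;
    std::memcpy(&result, &f, sizeof(float));  // well-defined pre-C++20
    return result;
#endif
}
```

Both branches compile to the same register move on mainstream compilers; no actual memory copy survives optimization.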
Peter Cordes
eerorika
  • Do you think memcpy is auto-vectorized at the highest width, like 512 bits at a time, on a CPU with AVX-512? – huseyin tugrul buyukisik May 02 '22 at 09:39
  • @huseyintugrulbuyukisik I wouldn't make assumptions about autovectorisation. Try it out and read the assembly. – eerorika May 02 '22 at 09:39
  • Linus Torvalds said something about AVX-512 and wished it a painful death just because someone implemented an AVX-512 version of memcpy. That's why I added memcpy to the list. – huseyin tugrul buyukisik May 02 '22 at 09:41
  • Is std::copy better than std::memcpy for this? (by converting pointer to array first) – huseyin tugrul buyukisik May 02 '22 at 09:43
  • 1
    @huseyintugrulbuyukisik It would work just as well. It would be more verbose due to needing reinterpret casts. – eerorika May 02 '22 at 09:44
  • 1
    @huseyintugrulbuyukisik: `std::copy` with type-punned pointers would only be safe if you cast to `char*`, not to `int*` or `float*`. (Or to `float (*)[1]` pointer to array like you said, although IDK why you'd want that). **Use `memcpy` if `std::bit_cast` isn't available; compilers understand it just fine as a type-pun**, better than a `char*` copy loop like you might get from std::copy. e.g. GCC defines memcpy by default as `__builtin_memcpy`, which is handled as well as you'd hope when the size is the same as the type-width of what you're copying. – Peter Cordes May 02 '22 at 09:54
  • 2
    For compilers that define the behaviour of union type-punning (most but not all actual C++ compilers) or reinterpret casts (just MSVC and ICC, or `gcc -fno-strict-aliasing`), those are as good as memcpy. – Peter Cordes May 02 '22 at 09:54
  • 1
    @huseyintugrulbuyukisik https://stackoverflow.com/questions/4707012/is-it-better-to-use-stdmemcpy-or-stdcopy-in-terms-to-performance – Daniel Langr May 02 '22 at 09:55
  • So, if I had to implement the Quake 3 inverse square root, which does a floating-point bitwise conversion, for a whole array of floats, it would be better to use std::bit_cast in C++20 and std::memcpy everywhere else. (Just to experiment with the issue, not for performance.) – huseyin tugrul buyukisik May 02 '22 at 09:57
  • 2
    @huseyintugrulbuyukisik: yes, that's least likely to hurt auto-vectorization of the algorithm using it. BTW, you don't want this single-element memcpy to be "vectorized", you want it to optimize away entirely, e.g. using SIMD-integer shifts in XMM/YMM/ZMM registers if you were doing a legacy bithack fastinvsqrt instead of using an `rsqrtps` intrinsic. – Peter Cordes May 02 '22 at 10:03
  • @PeterCordes yes, it's about struct of arrays; I used trick1-4 as they are simpler to read. Also, the compiler does not use rsqrtps until -ffast-math is added; just 1.0f/std::sqrt is not enough. – huseyin tugrul buyukisik May 02 '22 at 10:05
  • 1
    @huseyintugrulbuyukisik: Trick 4 is undefined behaviour on multiple mainstream compilers (GCC / clang), don't use it. And you'd normally want to avoid telling a compiler to copy a whole struct of arrays; it might actually copy them since it's too large to just optimize away the entire mem-to-mem copy and load into the other type of register. That's a totally different question from what you asked, punning one `float` at a time. – Peter Cordes May 02 '22 at 10:07
  • @PeterCordes the godbolt.org runtime was crashing with trick4; now it looks like it doesn't. I didn't know the cause, and I was not using that function anywhere. Can its mere existence crash the program, like the "program returned 132" error? – huseyin tugrul buyukisik May 02 '22 at 10:09
  • 1
    @huseyintugrulbuyukisik: No, UB in a function you never call isn't allowed to crash your program. That would be a compiler bug. You said "runtime", so I guess you mean running the program not an internal compiler error. – Peter Cordes May 02 '22 at 10:12
  • 1
    @huseyintugrulbuyukisik: Of course `1.0f / sqrt` doesn't compile to just an `rsqrtps`, even with `-ffast-math`, it does that plus a Newton iteration. That's not what I said. You would need to use an intrinsic, like `_mm_rsqrt_ps` if you want much lower precision than actual `sqrt`. I don't know of a compiler option to treat `1/sqrt()` as allowing about half the precision. There might be a portable library with a function that uses rsqrt on x86 or equivalent on other ISAs, where something like that exists, but I don't know of one unfortunately. Maybe SIMDe has portable intrinsics... – Peter Cordes May 02 '22 at 10:15
  • 1
    @eerorika: trick 3 is so nasty that it's not even portable to clang: internal compiler error when asking for a float in an integer register. https://godbolt.org/z/1erTMEe5K. GCC does do it as a type-pun, not an FP->int conversion, but I had to check to be sure! And of course it totally defeats all possibility of optimization, including constant-propagation, value-range stuff, or auto-vectorization. And even in the best case, wastes a `mov` instruction. Your answer could say way nastier things about it than "not portable", like "total garbage". – Peter Cordes May 02 '22 at 10:33
  • `std::bit_cast` is a C++20 only solution, while the question also seems to ask for older versions. So "when applicable" is very limited, as most applications likely don't use C++20 yet. – JHBonarius May 03 '22 at 07:17