In a SIMD tutorial I found the following code snippet.

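#include <xmmintrin.h> // SSE intrinsics: __m128, _mm_sqrt_ps, _mm_store_ps

void simd(float* a, int N)
{
    // We assume N % 4 == 0 (and _mm_store_ps needs a to be 16-byte aligned).
    int nb_iters = N / 4;
    __m128* ptr = reinterpret_cast<__m128*>(a); // (*)

    for (int i = 0; i < nb_iters; ++i, ++ptr, a += 4)
        _mm_store_ps(a, _mm_sqrt_ps(*ptr));
}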

Now my question is: is the line marked (*) undefined behaviour, given the following rule from cppreference (https://en.cppreference.com/w/cpp/language/reinterpret_cast)?

Whenever an attempt is made to read or modify the stored value of an object of type DynamicType through a glvalue of type AliasedType, the behavior is undefined unless one of the following is true:

  • AliasedType and DynamicType are similar.
  • AliasedType is the (possibly cv-qualified) signed or unsigned variant of DynamicType.
  • AliasedType is std::byte (since C++17), char, or unsigned char: this permits examination of the object representation of any object as an array of bytes.

How can undefined behaviour be avoided in this case? I'm aware that I could use std::memcpy, but the performance penalty would make the SIMD useless, or am I wrong about this?

user1235183
  • Please consider Peter's answer when choosing which one to accept. While I don't think I said anything wrong in mine, Peter has (as usual) much more pertinent information. – Max Langhof Nov 18 '19 at 09:18

2 Answers

Edit: Please look at the answer in the duplicate (and/or Peter's answer here). What I write below is technically correct but not really relevant in practice.


Yes, that would be undefined behavior based on the C++ standard. Your compiler might still handle it correctly as an extension (seeing as SIMD types and intrinsics are not part of the C++ standard in the first place).

To do this safely and correctly without compromising speed, you would use the intrinsic that loads 4 floats directly from memory into a 128-bit register:

__m128 reg = _mm_load_ps(a);

See the Intel Intrinsics Guide for the important alignment constraint:

__m128 _mm_load_ps (float const* mem_addr)

Load 128-bits (composed of 4 packed single-precision (32-bit) floating-point elements) from memory into dst. mem_addr must be aligned on a 16-byte boundary or a general-protection exception may be generated.
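As a rough sketch of how the loop from the question might look with explicit load/store intrinsics instead of the pointer cast: the function names `simd_sqrt_aligned` and `simd_sqrt_unaligned` are made up for illustration, and both versions assume the element count is a multiple of 4. The second variant uses `_mm_loadu_ps` / `_mm_storeu_ps`, which drop the 16-byte alignment requirement.

```cpp
#include <cstddef>
#include <xmmintrin.h> // SSE: __m128, _mm_load_ps, _mm_loadu_ps, _mm_sqrt_ps, ...

// Hypothetical rewrite of the question's loop (names are illustrative only).
// Assumes n % 4 == 0 and that a is aligned on a 16-byte boundary.
void simd_sqrt_aligned(float* a, std::size_t n)
{
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_load_ps(a + i);        // aligned 128-bit load of 4 floats
        _mm_store_ps(a + i, _mm_sqrt_ps(v));  // aligned 128-bit store of the 4 roots
    }
}

// Same loop, but with no alignment requirement on a.
// Still assumes n % 4 == 0.
void simd_sqrt_unaligned(float* a, std::size_t n)
{
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(a + i);        // unaligned load
        _mm_storeu_ps(a + i, _mm_sqrt_ps(v));  // unaligned store
    }
}
```

With the aligned version, the caller is responsible for the alignment, for example by declaring the buffer as `alignas(16) float data[8];`.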

Max Langhof
  • There's also `_mm_loadu_ps()` for unaligned loads. – Shawn Nov 18 '19 at 09:12
  • Part of supporting the intrinsics API is defining the behaviour of `__m128*` pointers as being allowed to alias anything. If `_mm_load_ps` exists, then the OP's code is safe. – Peter Cordes Nov 18 '19 at 09:13
  • @PeterCordes I'll keep in mind to dupe-search first next time. The dupe is indeed much better. – Max Langhof Nov 18 '19 at 09:19
  • @MaxLanghof: heh, same, I was trying to get an answer in quickly to refute yours before taking the time to search, so this somewhat misleading answer that suggests the tutorial might be broken didn't get all the votes. Turns out it was a perfectly exact dupe; something that rarely happens in asm / SIMD questions. But probably not so rare for language-lawyer. – Peter Cordes Nov 18 '19 at 09:27

Intel's intrinsics API does define the behaviour of casting to __m128* and dereferencing: it's identical to _mm_load_ps on the same pointer.

For float* and double*, the load/store intrinsics basically exist to wrap this reinterpret cast and communicate alignment info to the compiler.

If _mm_load_ps() is supported, the implementation must also define the behaviour of the code in the question.


I don't know if this is actually documented anywhere; maybe in an Intel tutorial or whitepaper, but it's the agreed-upon behaviour of all compilers and I think most people would agree that a compiler that didn't define this behaviour didn't fully support Intel's intrinsics API.

__m128 types are defined as may_alias (see footnote 1), so, like char*, you can point a __m128* at anything, including int[] or an arbitrary struct, and load or store through it without violating strict aliasing. (This holds as long as the address is 16-byte aligned; otherwise you do need _mm_loadu_ps, or a custom vector type declared with something like GNU C's aligned(1) attribute.)


Footnote 1: __m128 is declared with __attribute__((vector_size(16), may_alias)) in GNU C; MSVC doesn't do type-based alias analysis at all.
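To illustrate the "custom vector type" idea, here is a minimal sketch for GNU-compatible compilers (GCC, Clang). The typedef name `unaligned_v4sf` and the function `sqrt4_in_place` are hypothetical, chosen only for this example; the attributes themselves are the GNU C ones mentioned above. On other compilers the sketch simply falls back to the unaligned load/store intrinsics.

```cpp
#include <xmmintrin.h> // SSE: __m128, _mm_sqrt_ps, _mm_loadu_ps, _mm_storeu_ps

#if defined(__GNUC__)
// GNU C extension (GCC/Clang only). Hypothetical type name:
// a 4-float vector that may alias any type and needs only 1-byte alignment.
typedef float unaligned_v4sf
    __attribute__((vector_size(16), aligned(1), may_alias));

// Replaces 4 consecutive floats at p with their square roots.
// p does not have to be 16-byte aligned.
void sqrt4_in_place(float* p)
{
    unaligned_v4sf* vp = reinterpret_cast<unaligned_v4sf*>(p);
    __m128 v = *vp;        // dereference acts as an unaligned 16-byte load
    *vp = _mm_sqrt_ps(v);  // assignment acts as an unaligned 16-byte store
}
#else
// Fallback for compilers without the GNU attributes.
void sqrt4_in_place(float* p)
{
    _mm_storeu_ps(p, _mm_sqrt_ps(_mm_loadu_ps(p)));
}
#endif
```

In practice `_mm_loadu_ps` / `_mm_storeu_ps` are the portable way to express this; the typedef is mainly useful for seeing what those intrinsics wrap.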

Peter Cordes
  • Was going to link another Q&A about more stuff that Intel intrinsics require implementations to define, but turns out [Is `reinterpret_cast`ing between hardware vector pointer and the corresponding type an undefined behavior?](//stackoverflow.com/q/52112605) is an exact duplicate. – Peter Cordes Nov 18 '19 at 09:19