
I want to convert to and from __m256i instances and std::vector<uint32_t> instances (containing exactly 8 elements).

So far I came up with this:

#include <immintrin.h>
#include <cassert>
#include <cstdint>
#include <vector>

using vu32 = std::vector<uint32_t>;

__m256i v2v(const vu32& in) {
    assert(in.size() == 8);
    return _mm256_loadu_si256(reinterpret_cast<const __m256i*>(in.data()));
}

vu32 v2v(__m256i in) {
    vu32 out(8);
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out.data()), in);
    return out;
}

Is it safe?

Is there a more idiomatic way to do it?

BeeOnRope
  • If the `vector` has a fixed length, why not `std::array` instead? That might be a bit more idiomatic, and there is less language-lawyering regarding safety since the class is just a wrapper around a plain array. Plus, you won't have the dynamic-allocation overhead that would typically come with `std::vector` (I don't know of any implementations that have a small-vector optimization like many `std::string`s do). – Jason R Jun 24 '19 at 00:52
  • @JasonR - I agree, but as an external constraint I must use `std::vector` for this. Agreed in general about the efficiency, although `vector` does have some advantages in cases where moves can be used. – BeeOnRope Jun 24 '19 at 00:58
  • Fair enough. From a practical perspective, I don't see any issue with your example code. I'm sure someone more expert in C++ standardese can weigh in on whether it's fully defined according to the standard, but I've used very similar constructs countless times before. – Jason R Jun 24 '19 at 01:09
  • @PeterCordes - well SIMD vectors and `vector` have a lot to do with each other in the sense that they are both contiguous storage for N elements of a given type. SIMD tends to have a fixed N and the types are more flexible (e.g., you can essentially change the type from operation to operation for integer stuff, or use the no-op casts provided to convert between integer and FP domain). – BeeOnRope Jul 01 '19 at 22:30

1 Answer


Well first of all, SIMD vectors and std::vector have basically nothing to do with each other. I know you already know this, but future readers should think carefully about whether this is really something they want to do.


It's safe: .data() has to return a pointer that can be read or written at any valid index. It's certainly safe in practice, given the implementation details of real std::vector libraries, and I'm pretty sure it's also safe in the abstract, as far as the on-paper standard is concerned.

From comments, it seems you're worried about strict-aliasing UB.

Read/write of other objects via may_alias pointer types (including char* or __m256i*) is fine. memcpy(&a, &b, sizeof(a)) is a common example of modifying the object-representation of a via char*. There's nothing special about memcpy itself; that's well-defined because of the char* aliasing special case.

may_alias is a GNU C extension that lets you define types other than char which are allowed to alias the way char* can. GNU C's definition of __m128 / __m256i is in terms of GNU C native vectors, like typedef long long __m256i __attribute__((vector_size(32), may_alias));. Other C++ implementations (like MSVC) define __m256i differently, but the Intel intrinsics API guarantees that aliasing vector-pointers onto other types is legal in any case where char* / memcpy would be.

See also Is `reinterpret_cast`ing between hardware vector pointer and the corresponding type an undefined behavior?

Also: SSE: Difference between _mm_load/store vs. using direct pointer access - loadu / storeu are like casting to an aligned(1) version of the vector type before dereferencing. So all this reasoning about pointers and aliasing applies to passing a pointer to _mm_storeu, not just to dereferencing directly.


Idiomatic? Well sure, this looks like pretty idiomatic C++. I might still use C-style casts with intrinsics just because reinterpret_cast is so long to read, and the poorly-designed intrinsics API for integer vectors needs it all over the place. Maybe a templated wrapper function for si256 load/loadu and store/storeu would be appropriate, one that casts to __m256i* or const __m256i* from any pointer type.


I might prefer something that passed the __m256i elements to the constructor of out, though, to stop stupid compilers from potentially zeroing the memory and then storing the vector. But hopefully that doesn't happen.

In practice gcc and clang do optimize away the dead stores to zero 8 elements before storing the vector. Any attempt to use the vector(begin, end) iterator constructor instead makes things worse, with extra code for exception handling on top of the store/reload of in to the stack (around new), then storing it into the newly-allocated memory.

See some attempts on the Godbolt compiler explorer; note that they save/restore r13 where @Bee's version doesn't, as well as having extra code generated outside the normal path through the function. This goes away with -fno-exceptions, but then they're merely equal to, not better than, @Bee's version. So use the code in the question; it compiles at least as well as any of my attempts to be different.


I might also prefer doing something to get the new std::vector<uint32_t> allocated with 32-byte aligned memory, if that's possible without changing the template type. I'm not sure if that is possible.

Even just making this initial allocation aligned in practice (without changing the type, so there'd be no compile-time guarantee of alignment for future use) would potentially help: AVX code that leaves unaligned handling to hardware would benefit from not having cache-line splits.

But I don't think that's possible either without hacking a custom constructor for std::vector that does the initial allocation with an aligned new, assuming that's compatible with regular delete.

If you can use a std::vector<uint32_t, some_aligned_allocator> everywhere in your code, that might be worth doing. But probably not worth the trouble if you have to pass it to code that uses normal vector<uint32_t>.

You could lie to your compiler because that type is binary-compatible (but not source-compatible) with regular std::vector<uint32_t>, on systems where aligned new/delete are compatible with plain new/delete. But I don't recommend that.

Peter Cordes
  • I don't mean "idiomatic" in terms of whether the functionality offered by the `v2v` functions is idiomatic itself, but rather whether given that you need those functions (e.g., due to an external requirement), whether the _implementation_ shown is idiomatic in terms of SIMD intrinsics. – BeeOnRope Jul 01 '19 at 22:26
  • 1
    AFAIK it's not possible to get aligned memory for `std::vector` without changing the allocator, which changes its type. – BeeOnRope Jul 01 '19 at 22:28
  • About "might prefer something..." do you mean [like this](https://godbolt.org/z/jXvrwo)? Surprisingly, gcc seems to generate worse code in this case, apparently it also has to handle the possibility of an exception. Presumably because it has to allocate before copying the values from the iterator, and some iterators could throw (but this one cannot). clang gets it right and seems to generate comparable code in both cases. – BeeOnRope Jul 01 '19 at 22:38
  • @BeeOnRope: I think your version has strict-aliasing UB, but yeah I wasn't expecting that exception-handling overhead. I put a few different attempts on Godbolt and they're all worse unless you use `-fno-exceptions`. Then they're equal or still worse if they use 2 separate 128-bit loads/stores. Or maybe that's better if it saves another vzeroupper, but we're looking at a non-inlined version of this function. – Peter Cordes Jul 01 '19 at 23:30
  • Heh, I thought *your* version maybe had strict aliasing UB. Vector types are special wrt aliasing ("may alias"), but it is not clear to me in exactly which way. The original version writes to the vector data *inside* the `store` intrinsic, which I thought was always kosher. The pass-it-to-the-initializer version reinterprets the vector type as a `uint32_t` array and this was less clear to me. I am not sure if there is any aliasing issue in any of the cases though because no two objects actually alias. There is reinterpretation, but no aliasing. – BeeOnRope Jul 02 '19 at 00:17
  • @BeeOnRope: yes, your version is free of strict-aliasing UB. Dereferencing `reinterpret_cast(in)` is not ok, though, for `__m256i in`. My understanding is that `__m256i*` is like `char*`: you can access anything via `char*`, but you can't access a `char array[]` with `uint32_t*`. But if you have a `__v8si` vector of 8x `int`, accessing it (via the `std::vector` constructor) with `int*` may be legal because the types are alias-compatible. – Peter Cordes Jul 02 '19 at 00:21
  • I know even without a `may_alias` attribute on the vector type, casting an `int*` to `__v8si*` and dereferencing is ok. But Intel's API requires that you can load from a `float[]` array or a `__m256d` or whatever with `__m256i*`, so the vector types need a `may_alias` attribute to make them like `char`. (I don't think I've seen documentation that vector pointers are exactly like `char*`, I think that's just my assumption. But I've never been corrected on that by gcc devs, even when the subject has come up like IIRC Marc Glisse explaining that vector aliasing is one directional, like char*.) – Peter Cordes Jul 02 '19 at 00:24
  • @BeeOnRope: yes, I did. >. – Peter Cordes Jul 02 '19 at 00:26
  • So for `char *`, you can cast anything to `char *` and then read the representation, but I forget - it is also allowed to write through an aliased `char *` pointer? – BeeOnRope Jul 02 '19 at 00:28
  • Cause there is kind of a 2x2 matrix here, from {read, write} and {cast to vector type, cast from vector type}. I think you are saying that casting _from_ a vector type to something else (like a `short *` or whatever) and then making accesses to that other thing is not allowed, possibly with an exception for "alias compatible" types like `int` (but that seems like it would be highly implementation specific, since the `__m256i` type should be opaque), right? Then you have the cast-to case, and you think read is allowed, but what about write? – BeeOnRope Jul 02 '19 at 00:30
  • Finally, you have the question of {inside, outside an intrinsic}. Where `inside` means there is no actual C++-level dereference of any pointers, you only pass things to a `load/store/whatever` intrinsic, and "outside" means you are doing C++-level dereferences. – BeeOnRope Jul 02 '19 at 00:32
  • @BeeOnRope: Read/write of other objects via `may_alias` pointer types (including `char*`) is fine. `memcpy(&a, &b, sizeof(a))` is a common example of modifying the object-representation of `a` via `char*`. There's nothing special about memcpy itself; that's well-defined because of the `char*` aliasing special case. – Peter Cordes Jul 02 '19 at 00:33
  • Interesting, I thought `memcpy` was a special snowflake that got its abilities directly from language in the standard, not via the `char *` hole for aliasing, but maybe I was mistaken... – BeeOnRope Jul 02 '19 at 00:40
  • @BeeOnRope: I'm pretty sure you're mistaken. `memcpy` and `memset` are by far the most common example of using `char*` aliasing, perhaps used as examples in the standard itself. But I'm sure that a manually-written `my_memcpy` like `*dst++ = *src++;` would not create UB anywhere that `memcpy` itself is legal. Non-trivially-copyable types can't be memcpyed even by standard memcpy, but anything else can be. – Peter Cordes Jul 02 '19 at 00:44
  • Right, but it's because `memcpy` is apparently defined in terms of char arrays, which I wasn't aware of. – BeeOnRope Jul 02 '19 at 02:42
  • No, the [prototype](https://en.cppreference.com/w/cpp/string/byte/memcpy) is `void *` (otherwise it would be very annoying to use, what with all the explicit casts that would be needed). That's the first thing I checked! Not that the parameter types really matter: what matters is what happens "inside" and since `memcpy` isn't defined in terms of source, you have to read the spec. It talks about objects as if they were arrays of `unsigned char`, so I agree with you... – BeeOnRope Jul 02 '19 at 02:59
  • @BeeOnRope: ah right, that's why we can call it without casts in C. Anyway yeah, one of those cases where the behaviour we use is just one consequence of a general rule, not a special case of its own. – Peter Cordes Jul 02 '19 at 03:01