
I hope this won't turn out to be a really dumb question I'll be embarrassed about later, but I've always been confused about SIMD intrinsics, to the point where I find it easier to reason about assembly code than about the intrinsics themselves.

So the main question I have is about using SIMD intrinsic data types like __m256. And just to skip to the point, my question is about doing things like this:

class PersistentObject
{
    // ...
private:
    std::vector<__m256, AlignedAlloc<__m256, 32>> data;
};

Is that gross? Acceptable? Will it trip up compilers when it comes to generating the most efficient code? That's the part that's confusing me right now. I'm at the inexperienced level where, when I have a hotspot and have exhausted all other immediate options, I give SIMD intrinsics a shot, always ready to back out my changes if they don't improve performance (and I've backed out so many SIMD-related changes).

But this question and the confusion I have about storing SIMD intrinsic types persistently also made me realize that I don't really understand how these intrinsics work at a fundamental compiler level. My mind wants to think of __m256 as an abstract YMM register (not necessarily allocated yet). That starts to click with me when I see the load and store intrinsics; I think of them as hints for the compiler to perform its register allocation.

And I didn't have to put much more thought into it than that before, because I always used SIMD types in a temporary way: _mm256_load_ps into a __m256, do some operations, store the results back to a 32-byte-aligned array of single-precision floats (float[8]). I got away with thinking of __m256 like a YMM register.
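To make that concrete, here's roughly the short-lived pattern I've been using so far (a trivial made-up example; the function and names aren't from my real code):

#include <immintrin.h>
#include <cstddef>

// Scale n floats (n assumed to be a multiple of 8, both arrays 32-byte aligned).
void scale(float* dst, const float* src, float factor, std::size_t n)
{
    const __m256 vfactor = _mm256_set1_ps(factor);
    for (std::size_t i = 0; i < n; i += 8)
    {
        __m256 v = _mm256_load_ps(src + i);   // aligned load of 8 floats
        v = _mm256_mul_ps(v, vfactor);        // work on the "register" value
        _mm256_store_ps(dst + i, v);          // store straight back to a float array
    }
}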

Abstract YMM Register?

But recently I was implementing a data structure designed around SIMD processing (a simple one representing a bunch of vectors in SoA fashion), and here it becomes convenient if I can just work predominantly with __m256 without constantly loading from an array of floats and storing the results back afterwards. And in some quick tests, MSVC at least seems to emit the appropriate instructions mapping my intrinsics to assembly (along with proper aligned loads and stores when I access data out of the vector). But that breaks my conceptual model of thinking of __m256 as an abstract YMM register, because storing these things persistently implies something more like a regular variable, but at that point what's up with the loads/movs and stores?

So I'm tripping a bit over the conceptual model I built in my head about how to think of all this stuff, and my hope is that maybe someone experienced can immediately recognize what's broken with the way I'm thinking about this stuff and give me that eureka answer which debugs my brain. I hope this question isn't too dumb (I have an uneasy feeling that it is, but I have tried to discover the answer elsewhere only to still find myself confused). So ultimately, is it acceptable to directly store these data types persistently (implying that we'd reload the memory at some point after it already spilled out of a YMM register without using _mm_load*), and, if so, what's wrong with my conceptual model?

Apologies if this is such a dumb question! I'm really wet behind the ears with this stuff.

Some More Details

Thanks so much for the helpful comments so far! I suppose I should share some more details to make my question less fuzzy. Basically I'm trying to create a data structure which is little more than a collection of vectors stored in SoA form:

xxxxxxxx....
yyyyyyyy....
zzzzzzzz....

... and mainly with the intention of being used for hotspots where the critical loops have a sequential access pattern. But at the same time the non-critical execution paths might want to randomly access a 5th 3-vector in AoS form (x/y/z), at which point we're inevitably doing scalar access (which is perfectly fine even if it's not so efficient, since those aren't critical paths).

In this one peculiar case, I'd find it a lot more convenient from an implementation standpoint to just persistently store and work with __m256 instead of float*. It would keep me from sprinkling a lot of vertical, loopy code with _mm_load* and _mm_store*, because the common case in this scenario (both in terms of critical execution and the bulk of the code) is implemented with SIMD intrinsics. But I'm not sure if this is a sound practice over reserving __m256 for short-lived temporary data local to some function: load some floats into __m256, do some operations, and store the results back, as I usually have done in the past. It would be quite a bit more convenient, but I'm a bit worried that this convenient type of implementation might choke some optimizers (though I haven't found that to be the case yet). And if it doesn't trip up optimizers, then the way I've been thinking about these data types has been a bit off all this time.
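To sketch roughly what I mean (hypothetical names, details elided; assumes the same AlignedAlloc as in the snippet above):

#include <immintrin.h>
#include <cstddef>
#include <vector>

struct Vectors3
{
    // Each __m256 holds 8 x-components, 8 y-components, or 8 z-components.
    std::vector<__m256, AlignedAlloc<__m256, 32>> x, y, z;

    // Critical path: stays in __m256 the whole time, no explicit load/store intrinsics.
    void add(const Vectors3& other)
    {
        for (std::size_t i = 0; i < x.size(); ++i)
        {
            x[i] = _mm256_add_ps(x[i], other.x[i]);
            y[i] = _mm256_add_ps(y[i], other.y[i]);
            z[i] = _mm256_add_ps(z[i], other.z[i]);
        }
    }

    // Non-critical path: scalar access to the nth vector's x component.
    float x_at(std::size_t n) const
    {
        alignas(32) float tmp[8];
        _mm256_store_ps(tmp, x[n / 8]);   // spill one block to a temporary array
        return tmp[n % 8];
    }
};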

So in this case, if it's perfectly fine to do this stuff and our optimizers handle it brilliantly all the time, then I'm confused, because my assumption that we needed those explicit _mm_load and _mm_store in short-lived contexts (i.e., local to a function) to help out our optimizers was all wrong! And it sorta upsets me that this works fine, because I didn't think it was supposed to work fine! :-D

Answers

There are a couple of comments from Mysticial that really hit the spot for me and helped fix my brain a bit, as well as giving me some reassurance that what I want to do is all right. They were given as comments instead of an answer, so I'll quote them here in case anyone else ever has the same confusion I had.

If it helps, I have about 200k LOC written exactly like this. IOW, I treat the SIMD type as a first-class citizen. It's fine. The compiler handles them no differently than any other primitive type. So there are no issues with it.

The optimizers aren't that flimsy. They do maintain correctness within reasonable interpretations of the C/C++ standards. The load/store intrinsics aren't really needed unless you need the special ones (unaligned, non-temporal, masked, etc...)

That said, please feel free to write your own answers as well. The more info the merrier! I'm really hoping to improve my fundamental understanding of how to write SIMD code with greater confidence, since I'm at the stage where I'm hesitant about everything and still second-guessing myself a lot.

Reflecting Back

Thanks again so much to everyone! I feel so much clearer now and more confident about designing code built around SIMD. For some reason I was extremely suspicious of the optimizer just for SIMD intrinsics, thinking I had to write my code in the lowest-level way possible and keep those loads and stores as local as possible, in a limited function scope. I think some of my superstitions came about from originally writing SIMD intrinsics against older compilers almost a couple of decades ago, when maybe the optimizers needed more help, or maybe I've just been irrationally superstitious the whole time. I was looking at it kind of like how people looked at C compilers in the 80s, putting things like register hints here and there.

With SIMD I've always had very mixed results, and despite using it here and there every once in a blue moon, I have a tendency to constantly feel like a beginner, perhaps if only because the mixed success has made me reluctant to use it, which has significantly delayed my learning process. Lately I'm trying to correct that, and I really appreciate all the help!

    If it helps, an `__m256` (or any other SSE/AVX vector type) works in conceptually the same way as, say, an `int`. The compiler will try to keep it in a register where possible, but it's free to spill it to memory if needed, either due to register pressure, or where language rules require it (taking address, aliasing, etc). – Paul R Jan 23 '18 at 17:40
  • @PaulR I see, so we can think of them in a very generalized way like this. There's something I found that just doesn't 100% click when I look at loads and stores and how it's often discouraged to access the data fields of `__m256` and instead use store instructions, e.g. I'm always confused about exactly how free we are to treat these types like regular variables, and how much we can do with them freely without confusing our optimizers. –  Jan 23 '18 at 17:43
    In general for efficient SIMD programming you should hardly ever need to access individual vector elements (at least not within the "hot" part of your code - inner loops, etc). Rule of thumb: your SIMD operations should be "vertical", and operate homogeneously on contiguous elements in memory. – Paul R Jan 23 '18 at 17:46
  • @PaulR One of the things I learned early on getting my feet wet with this stuff is to avoid casting to/from other data types like `float*` or `int32_t*` and instead use load and store instructions. And maybe some of my confusion is related to that, because it seems like there could be a lot of ways to trip up our optimizers. There are occasionally times where it'd be convenient to just fetch a scalar field -- like for non-critical code if we created a generalized data structure with random access of a specific nth vector/component. But I have often erred on the side of just avoiding such [...] –  Jan 23 '18 at 17:48
  • @PaulR ... things outright by avoiding the use of `__m256` outside of short-lived contexts, and instead work with an array of floats in those random-access cases. But I recently found a case where it might be considerably more convenient to just persistently store and work with `__m256` 90% of the time. –  Jan 23 '18 at 17:49
    I'm just trying to dig out some related/relevant material here on SO, which might help, e.g. [this answer](https://stackoverflow.com/a/19378356/253056) and [this one](https://stackoverflow.com/a/31369299/253056). – Paul R Jan 23 '18 at 17:49
    Nothing wrong with switching between vectorized access and scalar access *per se*, but ideally you want to make sure you do plenty of work in each domain between switches, to amortize the cost of loading/storing data from/to memory. – Paul R Jan 23 '18 at 17:54
    @TeamUpvote If it helps, I have about 200k LOC written exactly like this. IOW, I treat the SIMD type as a first-class citizen. It's fine. The compiler handles them no differently than any other primitive type. So there are no issues with it. – Mysticial Jan 23 '18 at 18:35
  • @Mysticial That's awesome! Thanks so much. I was confused a bit about the loads and stores because with a local variable, I thought the compiler could kind of "pin the memory location" to some degree, even if there are stack spills (sorry if the way I'm describing this is so weird, I don't know the terminology). But when we start storing these things persistently in some container (with properly aligned allocation, of course), then it's moving all over the place in memory and I had this suspicion that it would confuse the optimizer. If it doesn't, that's so fantastic. –  Jan 23 '18 at 18:38
  • @Mysticial Basically I was thinking of `_mm_load*` intrinsics as kinda saying, "Hey compiler, this region of memory should move into a YMM register, and I'm telling you whether it's aligned or not." And I thought once we start storing something like `__m256 ` persistently, we're not supplying the optimizer such info anymore which I thought might be crucial. –  Jan 23 '18 at 18:40
    The optimizers aren't that flimsy. They do maintain correctness within reasonable interpretations of the C/C++ standards. The load/store intrinsics aren't really needed unless you need the special ones (unaligned, non-temporal, masked, etc...) – Mysticial Jan 23 '18 at 18:40
  • @Mysticial That's great to hear! I think I was on the paranoid side thinking the optimizers were quite flimsy and thought SIMD intrinsics were a very low-level kind of abstraction that still requires us to provide as much info as possible about the nature of the memory for the optimizers to do a good job. You kinda hit the spot with my question -- "flimsy" was kind of a keyword I had in the back of my head about all this stuff. –  Jan 23 '18 at 18:43
  • @Mysticial There is kind of a curiosity there for me... like why can't we just use `_mm_mul*` on `float*`? What does `__m128`, `__m256` tell the compiler as a separate data type? Is it telling it that the data is guaranteed to be properly aligned to 128-bit or 256-bit boundaries once it's represented as `__m256` as opposed to `float*`, e.g.? –  Jan 23 '18 at 18:47
  • @TeamUpvote Probably because the instruction natively works on registers (though the last operand can be from memory). They also probably didn't want to provide overloads for every single combination of inputs. And since they need to be compatible with C, they'd need different names for all those overloads. Right now we're at something like 5000 intrinsics - enough to have performance implications for parsing the header. – Mysticial Jan 23 '18 at 18:56
  • @Mysticial I see! Thanks so much to you and everyone else. My mind is clearing up quite a bit already, and I'm also really happy to hear that I can use these types like "first-class citizens". –  Jan 23 '18 at 19:00
    Also, don't look for too much logic in those intrinsics, they were chosen a bit randomly, not particularly cleverly or even consistently. – Marc Glisse Jan 23 '18 at 21:14

1 Answer


Yes, __m256 works as a regular type; it doesn't have to be register-only. You can make arrays of __m256, pass them by reference to non-inline functions, and whatever else.

The main caveat is that it's an "over-aligned" type: the compiler assumes that a __m256 in memory is 32-byte aligned, but std::max_align_t usually only has 8- or 16-byte alignment on mainstream C++ implementations. So you need that custom allocator for std::vector or other dynamic allocations, because a std::vector<__m256> with the default allocator will allocate memory that isn't sufficiently aligned to store __m256. Thanks, C++ (although C++17 apparently will finally fix that).
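For reference, a minimal allocator along the lines of the AlignedAlloc in the question could look roughly like this (a sketch only; _mm_malloc/_mm_free come from the intrinsics headers, or you could build the same thing on posix_memalign or C++17 aligned new):

#include <immintrin.h>   // _mm_malloc / _mm_free (MSVC declares them in <malloc.h>)
#include <cstddef>
#include <new>

template <class T, std::size_t Align>
struct AlignedAlloc
{
    using value_type = T;

    AlignedAlloc() = default;
    template <class U>
    AlignedAlloc(const AlignedAlloc<U, Align>&) {}   // rebind support

    T* allocate(std::size_t n)
    {
        void* p = _mm_malloc(n * sizeof(T), Align);  // over-aligned allocation
        if (!p) throw std::bad_alloc();
        return static_cast<T*>(p);
    }
    void deallocate(T* p, std::size_t) { _mm_free(p); }
};

template <class T, class U, std::size_t A>
bool operator==(const AlignedAlloc<T, A>&, const AlignedAlloc<U, A>&) { return true; }
template <class T, class U, std::size_t A>
bool operator!=(const AlignedAlloc<T, A>&, const AlignedAlloc<U, A>&) { return false; }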


But that breaks my conceptual model of thinking of __m256 as an abstract YMM register, because storing these things persistently implies something more like a regular variable, but at that point what's up with the loads/movs and stores?

The __m128 _mm_loadu_ps(float*) / _mm_load_ps intrinsics mainly exist to communicate alignment information to the compiler, and (for FP intrinsics) to type-cast. For integer vectors they don't even do the type-cast for you: you have to cast the pointers to __m128i* yourself.

(AVX512 intrinsics finally use void* instead of __m512i*, though.)
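A tiny illustration (the function names here are just made up):

#include <immintrin.h>
#include <cstdint>

// FP loads take a plain float*; the intrinsic mostly just asserts alignment.
__m256 load_floats(const float* p)
{
    return _mm256_load_ps(p);
}

// Integer loads want a __m256i*, so you do the pointer cast yourself.
__m256i load_ints(const std::int32_t* p)
{
    return _mm256_load_si256(reinterpret_cast<const __m256i*>(p));
}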

_mm256_load_ps(fp) is basically equivalent to *(__m256*)fp: an aligned load of 8 floats. __m256* is allowed to alias other types, but (as I understand it) the reverse is not true: it's not guaranteed to be safe to read an element of a __m256 my_vec with code like ((float*)&my_vec)[3]. That would be a strict-aliasing violation, although it does work in practice at least most of the time on most compilers.

(See Get member of __m128 by index?, and also print a __m128i variable for a portable way: storing to a tmp array often optimizes away. But if you want a horizontal sum or something, it's usually best to use vector shuffle and add intrinsics, instead of hoping the compiler will auto-vectorize a store + scalar add loop.)
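For example, a shuffle-based horizontal sum of a __m256 might look like this (one of several equivalent sequences, shown only as an illustration):

#include <immintrin.h>

float hsum256_ps(__m256 v)
{
    __m128 lo  = _mm256_castps256_ps128(v);               // low 128 bits (free)
    __m128 hi  = _mm256_extractf128_ps(v, 1);             // high 128 bits
    __m128 sum = _mm_add_ps(lo, hi);                      // 8 -> 4 elements
    sum = _mm_add_ps(sum, _mm_movehl_ps(sum, sum));       // 4 -> 2
    sum = _mm_add_ss(sum, _mm_shuffle_ps(sum, sum, 1));   // 2 -> 1
    return _mm_cvtss_f32(sum);
}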


Maybe at some point in the past when intrinsics were new, you really did get a movaps load every time your C source contained _mm_load_ps, but at this point it's not particularly different from the * operator on a float*; the compiler can and will optimize away redundant loads of the same data, or optimize vector store / scalar reload into a shuffle.


But at the same time the non-critical execution paths might want to randomly access a 5th 3-vector in AoS form (x/y/z), at which point we're inevitably doing scalar access.

The biggest caveat here is that the code to get scalars out of __m256 objects is going to be ugly, and possibly not compile efficiently. You can hide the ugliness with wrapper functions, but efficiency problems might not go away easily, depending on your compiler.

If you write portable code that doesn't use gcc-style my_vec[3] or MSVC's my_vec.m256_f32[3], storing the __m256 to an array like alignas(32) float tmp[8] might not optimize away, and you might get a load into a YMM register and a store (and then a vzeroupper).
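A small wrapper can at least keep the per-compiler ugliness in one place (just a sketch; whether the fallback branch optimizes away is up to your compiler):

#include <immintrin.h>
#include <cstddef>

float get_element(__m256 v, std::size_t i)
{
#if defined(__clang__) || defined(__GNUC__)
    return v[i];                       // GNU vector-extension subscripting
#elif defined(_MSC_VER)
    return v.m256_f32[i];              // MSVC exposes the elements as a member array
#else
    alignas(32) float tmp[8];          // portable fallback: store, then reload one lane
    _mm256_store_ps(tmp, v);
    return tmp[i];
#endif
}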

Peter Cordes
  • Thanks so much! This clears things up even further for me. And I started out with SIMD intrinsics in the late 90s and early 2000s, when I think it was, as you said, mapping things like `_mm_load_ps` directly to `movaps`, and long before optimizers were auto-vectorizing things. I think part of my former confusion related to those artifacts of older compilers. I was still thinking of things mapping on a relatively one-to-one basis with instructions and registers (with the exception of intrinsics which were noted not to do so). –  Jan 23 '18 at 20:52
    @MarcGlisse Ooh, could you give me a bit more or point me to a resource (apologies if it's lazy, I did try a google search on alignment guarantees that come with C++17 but turned up short). How does that work? Does `std::allocator` use `alignof(T)` now and with an aligned heap allocation? I'm a bit behind on the latest compilers (have to use older ones at my workplace unfortunately). –  Jan 23 '18 at 21:16
    TBH the real problem is implementations defining both `max_align_t` and `__m256`, and being inconsistent between the two. That's not a C++ problem, that's just an implementation flaw. I can understand why it's happening (different parts of the compiler team not speaking to each other) but it's rather annoying. It's fairly obvious the intrinsics people have an assembly background, with a bit of C experience. – MSalters Jan 24 '18 at 10:47
  • @MSalters I see, that also helps me out a bit since I was under the impression before that perhaps we had to use those intrinsics more directly. I was reluctant to wrap it out of fear that it might confuse optimizers (though also because originally I tried to wrap it and ended up hating my designs; started off with a naive AoS 4-vector and then we ended up getting AVX registers). Somehow I was under the impression that SIMD intrinsics are somehow unusual and needed to be written in the lowest-level way possible. It would help if I could at least write some light wrappers... –  Jan 24 '18 at 11:19
  • @MSalters ... like ones that at least use function overloading for loads and stores and operations since I'm always forgetting the names of the intrinsics... like `_mm_load_ps`, `_mm256_load_ps`... those I can remember. But `_mm_load_si128` -- I'm always forgetting that one when, say, finding the minimum among two signed 32-bit integer vectors is called `_mm_min_epi32` instead of `_mm_min_si128`. –  Jan 24 '18 at 11:22
    @MSalters: you don't really want `malloc` to waste space aligning *every* allocation of 64 bytes or larger to a 64-byte boundary (on a target with AVX512), just in case it will be used to hold a `__m512`. Some code does make lots of small allocations (e.g. of strings). Of course, aligning even allocations too small to hold such an object would be far worse, and I think current `malloc` designs (like glibc's) align *everything* to `max_align_t`, even allocations smaller than 16 bytes. – Peter Cordes Jan 24 '18 at 17:23
    @TeamUpvote: `epi32` means it operates on 32-bit signed-integer elements. `epu32` means unsigned 32-bit. `si128` means "integer but no element boundaries", like load/store, or `pslldq` (byte-shift across the whole register). The `e` in `epi` is "extended", vs. the MMX version. But the `si128` isn't ambiguous with MMX, because it's `128` instead of `64`, I guess. (Intel has a bad habit of making us type longer names for new versions, especially crap like `_mm256_loadu_si256`. Thanks, I got it, yes I'm using 256-bit intrinsics, and really love typing that number...) Asm mnemonics are nicer – Peter Cordes Jan 24 '18 at 17:26
    @PeterCordes Speaking of naming, the conversion and move intrinsics are a hell and a half. – Mysticial Jan 24 '18 at 21:47
    @MSalters max_align_t is part of the ABI. Compilers can add new types like __m512 whenever Intel wants, but changing max_align_t would be an ABI break (in addition to the reasons already mentioned not to over-align everything). That same reason prevents __int128 from being intmax_t. – Marc Glisse Jan 25 '18 at 22:13
    @TeamUpvote Search for "align" on the C++17 Wikipedia page? (works just as well with all other such resources) – Marc Glisse Jan 25 '18 at 22:16
    @MarcGlisse Dug up some info and the proposal for `operator new/new[]`, though `std::allocator` uses this now by default? That's the part I couldn't figure out as standard library containers don't use default `operator new`, but rather `std::allocator` with placement new to construct elements in place. If `std::allocator` is just guaranteed to deal with over-aligned data with aligned allocations, that would make my life a lot easier. –  Feb 07 '18 at 01:29
    @TeamUpvote In [allocator.members], for allocate, it says "aligned appropriately for objects of type T." – Marc Glisse Feb 07 '18 at 07:43