I hope this won't turn out to be a really dumb question I'll be embarrassed about later, but I've always been confused about SIMD intrinsics, to the point where I find it easier to reason about assembly code than about the intrinsics themselves.
So the main question I have is about using SIMD intrinsic data types like __m256. And to skip to the point, my question is about doing things like this:
class PersistentObject
{
    ...
private:
    std::vector<__m256, AlignedAlloc<__m256, 32>> data;
};
Is that gross? Acceptable? Will it trip up compilers when it comes to generating the most efficient code? That's the part that's confusing me right now. I'm at the inexperienced level where, when I have a hotspot and have exhausted all other immediate options, I give SIMD intrinsics a shot, always looking to back out my changes if they don't improve performance (and I've backed out so many SIMD-related changes).
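(Since AlignedAlloc isn't a standard type, here's a minimal sketch of the kind of aligned allocator I mean; the _mm_malloc-based implementation is just one way to do it, and the interface is only what std::vector minimally needs:)

#include <immintrin.h>
#include <cstddef>
#include <new>

template <typename T, std::size_t Alignment>
struct AlignedAlloc
{
    using value_type = T;

    AlignedAlloc() = default;
    template <typename U>
    AlignedAlloc(const AlignedAlloc<U, Alignment>&) {}

    T* allocate(std::size_t n)
    {
        // _mm_malloc returns storage aligned to the requested boundary.
        if (void* p = _mm_malloc(n * sizeof(T), Alignment))
            return static_cast<T*>(p);
        throw std::bad_alloc();
    }

    void deallocate(T* p, std::size_t) { _mm_free(p); }
};

template <typename T, typename U, std::size_t A>
bool operator==(const AlignedAlloc<T, A>&, const AlignedAlloc<U, A>&) { return true; }
template <typename T, typename U, std::size_t A>
bool operator!=(const AlignedAlloc<T, A>&, const AlignedAlloc<U, A>&) { return false; }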
But this question and confusion I have about storing SIMD intrinsic types persistently also made me realize that I don't really understand how these intrinsics work at a fundamental compiler level. My mind wants to think of __m256 like an abstract YMM register (not necessarily allocated yet). That starts to click with me when I see load and store instructions: I think of them as hints for the compiler to perform its register allocation.
And I didn't have to put much more thought into it than that before, because I always used SIMD types in a temporary way: _mm256_load_ps into a __m256, do some operations, store the results back to a 32-byte-aligned array of eight single-precision floats (float[8]). I got away with thinking of __m256 like a YMM register.
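For concreteness, that short-lived pattern looks something like this (a minimal sketch; the arrays and the arithmetic are just placeholders):

#include <immintrin.h>

alignas(32) float a[8]   = {1, 2, 3, 4, 5, 6, 7, 8};
alignas(32) float b[8]   = {8, 7, 6, 5, 4, 3, 2, 1};
alignas(32) float out[8];

void multiply_add_once()
{
    __m256 va = _mm256_load_ps(a);      // aligned load into a "YMM-like" value
    __m256 vb = _mm256_load_ps(b);
    __m256 vr = _mm256_add_ps(_mm256_mul_ps(va, vb), vb);
    _mm256_store_ps(out, vr);           // aligned store back to plain floats
}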
Abstract YMM Register?
But recently I was implementing a data structure that attempts to revolve around SIMD processing (a simple one representing a bunch of vectors in SoA fashion), and here it becomes convenient if I can just work predominantly with __m256 without constantly loading from an array of floats and storing the results back afterward. And in some quick tests, MSVC at least seems to emit the appropriate instructions mapping my intrinsics to assembly (along with proper aligned loads and stores when I access data out of the vector). But that breaks my conceptual model of __m256 as an abstract YMM register, because storing these things persistently implies something more like a regular variable; but at that point, what's up with the loads and stores?
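To make the "persistent" usage concrete, this is the kind of code I mean (a sketch; data is the member from the class above, and it relies on the AlignedAlloc sketch earlier):

#include <immintrin.h>
#include <vector>
#include <cstddef>

__m256 sum_all(const std::vector<__m256, AlignedAlloc<__m256, 32>>& data)
{
    // No explicit load/store intrinsics anywhere: plain reads of data[i]
    // become aligned vector loads (or get folded into vaddps) by the optimizer.
    __m256 sum = _mm256_setzero_ps();
    for (std::size_t i = 0; i < data.size(); ++i)
        sum = _mm256_add_ps(sum, data[i]);
    return sum;
}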
So I'm tripping a bit over the conceptual model I built in my head about how to think of all this stuff, and my hope is that maybe someone experienced can immediately recognize what's broken with the way I'm thinking about it and give me that eureka answer which debugs my brain. I hope this question isn't too dumb (I have an uneasy feeling that it is, but I have tried to discover the answer elsewhere, only to still find myself confused). So ultimately: is it acceptable to directly store these data types persistently (implying that we'd reload the memory at some point after it has already spilled out of a YMM register, without using _mm_load*), and, if so, what's wrong with my conceptual model?
Apologies if this is such a dumb question! I'm really wet behind the ears with this stuff.
Some More Details
Thanks so much for the helpful comments so far! I suppose I should share some more details to make my question less fuzzy. Basically I'm trying to create a data structure which is little more than a collection of vectors stored in SoA form:
xxxxxxxx....
yyyyyyyy....
zzzzzzzz....
... and mainly with the intention of being used for hotspots where the critical loops have a sequential access pattern. But at the same time, the non-critical execution paths might want to randomly access, say, the 5th 3-vector in AoS form (x/y/z), at which point we're inevitably doing scalar access (which is perfectly fine even if it's not so efficient, since those aren't critical paths).
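Something like this is what I have in mind (a hypothetical sketch; the names are made up, and the scalar accessor round-trips through a small stack buffer to stay well-defined):

#include <immintrin.h>
#include <vector>
#include <cstddef>

struct Vec3SoA
{
    // Each component is a sequence of __m256 blocks holding 8 lanes apiece.
    std::vector<__m256, AlignedAlloc<__m256, 32>> x, y, z;

    // Non-critical scalar access: fetch lane i of the x component.
    float getX(std::size_t i) const
    {
        alignas(32) float tmp[8];
        _mm256_store_ps(tmp, x[i / 8]);  // spill the block containing lane i
        return tmp[i % 8];
    }
};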
In this one peculiar case, I'd find it a lot more convenient from an implementation standpoint to just persistently store and work with __m256 instead of float*. It would prevent me from sprinkling a lot of vertical, loopy code with _mm_load* and _mm_store* calls, because the common case in this scenario (both in terms of critical execution and the bulk of the code) is implemented with SIMD intrinsics. But I'm not sure if this is a sound practice over reserving __m256 for short-lived temporary data, local to some function, used to load some floats into a __m256, do some operations, and store the results back, as I've usually done in the past. It would be quite a bit more convenient, but I'm a bit worried that this convenient style of implementation might choke some optimizers (though I haven't found that to be the case yet). And if it doesn't trip up optimizers, then the way I've been thinking about these data types has been a bit off all this time.
So in this case, if it's perfectly fine to do this stuff and our optimizers handle it brilliantly all the time, then I'm confused, because the way I was thinking about this stuff, believing we needed those explicit _mm_load and _mm_store calls in short-lived contexts (local to a function, i.e.) to help out our optimizers, was all wrong! And it sort of upsets me that this works fine, because I didn't think it was supposed to work fine! :-D
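For instance, on the compilers I've tried, these two reads come out identically; both end up as a single aligned vector load (the cast is the conventional, if informal, way to view a stored __m256 as floats; data and i are from the earlier sketches):

__m256 plain  = data[i];   // plain copy of a stored __m256
__m256 loaded = _mm256_load_ps(reinterpret_cast<const float*>(&data[i]));   // explicit load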
Answers
There are a couple of comments from Mysticial that really hit the spot for me and helped fix my brain a bit, as well as giving me some reassurance that what I want to do is all right. They were given in the form of comments instead of an answer, so I'll quote them here in case anyone ever happens to have a confusion similar to mine.
If it helps, I have about 200k LOC written exactly like this. IOW, I treat the SIMD type as a first-class citizen. It's fine. The compiler handles them no differently than any other primitive type. So there are no issues with it.
The optimizers aren't that flimsy. They do maintain correctness within reasonable interpretations of the C/C++ standards. The load/store intrinsics aren't really needed unless you need the special ones (unaligned, non-temporal, masked, etc...)
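To put that last point in code, these are the kinds of loads and stores that still need the explicit intrinsics (a sketch; p, q, v, and the mask are hypothetical, and q is assumed 32-byte aligned):

#include <immintrin.h>

void special_cases(const float* p, float* q, __m256 v)
{
    __m256 u = _mm256_loadu_ps(p);   // unaligned load
    _mm256_stream_ps(q, v);          // non-temporal store (bypasses the cache)
    __m256i mask = _mm256_setr_epi32(-1, -1, -1, -1, 0, 0, 0, 0);
    __m256 m = _mm256_maskload_ps(p, mask);   // masked load of the first 4 lanes
    (void)u; (void)m;
}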
That said, please feel free to write your own answers as well; the more info, the merrier! I'm really hoping to improve my fundamental understanding of how to write SIMD code with greater confidence, since I'm at the stage where I'm hesitant about everything and still second-guessing myself a lot.
Reflecting Back
Thanks again so much to everyone! I feel so much clearer now and more confident about designing code built around SIMD. For some reason I was extremely suspicious of the optimizer just for SIMD intrinsics, thinking I had to write my code in the lowest-level way possible and keep those loads and stores as local as possible, in a limited function scope. I think some of my superstitions came from originally writing SIMD intrinsics against older compilers almost a couple of decades ago, when maybe the optimizers needed more help, or maybe I've just been irrationally superstitious the whole time. I was looking at it kind of like how people looked at C compilers in the 80s, putting things like register hints here and there.
With SIMD I've always had very mixed results, and in spite of using it here and there once in a blue moon, I have a tendency to constantly feel like a beginner, perhaps because the mixed success has made me reluctant to use it, which has significantly delayed my learning process. Lately I'm trying to correct that, and I really appreciate all the help!