
I am trying to wrap my head around memory accesses to SIMD intrinsic types that have or haven't been loaded into registers.

Assume some SIMD functions that accept references to float arrays. For example:

void do_something(std::array<float, 4>& arr);
void do_something_else(std::array<float, 4>& arr);

Each function first loads the data into registers, performs its operation, then stores the result back into the array. Now consider the following snippet:

std::array<float, 4> my_arr{0.f, 0.f, 0.f, 0.f};
do_something(my_arr);
do_something_else(my_arr);
do_something(my_arr);
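
For clarity, here is a rough sketch of what I imagine such a function to look like (the _mm_add_ps is only a placeholder operation):

#include <array>
#include <immintrin.h>

void do_something(std::array<float, 4>& arr) {
    __m128 v = _mm_loadu_ps(arr.data());   // load the four floats into a register
    v = _mm_add_ps(v, _mm_set1_ps(1.f));   // placeholder SIMD work
    _mm_storeu_ps(arr.data(), v);          // store the result back to memory
}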

Does the C++ compiler optimize out the unnecessary loads and stores between the function calls? Does this even matter?

I've seen libraries that wrap an __m128 in a struct and perform the load in the constructor. What happens when you store these on the heap and call intrinsics on them? For example:

struct vec4 {
    vec4(const std::array<float, 4>& arr) {
        data = _mm_loadu_ps(arr.data()); // do the load here
    }

    __m128 data;
};

std::vector<vec4> my_vecs;
// do SIMD work

Do you have to load/store the data on every access? Or should these classes declare a private operator new, so they can't be stored on the heap?

scx
  • You can see the generated assembly of any C++ code on [godbolt](https://godbolt.org/). You may pick from a number of compilers and play with optimisation flags. – Fureeish Jun 23 '19 at 20:52
  • A reference in C++ is a guaranteed non-nullptr pointer. So you really only pass a pointer. The compiler will not optimize the load & store of your functions unless they are inline. – Alexis Wilke Jun 23 '19 at 20:56
  • `std::vector` of a SIMD type is [a dangerous pattern](https://stackoverflow.com/q/5216071/555045) – harold Jun 23 '19 at 20:58
  • @harold Assuming the vector is correctly aligned (either `operator new[]` or a custom allocator), do you have to load/store in registers for every operation? For example: https://scc.ustc.edu.cn/zlsc/tc4600/intel/2017.0.098/compiler_c/common/core/GUID-BF75C173-FE94-4448-9F99-E25FBDF35090.html Are the `+=` always loading into registers? Wouldn't this be super slow? – scx Jun 23 '19 at 21:10
  • @scx maybe, maybe not, but why take the risk? You can load/store when you need and use `__m128` for the main calculation – harold Jun 23 '19 at 21:15
  • @harold So my conclusion would be, if your type *can* be stored on the heap (whether that is a good idea or not), you probably want to track a loaded state. Adding a bool to the implementation. This would then allow you to operate on the objects using overloaded operators (for example). – scx Jun 23 '19 at 21:23
  • @scx I don't think that works, I meant to do it statically. Dynamically keeping track of what is loaded and what isn't kind of works at the asm level (though I don't see an *efficient* way to do it there) but not once a compiler gets involved. – harold Jun 23 '19 at 21:29
  • So basically, load/store at every operation and pay the price. Or disallow storing the wrapper on the heap. – scx Jun 23 '19 at 21:38
  • My point was you don't have to take any of these options, you can use `__m128` directly where it is appropriate (during the calculation), and you don't need to use it outside of that (you can store the data in normal types, without aliasing it to a SIMD type) – harold Jun 23 '19 at 21:56
  • Ah I see, good point. – scx Jun 23 '19 at 22:00

1 Answer


If the compiler compiles the functions separately from the calls, it cannot optimize out the stores and loads. This is definitely the case when the functions are in one .cpp file, the calls are in another .cpp file, and link-time optimization is not enabled.

However, if the compiler

  1. sees the function definitions and their calls at the same time (or during link time optimization),

  2. decides to inline the function calls and

  3. decides to fuse the loops,

then it will likely remove the unnecessary stores and loads.

Note, however, that none of these three points is trivial. The programmer controls only the first point; the other two are entirely at the discretion of the compiler. Consequently, you generally have to assume that such optimizations do not happen. The chances of inlining rise a bit if your functions are templates (which also guarantees that point 1 is satisfied), but whether the compiler actually fuses the loops is out of your control.
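
If you want to be sure the data stays in registers across the calls, you can sidestep the question entirely: let the functions take and return __m128 values, and load/store only once in the caller. A minimal sketch of that approach (the operation bodies are placeholders):

#include <array>
#include <immintrin.h>

// These operate purely on register values; no memory access inside.
inline __m128 do_something(__m128 v)      { return _mm_add_ps(v, _mm_set1_ps(1.f)); }
inline __m128 do_something_else(__m128 v) { return _mm_mul_ps(v, _mm_set1_ps(2.f)); }

void pipeline(std::array<float, 4>& arr) {
    __m128 v = _mm_loadu_ps(arr.data());  // one load
    v = do_something(v);
    v = do_something_else(v);
    v = do_something(v);
    _mm_storeu_ps(arr.data(), v);         // one store
}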


Regarding structs that contain SIMD types: it's perfectly legal for a SIMD type to reside on the heap, and once it's correctly allocated there is no difference from it being allocated on the stack. The one caveat is alignment: __m128 requires 16-byte alignment, so the allocation must provide it. Since C++17, operator new honors over-aligned types, so a plain std::vector works; before that, you need an aligned allocator.
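
To illustrate, using the vec4 from the question (this relies on the C++17 behavior just mentioned, where std::allocator allocates with the over-aligned operator new):

#include <array>
#include <immintrin.h>
#include <vector>

struct vec4 {
    vec4(const std::array<float, 4>& arr) { data = _mm_loadu_ps(arr.data()); }
    __m128 data;
};

static_assert(alignof(vec4) == 16, "the __m128 member makes vec4 over-aligned");

int main() {
    std::array<float, 4> src{1.f, 2.f, 3.f, 4.f};
    std::vector<vec4> my_vecs;
    my_vecs.emplace_back(src); // C++17: element storage is 16-byte aligned
}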

However, you cannot simply alias a std::array<float, 4> with a __m128; that would violate the strict aliasing rules. Reinterpreting a std::array<float, 4> as a __m128 is only safe via a copy (reinterpretation to char*, copy, reinterpretation to __m128); otherwise your compiler is allowed to mix up the accesses to the array and to the SIMD type.
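
A minimal sketch of such a safe conversion: std::memcpy is defined in terms of bytes, so it cannot violate aliasing rules, and compilers typically compile it down to a single vector load/store anyway (the unaligned load/store intrinsics are the more direct equivalent):

#include <array>
#include <cstring>
#include <immintrin.h>

__m128 load_vec(const std::array<float, 4>& arr) {
    __m128 v;
    std::memcpy(&v, arr.data(), sizeof v);   // byte-wise copy, no aliasing violation
    return v;                                // equivalent: _mm_loadu_ps(arr.data())
}

void store_vec(std::array<float, 4>& arr, __m128 v) {
    std::memcpy(arr.data(), &v, sizeof v);   // equivalent: _mm_storeu_ps(arr.data(), v)
}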

cmaster - reinstate monica
  • And is this very heavy, or can it be ignored? – scx Jun 23 '19 at 21:05
  • Memory accesses are always potentially costly. How costly depends on the size of the data and the size of your caches. If your data fits in the L1 cache, the overhead is a few CPU cycles for the load instruction. If your data is larger than the last-level cache, you'll be transferring it through the memory bus at each load and store. In that case, every reduction of memory bus usage is a win. – cmaster - reinstate monica Jun 23 '19 at 21:17