
I am looking at the generated assembly for my code (using Visual Studio 2017) and noticed that _mm_load_ps is often (always?) compiled to movups.

The data I'm using _mm_load_ps on is defined like this:

struct alignas(16) Vector {
    float v[4];
};

// often embedded in other structs like this
struct AABB {
    Vector min;
    Vector max;
    bool intersection(/* parameters */) const;
};

Now when I'm using this construct, the following will happen:

// this code
__m128 bb_min = _mm_load_ps(min.v);

// generates this
movups  xmm4, XMMWORD PTR [r8]

I'm expecting movaps because of alignas(16). Do I need something else to convince the compiler to use movaps in this case?

EDIT: My question is different from this question because I'm not getting any crashes. The struct is specifically aligned and I'm also using aligned allocation. Rather, I'm curious why the compiler turns _mm_load_ps (the intrinsic for aligned memory) into movups. If I know the struct was allocated at an aligned address and I'm accessing it through this, it would be safe to use movaps, right?
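
For reference, here is a minimal, self-contained sketch of the kind of code I mean (the intersection body and test values are just illustrative, and it assumes C++17 aligned operator new for the heap allocation; otherwise something like _aligned_malloc or _mm_malloc would be used):

#include <xmmintrin.h>  // SSE intrinsics

struct alignas(16) Vector {
    float v[4];
};

struct AABB {
    Vector min;
    Vector max;

    // illustrative body: both corners are loaded with the aligned intrinsic;
    // alignas(16) guarantees 16-byte alignment as long as the AABB itself
    // sits at a properly aligned address
    bool intersection(const Vector& p) const {
        __m128 bb_min = _mm_load_ps(min.v);
        __m128 bb_max = _mm_load_ps(max.v);
        __m128 pt     = _mm_load_ps(p.v);
        __m128 ge_min = _mm_cmpge_ps(pt, bb_min);
        __m128 le_max = _mm_cmple_ps(pt, bb_max);
        return _mm_movemask_ps(_mm_and_ps(ge_min, le_max)) == 0xF;
    }
};

int main() {
    // C++17 aligned operator new honors alignas(16) here;
    // otherwise _aligned_malloc / _mm_malloc would be used
    AABB* box = new AABB{{{0, 0, 0, 0}}, {{1, 1, 1, 1}}};
    Vector p  = {{0.5f, 0.5f, 0.5f, 0.5f}};
    bool hit  = box->intersection(p);
    delete box;
    return hit ? 0 : 1;
}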

  • For what purpose do you specifically want a `movaps`? – harold Mar 09 '17 at 13:59
  • @harold He's moving four floats and aligned instructions are often more performant, particularly on some generations of cpu. – J... Mar 09 '17 at 14:02
  • Possible duplicate of [SSE, intrinsics, and alignment](http://stackoverflow.com/questions/12502071/sse-intrinsics-and-alignment) – J... Mar 09 '17 at 14:03
  • @J... yea Core2. Doesn't matter on anything newer as far as I know, as long as the address is actually aligned – harold Mar 09 '17 at 14:03
  • tldr; `alignas` isn't perfect or a guarantee, `memcpy` can put these structs anywhere (including unaligned locations), `malloc` won't always give you aligned memory, etc. See the dupe - you generally need to write your own allocator using `_aligned_malloc`. – J... Mar 09 '17 at 14:05
  • also, read through the [Remarks section here](https://msdn.microsoft.com/en-us/library/83ythb65.aspx). (This refers to `__declspec(align(#))`, but since VS2015 `alignas` support is implemented as a veneer for the same.) – J... Mar 09 '17 at 14:07
  • The discussion [here](https://connect.microsoft.com/VisualStudio/feedback/details/812192/inefficient-c-sse2-code-generation) is also interesting. – B_old Mar 09 '17 at 14:16
  • It is by definition safe to use `movaps` to implement `_mm_load_ps` (regardless of actual alignment), it just apparently didn't happen – harold Mar 09 '17 at 14:21
  • @harold: OK, but is that something I can influence? (Apart from writing assembler code) – B_old Mar 09 '17 at 15:33
  • You need to show a complete example that demonstrates the problem, including the compiler options you've used and the version of Visual Studio 2017 you're using. – Ross Ridge Mar 09 '17 at 17:49
  • @harold No, `movaps` will certainly cause an exception with an unaligned address. – J... Mar 09 '17 at 17:59
  • @J... yes, and `_mm_load_ps` is allowed to do that too, though it doesn't have to – harold Mar 09 '17 at 18:01
  • On VS and ICC, if you compile for AVX or higher, the compiler almost never issues aligned SIMD load/stores. It's allowed to do that since it's not a loss of functionality and all processors starting from Nehalem have no penalty for using unaligned load/stores when the address is aligned. They do it because it makes the compiler simpler (not having to choose between aligned/unaligned) and it doesn't crash if it's misaligned. Though I strongly disagree with that latter one since I'd much prefer that it actually crash on misalignment since that's a bug that should be fixed, not hidden. – Mysticial Mar 10 '17 at 00:01
  • @Mysticial: That's good information, but I just compile for x64. Does the same apply there? – B_old Mar 10 '17 at 09:25
  • @Mysticial Your answer sounds pretty convincing to me. Maybe post it as an actual answer if you have the time – Guillaume Gris Aug 02 '17 at 08:07
  • @GuillaumeGris Done. – Mysticial Aug 02 '17 at 16:45
  • Related: [Is there a way to force visual studio to generate aligned sse intrinsics](https://stackoverflow.com/q/61816101) - maybe not. – Peter Cordes Aug 29 '22 at 18:42

1 Answer

On recent versions of Visual Studio and the Intel Compiler (recent as in post-2013?), the compiler rarely, if ever, generates aligned SIMD load/stores anymore.

When compiling for AVX or higher:

  • The Microsoft compiler (>VS2013?) doesn't generate aligned loads. But it still generates aligned stores.
  • The Intel compiler (> Parallel Studio 2012?) doesn't do it at all anymore. But you'll still see them in ICC-compiled binaries inside their hand-optimized libraries like memset().
  • As of GCC 6.1, it still generates aligned load/stores when you use the aligned intrinsics.

The compiler is allowed to do this because it's not a loss of functionality when the code is written correctly. All processors starting from Nehalem have no penalty for unaligned load/stores when the address is aligned.

Microsoft's stance on this issue is that it "helps the programmer by not crashing". Unfortunately, I can't find the original source for this statement from Microsoft anymore. In my opinion, this achieves the exact opposite of that because it hides misalignment penalties. From the correctness standpoint, it also hides incorrect code.

Whatever the case is, unconditionally using unaligned load/stores does simplify the compiler a bit.
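
To make the "hiding incorrect code" point concrete, here is a contrived sketch: the load below uses the aligned-load intrinsic on a deliberately misaligned address, so it should fault, but a compiler that emits movups/vmovups will silently run it (just slower whenever the access splits a cache line):

#include <xmmintrin.h>

int main() {
    alignas(16) float buf[8] = {0, 1, 2, 3, 4, 5, 6, 7};

    // buf + 1 is 4 bytes past a 16-byte boundary, so the address is NOT
    // 16-byte aligned. This _mm_load_ps is a bug: movaps would raise #GP,
    // but movups happily loads it anyway.
    __m128 v = _mm_load_ps(buf + 1);

    float out[4];
    _mm_storeu_ps(out, v);
    return static_cast<int>(out[0]);  // returns 1
}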

New Revelations:

  • Starting with Parallel Studio 2018, the Intel Compiler no longer generates aligned moves at all - even for pre-Nehalem targets.
  • Starting from Visual Studio 2017, the Microsoft Compiler also no longer generates aligned moves at all - even when targeting pre-AVX hardware.

Both cases result in inevitable performance degradation on older processors. But it seems that this is intentional as both Intel and Microsoft no longer care about old processors.


The only load/store intrinsics that are immune to this are the non-temporal load/stores. There is no unaligned equivalent of them, so the compiler has no choice.

So if you just want to test your code for correctness, you can substitute the load/store intrinsics with non-temporal ones. But be careful not to let something like this slip into production code, since NT load/stores (NT stores in particular) are a double-edged sword that can hurt you if you don't know what you're doing.
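
As a sketch of what that substitution could look like (the macro names are made up, and this is debug-only scaffolding, not production code):

#include <smmintrin.h>  // SSE4.1 for _mm_stream_load_si128

// The NT instructions behind these (movntdqa / movntps) require 16-byte
// alignment, so a misaligned address faults even when the compiler would
// otherwise have emitted movups for the regular intrinsics.
#ifdef CHECK_ALIGNMENT
    #define LOAD_PS(p)     _mm_castsi128_ps( \
        _mm_stream_load_si128(reinterpret_cast<__m128i*>(const_cast<float*>(p))))
    #define STORE_PS(p, v) _mm_stream_ps((p), (v))
#else
    #define LOAD_PS(p)     _mm_load_ps(p)
    #define STORE_PS(p, v) _mm_store_ps((p), (v))
#endif

Keep in mind that _mm_stream_ps really is a non-temporal store (it bypasses the cache), which is exactly why this belongs behind a debug switch rather than in production code.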

  • Related: gcc also really likes alignment when auto-vectorizing, and goes scalar until an alignment boundary (with fully-unrolled intro/cleanup code, which is a lot of code-bloat with AVX2 and small elements). It does this even with `-mtune=skylake` or something. Anyway, making sure gcc knows about any alignment guarantees you can give it will reduce code-bloat and avoid a conditional branch or two when auto-vectorizing. – Peter Cordes Aug 02 '17 at 22:03
  • NT load on write-back memory runs exactly identical to a normal load, on Intel Sandybridge-family at least. They could have made it work somewhat like prefetchNTA, but didn't (probably because it would need hardware prefetchers that were NT-aware for it to not suck). (Working on an update to https://stackoverflow.com/questions/32103968/non-temporal-loads-and-the-hardware-prefetcher-do-they-work-together; turns out my guess was wrong that it did something like fetching into only one way of cache to avoid pollution. Only pfNTA does that.) – Peter Cordes Aug 02 '17 at 22:08
  • @PeterCordes Interestingly, the NT load throughput is only 1/cycle on Skylake X as opposed to 2/cycle for all other loads. ([according to AIDA64](https://github.com/InstLatx64/InstLatx64/blob/master/GenuineIntel0050654_SkylakeX_InstLatX64.txt)) – Mysticial Aug 02 '17 at 22:14
  • On Skylake-S (desktop), reloading the same 64 bytes with `movntdqa xmm0, [rsi]` / `movntdqa xmm1, [rsi+16]`, etc. it runs ~1.71 per clock, vs. 2.0 per clock for `movdqa`. So even for the most trivial case, it's slower. Thanks for pointing that out. – Peter Cordes Aug 02 '17 at 22:22
  • Those AIDA64 numbers show that AVX512 EVEX `vmovntdqa` (1 per 1.08) is different from regular SSE or AVX VEX `movntdqa` (1 per 0.52). And that EVEX `VMOVNTDQA + VMOVNTDQ x/y/zmm` reload/store still has terrible latency, but throughput is 1 per ~19.25c instead of being the same as latency. (And ZMM NT store/reload latency is lower than the other two sizes, which is another hint that full-cache-line NT stores are special. Being much higher single-threaded bandwidth than narrower NT stores was already a big hint.) – Peter Cordes Aug 02 '17 at 22:33
  • Yeah. I haven't tried to figure out what changed underneath. But when I toggle NT-stores in my code, the difference was drastic (something like 10 - 15%). This is more than what I saw on Haswell. Granted, much of that might've had to do with the overall memory bandwidth bottleneck. – Mysticial Aug 02 '17 at 22:59
  • Regarding "hiding misalignment penalties": This also happens with clang/gcc and AVX if a `_mm_load[u]_ps` can be fused with another operation (like `vaddps`): https://godbolt.org/z/2ZL5FQ So it is also not always trivial to force clang/gcc to actually generate `[v]movaps` instructions. – chtz May 15 '20 at 11:03