
According to Intel's Software Developer Manual (sec. 14.9), AVX relaxed the alignment requirements of memory accesses. If data is loaded directly in a processing instruction, e.g.

vaddps ymm0,ymm0,YMMWORD PTR [rax]

the load address doesn't have to be aligned. However, if a dedicated aligned load instruction is used, such as

vmovaps ymm0,YMMWORD PTR [rax]

the load address has to be aligned (to multiples of 32), otherwise an exception is raised.
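For reference, the corresponding intrinsics are _mm256_load_ps (aligned) and _mm256_loadu_ps (unaligned); a minimal sketch of the mapping (the helper names are mine, p is an arbitrary float pointer):

#include <immintrin.h>

// Sketch: intrinsic counterparts of the two load forms.
static inline __m256 load_aligned(const float *p)   { return _mm256_load_ps(p);  } // vmovaps semantics: faults if p % 32 != 0
static inline __m256 load_unaligned(const float *p) { return _mm256_loadu_ps(p); } // vmovups semantics: any address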

What confuses me is the automatic code generation from intrinsics, in my case by gcc/g++ (4.6.3, Linux). Please have a look at the following test code:

#include <x86intrin.h>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

#define SIZE (1L << 26)
#define OFFSET 1

int main() {
  float *data;
  // Allocate a 32-byte-aligned buffer on the heap.
  assert(!posix_memalign((void**)&data, 32, SIZE*sizeof(float)));
  for (unsigned i = 0; i < SIZE; i++) data[i] = drand48();
  float res[8]  __attribute__ ((aligned(32)));
  __m256 sum = _mm256_setzero_ps(), elem;
  // OFFSET 1 deliberately makes d misaligned by 4 bytes.
  for (float *d = data + OFFSET; d < data + SIZE - 8; d += 8) {
    elem = _mm256_load_ps(d);  // aligned-load intrinsic on an unaligned address
    // sum = _mm256_add_ps(elem, elem);
    sum = _mm256_add_ps(sum, elem);
  }
  _mm256_store_ps(res, sum);
  for (int i = 0; i < 8; i++) printf("%g ", res[i]);
  printf("\n");
  return 0;
}

(Yes, I know the code is faulty, since I use an aligned load on unaligned addresses, but bear with me...)

I compile the code with

g++ -Wall -O3 -march=native -o memtest memtest.C

on a CPU with AVX. If I check the code generated by g++ by using

objdump -S -M intel-mnemonic memtest | more

I see that the compiler does not generate an aligned load instruction, but loads the data directly in the vector addition instruction:

vaddps ymm0,ymm0,YMMWORD PTR [rax]

The code executes without any problem, even though the memory addresses are not aligned (OFFSET is 1). This is clear since vaddps tolerates unaligned addresses.

If I uncomment the line with the second addition intrinsic, the compiler cannot fold the load into the addition (vaddps takes at most one memory source operand, and here the loaded value is needed as both sources), and generates:

vmovaps ymm0,YMMWORD PTR [rax]
vaddps ymm1,ymm0,ymm0
vaddps ymm0,ymm1,ymm0

And now the program seg-faults, since a dedicated aligned load instruction is used, but the memory address is not aligned. (The program doesn't seg-fault if I use _mm256_loadu_ps, or if I set OFFSET to 0, by the way.)
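For completeness, here is a minimal sketch of the loop with the unaligned-load intrinsic, which is safe for any OFFSET regardless of whether the compiler folds the load:

for (float *d = data + OFFSET; d < data + SIZE - 8; d += 8) {
  elem = _mm256_loadu_ps(d);  // vmovups or a folded memory operand: no alignment requirement
  sum = _mm256_add_ps(sum, elem);
}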

This leaves the programmer at the mercy of the compiler and makes the behavior partly unpredictable, in my humble opinion.

My question is: Is there a way to force the C compiler to either generate a direct load in a processing instruction (such as vaddps) or to generate a dedicated load instruction (such as vmovaps)?

Ralf
    What's the motivation for doing so? If you don't know whether the data is properly aligned, just use an unaligned load. I wouldn't say that you're at the mercy of the compiler; if you tell it to use an aligned load, I wouldn't be surprised if it segfaults in the event that the pointer isn't aligned. The fact that in some cases the compiler will emit code that works around your bug is just gravy. – Jason R Jun 27 '15 at 15:45
Recently, compilers have started to *never* generate aligned memory accesses. It makes it easier to not make the distinction, and there's no performance penalty on all processors starting from Nehalem. Personally, I'd rather it crash so it lets me know that I have a potential performance bug. – Mysticial Jun 27 '15 at 16:14
@JasonR: I find the behavior inconsistent. Maybe I should have included another twist: If I use `_mm256_loadu_ps` on the original code, gcc generates an unaligned load `vmovups` and a `vaddps` working on register operands, while it could perfectly well have generated just a `vaddps` instruction with a memory operand, as that tolerates unaligned addresses. – Ralf Jun 27 '15 at 16:55
@Mysticial: Do you know a reference where this transition in compiler design is described (particularly: which versions of which compilers are based on the old and new alignment assumption)? – Ralf Jun 27 '15 at 17:00
    @Ralf Visual Studio started doing it around VS2013. Intel Compiler started doing it some time between ICC11 and ICC13. I'm unsure about GCC though (if it does it at all). – Mysticial Jun 27 '15 at 17:14
  • @Ralf That may be true, but what's important is whether there is a measurable performance difference between the two approaches. I would be surprised if there is in any realistic benchmark. – Jason R Jun 28 '15 at 00:06
    I believe contemporary versions of both gcc and clang will emit aligned move instructions, both when asked and if the moves are automatically generated. This can in some cases cause problems, for instance if the stack isn't aligned properly; spilling of SSE/AVX register types to the stack can cause segmentation faults. – Jason R Jun 28 '15 at 00:07
If you use `_mm256_loadu_ps` instead, does it fuse? [Last time I did this with GCC it did not fuse but MSVC did](http://stackoverflow.com/questions/21134279/difference-in-performance-between-msvc-and-gcc-for-highly-optimized-matrix-multp). You are at the mercy of the compiler when it comes to fusing with intrinsics. There is no way to explicitly control the fusing with intrinsics. You have to use assembly if you want to explicitly control the fusing. – Z boson Jun 28 '15 at 10:04
  • @Ralf, sorry, but I made some mistakes in my answer and had to revise it. – Z boson Jul 10 '15 at 08:35

2 Answers


There is no way to explicitly control folding of loads with intrinsics. I consider this a weakness of intrinsics. If you want to explicitly control the folding then you have to use assembly.
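A minimal sketch of the assembly route with GCC extended asm (AT&T syntax; the helper name and the cast are illustrative assumptions, not tested production code):

#include <immintrin.h>

// Sketch: force the folded load+add by writing the instruction ourselves.
// The "m" constraint hands the memory operand straight to vaddps, so no
// separate vmovaps/vmovups can be emitted. Note that the cast to __m256*
// formally asserts 32-byte alignment to the compiler; hardware-wise the
// memory-source vaddps tolerates any address.
static inline __m256 add_folded(__m256 sum, const float *p) {
    __asm__("vaddps %1, %0, %0"
            : "+x"(sum)
            : "m"(*(const __m256 *)p));
    return sum;
}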

In previous versions of GCC I was able to control the folding to some degree by using an aligned or unaligned load. However, that no longer appears to be the case (GCC 4.9.2). For example, in the function AddDot4x4_vec_block_8wide here, the loads are folded:

vmulps  ymm9, ymm0, YMMWORD PTR [rax-256]
vaddps  ymm8, ymm9, ymm8

However, in a previous version of GCC the loads were not folded:

vmovups ymm9, YMMWORD PTR [rax-256]
vmulps  ymm9, ymm0, ymm9
vaddps  ymm8, ymm8, ymm9

The correct solution is, obviously, to use aligned loads only when you know the data is aligned, and, if you really want to explicitly control the folding, to use assembly.
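If the alignment is only known at run time, one defensive pattern (my illustration, not part of the original code) is to dispatch on the actual pointer:

#include <immintrin.h>
#include <stdint.h>

// Sketch: choose the load intrinsic based on the actual pointer alignment.
static inline __m256 load_checked(const float *p) {
    if (((uintptr_t)p & 31) == 0)
        return _mm256_load_ps(p);   // 32-byte aligned: the vmovaps form is safe
    return _mm256_loadu_ps(p);      // otherwise fall back to the unaligned form
}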

Z boson
    No compiler will ever fold that load into the `vaddps`, because it needs the data from memory as *both* operands. If you haven't tested with AVX, you might want to test again, because this example is not a good test of whether the compiler will fold loads into later instructions, as memory operands. (BTW, fusing is what Intel's decoders do with uops. Some insns with memory operands can't micro-fuse anyway, e.g. `PINSRW`. I like the term "folding" to describe replacing a load with a memory operand.) – Peter Cordes Jul 10 '15 at 07:03
  • @PeterCordes, you're right. And I like your term folding better as well. I was just looking into this. I'll have to fix my answer. Give me a sec. – Z boson Jul 10 '15 at 07:52
  • @PeterCordes, I looked at [difference-in-performance-between-msvc-and-gcc-for-highly-optimized-matrix-multp](https://stackoverflow.com/questions/21134279/difference-in-performance-between-msvc-and-gcc-for-highly-optimized-matrix-multp) and it does not appear that the alignment matters any more with folding with GCC (4.9.2). – Z boson Jul 10 '15 at 07:59
  • @PeterCordes, okay I fixed my answer, let me know what you think. – Z boson Jul 10 '15 at 08:17
  • @PeterCordes, I redid my tests [here](http://stackoverflow.com/q/21134279/2542702) a while back and I did not see a difference any more between GCC and MSVC. I did not look at it carefully but I think it's because GCC is producing essentially the same code as MSVC now. That's disappointing. – Z boson Jul 10 '15 at 08:26
    @PeterCordes I've found that ICC15 will sometimes fold loads even if it means duplicating it. (multiple folded loads to the same address) This is usually in cases of register pressure. – Mysticial Aug 06 '15 at 20:20
    @Mysticial: Cool. As long as it can use a single-register addressing mode, and the code isn't saturating the load ports, micro-fused loads of stuff that's already cached are nearly free. In the unfused domain, the load uop can be dispatched ahead of the other uop, so it doesn't lengthen the dependency chain it's part of. (2-register addressing modes can't micro-fuse on recent Intel: http://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes/31027695. But this probably happens most often with RIP-relative loads of constants.) – Peter Cordes Aug 06 '15 at 20:29
  • Update on that last comment: resource conflicts from extra operations can delay the critical path. If there are dependent loads (i.e. from addresses that out-of-order execution might have to wait for) then excess loads (even of constants) might not be "free". – Peter Cordes Dec 11 '16 at 21:53

In addition to Z boson's answer, I can tell that the problem can be caused by the compiler assuming the memory region is aligned (because of __attribute__ ((aligned(32))) marking the array). At run time that attribute may not be honored for values on the stack, because the stack is only 16-byte aligned (see this bug, which is still open at the time of this writing, though some fixes have made it into gcc 4.6). The compiler is within its rights to choose the instructions that implement the intrinsics, so it may or may not fold the memory load into the computational instruction, and it is also within its rights to use vmovaps when the folding does not occur (because, as noted before, the memory region is supposed to be aligned).

You can try forcing the compiler to realign the stack to 32 bytes upon entry to main by specifying -mstackrealign and -mpreferred-stack-boundary=5 (see here), but this will incur a performance overhead.
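For example, the compile line from the question would then look something like this (an untested sketch):

g++ -Wall -O3 -march=native -mstackrealign -mpreferred-stack-boundary=5 -o memtest memtest.C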

Andrey Semashev
    Unlike SSE, AVX instructions like `vaddps ymm0, ymm1, [rsp+16]` don't need their memory source operand to be aligned. (Except for `vmovaps` which explicitly requests alignment checking; that's why compilers use `vmovups` when vectorizing if array alignment isn't known at compile time, or for `loadu` / `storeu` intrinsics.) GCC does prefer to align stack memory if it's going to be using AVX instructions on it, though. – Peter Cordes Jun 11 '20 at 12:22
  • @PeterCordes Yes, but the compiler is not required to generate `vaddps` with a memory operand. It is within its rights to generate `vmovaps`, which means user's code is not safe until it ensures the data is properly aligned. – Andrey Semashev Jun 11 '20 at 12:34
  • Yes, from that source using `_mm256_load_ps`, it could do that. But your answer's justification for why it can fold the loads is backwards. AVX code can *always* fold loads without having to prove alignment first, an improvement over SSE. Also, current GCC *does* know how to over-align the stack as needed to support `alignas(32) float res[8];` or `__attribute__((aligned(32)))`. Note the function prologue on https://godbolt.org/z/NmBBru. Even the OP's gcc4.6 had that: https://godbolt.org/z/fZq5tT with a more complicated prologue. – Peter Cordes Jun 11 '20 at 12:43
The bug you linked was about people breaking the alignment mechanism by using `-mpreferred-stack-boundary=2` and GCC forgetting that it didn't have 16-byte alignment for free. For requested alignments > 16 it always needs to align the stack. – Peter Cordes Jun 11 '20 at 12:44
  • @PeterCordes The bug I linked is a confirmation that `__attribute__((aligned))` doesn't (or at least didn't) work when the variable is placed on the stack. It doesn't matter what flags the reporters passed to the compiler (as long as it's not something fundamentally pathological), the important part is that the requested alignment is not fulfilled. The bug is still open, so I'm not sure it is fixed in all cases, even if the current compilers appear to align the stack. – Andrey Semashev Jun 11 '20 at 12:53
@PeterCordes "AVX code can always fold loads without having to prove alignment first, an improvement over SSE" -- This is not the point. The point is that the alignment specified in the code may not be respected at run time, and this can cause the crash. Whether the compiler will fold the loads or not is largely irrelevant; on that account I'm just noting that the compiler may or may not fold at its own discretion. – Andrey Semashev Jun 11 '20 at 12:58
  • Ok, that's what you meant, but unfortunately your answer phrased it in a way that implies guaranteed alignment is relevant to load folding. Also, `__attribute__((aligned(32)))` and C++11 `alignas(32)` *do* work on local arrays with GCC. GCC on Windows has a different bug where auto-vectorization can fail to align the stack, but explicit `__attribute__` still works there. – Peter Cordes Jun 11 '20 at 13:15
  • Like I said, the bug you linked is only about a case where `-mpreferred-stack-boundary=2` broke `__attribute__((aligned(16)))`. That option isn't safe in general (violates the ABI), the OP isn't using it, and that was a bug that was fixed before GCC4.6 – Peter Cordes Jun 11 '20 at 13:15
  • @PeterCordes I've updated the answer to make my point more clearly. – Andrey Semashev Jun 11 '20 at 13:59