
Let's say I have a function written in C++ that performs matrix-vector multiplications on a lot of vectors. It takes a pointer to the array of vectors to transform. Am I correct to assume that the compiler cannot efficiently optimize that to SIMD instructions because it does not know the alignment of the passed pointer (SSE requiring 16-byte alignment and AVX 32-byte alignment) at compile time? Or is the memory alignment of the data irrelevant for optimal SIMD code, and the data alignment will only affect cache performance?

If alignment is important for the generated code, how can I let the (Visual C++) compiler know that I intend to only pass values with a certain alignment to the function?

  • When you say AVX, you mean with 256b vectors, right? Because compilers can already compile the usual `_mm_whatever` intrinsics to either the SSE or 128b-AVX (VEX-encoded 3-operand) version of the instruction. It's inconvenient at best to do it with `#define`s or wrapper functions to choose `_mm_*` vs. `_mm256_*` versions, esp. if the 256b version needs an extra permute, or to take advantage of an AVX-only instruction. So as Z Boson says, auto-vectorization is your best bet if you can get the compiler to do a good job. – Peter Cordes Nov 04 '15 at 18:26

2 Answers


In theory, alignment should not matter on Intel processors since Nehalem. Therefore, your compiler should be able to produce code in which whether or not a pointer is aligned is not an issue.

Unaligned load/store instructions have the same performance as aligned ones on Intel processors since Nehalem. However, until AVX arrived with Sandy Bridge, unaligned loads could not be folded with another operation into a single micro-fused instruction.

Additionally, even before AVX, 16-byte-aligned memory could still be helpful to avoid the penalty of cache-line splits, so it would still be reasonable for a compiler to add prologue code that runs until the pointer is 16-byte aligned.
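
Conceptually, that prologue looks something like the following loop peeling (my sketch for illustration only, not what any particular compiler emits):

#include <stdint.h>

// Peel scalar iterations until the pointer reaches a 16-byte boundary;
// a compiler would then use aligned vector loads/stores for the rest.
void scale(float *a, int n)
{
    int i = 0;
    while (i < n && ((uintptr_t)(a + i) & 15) != 0) {
        a[i] *= 3.14159f;  // scalar prologue
        i++;
    }
    for (; i < n; i++) {   // stands in for the aligned vector loop + scalar epilogue
        a[i] *= 3.14159f;
    }
}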

Since AVX, there is no advantage to using aligned load/store instructions anymore, and there is no reason for a compiler to add code to make a pointer 16-byte or 32-byte aligned.

However, there is still a reason to use aligned memory with AVX: avoiding cache-line splits. Therefore, it would be reasonable for a compiler to add code to make the pointer 32-byte aligned even if it still used an unaligned load instruction.

So in practice some compilers produce much simpler code when they are told to assume that a pointer is aligned.

I'm not aware of a method to tell MSVC that a pointer is aligned. With GCC and Clang (since 3.6) you can use the built-in `__builtin_assume_aligned`. With ICC and also GCC you can use `#pragma omp simd aligned`. With ICC you can also use `__assume_aligned`.
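
For instance, the OpenMP variant for a loop like the one below could look like this (a sketch; it assumes OpenMP 4.0 support, e.g. -fopenmp or -fopenmp-simd on GCC, and foo_omp is just my name for it):

void foo_omp(float * __restrict a, float * __restrict b, int n)
{
    // Promise the compiler that a and b are 16-byte aligned in this loop.
    #pragma omp simd aligned(a, b : 16)
    for(int i=0; i<(n & (-4)); i++) {
        b[i] = 3.14159f*a[i];
    }
}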

For example, with GCC, compiling this simple loop

void foo(float * __restrict a, float * __restrict b, int n)
{
    // Uncomment these to promise the compiler 16-byte alignment:
    //a = (float*)__builtin_assume_aligned(a, 16);
    //b = (float*)__builtin_assume_aligned(b, 16);
    for(int i=0; i<(n & (-4)); i++) {  // n rounded down to a multiple of 4
        b[i] = 3.14159f*a[i];
    }
}

with `gcc -O3 -march=nehalem -S test.c` and then running `wc test.s` gives 160 lines, whereas if I uncomment the `__builtin_assume_aligned` lines, `wc test.s` gives only 45 lines. When I did this with Clang, it returned 110 lines in both cases.

So with Clang, informing the compiler that the arrays were aligned made no difference (in this case), but with GCC it did. Counting lines of assembly is not a sufficient metric to gauge performance, and I'm not going to post all the assembly here; I just want to illustrate that your compiler may produce very different code when it is told the arrays are aligned.

Of course, the additional overhead GCC incurs by not assuming the arrays are aligned may make no difference in practice; you have to test and see.


In any case, if you want to get the most from SIMD I would not rely on the compiler to do it correctly (especially with MSVC). Your example of matrix*vector is a poor one in general (though maybe not for some special cases) since it's memory-bandwidth bound. But if you choose matrix*matrix, no compiler is going to optimize that well without a lot of help that does not conform to the C++ standard. In these cases you will need intrinsics/built-ins/assembly, in which you have explicit control of the alignment anyway.
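
For illustration, a minimal SSE intrinsics version of the same kind of loop, where the aligned/unaligned choice is explicit in the source (my sketch; it assumes 16-byte-aligned pointers and n a multiple of 4):

#include <xmmintrin.h>  // SSE intrinsics

void foo_sse(const float *a, float *b, int n)
{
    const __m128 k = _mm_set1_ps(3.14159f);
    for(int i = 0; i < n; i += 4) {
        __m128 v = _mm_load_ps(a + i);          // aligned load: faults if a+i is misaligned
        _mm_store_ps(b + i, _mm_mul_ps(k, v));  // aligned store
    }
}

Use _mm_loadu_ps/_mm_storeu_ps instead if you cannot guarantee the alignment.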


Edit:

The assembly file from GCC contains a lot of extraneous lines (directives and labels) that are not part of the text segment. Compiling with `gcc -O3 -march=nehalem -c test.c` and then counting the lines of the text (code) segment in the `objdump -d test.o` output gives 108 lines without `__builtin_assume_aligned` and only 16 lines with it. This shows more clearly that GCC produces very different code when it assumes the arrays are aligned.


Edit:

I went ahead and tested the `foo` function above in MSVC 2013. It produces unaligned loads, and the code is much shorter than GCC's (I show only the main loop here):

$LL3@foo:
    movsxd  rax, r9d
    vmulps  xmm1, xmm0, XMMWORD PTR [r10+rax*4]
    vmovups XMMWORD PTR [r11+rax*4], xmm1
    lea eax, DWORD PTR [r9+4]
    add r9d, 8
    movsxd  rcx, eax
    vmulps  xmm1, xmm0, XMMWORD PTR [r10+rcx*4]
    vmovups XMMWORD PTR [r11+rcx*4], xmm1
    cmp r9d, edx
    jl  SHORT $LL3@foo

This should be fine on processors since Nehalem (late 2008). But MSVC still emits cleanup code for array lengths that are not a multiple of four, even though I told the compiler the trip count was a multiple of four with `(n & (-4))`. At least GCC gets that right.


Since AVX can fold unaligned loads, I checked GCC with AVX to see whether the code with and without the alignment hint would be the same.

void foo(float * __restrict a, float * __restrict b, int n)
{
    // Uncomment these to promise the compiler 32-byte alignment:
    //a = (float*)__builtin_assume_aligned(a, 32);
    //b = (float*)__builtin_assume_aligned(b, 32);
    for(int i=0; i<(n & (-8)); i++) {  // n rounded down to a multiple of 8
        b[i] = 3.14159f*a[i];
    }
}

Without `__builtin_assume_aligned`, GCC produces 168 lines of assembly; with it, only 17 lines.

• Loads and stores that cross a cache-line boundary (64B) are still slower even on the most recent CPUs. The Nehalem-and-later no-unaligned-penalty only applies within a cache line. With sequential 32B loads and stores, every other one will span two cache lines. The penalty is fairly small compared to a cache miss, but unless you need a *lot* of small allocations, it's prob. best just to use 32B-aligned allocations always. Or to use the preprocessor to detect whether the target supports AVX, and if so, use 32B-aligned allocations. – Peter Cordes Nov 04 '15 at 18:31
• Also, for SSE knowing the data is aligned allows the compiler to fold loads into other SSE instructions, instead of using a separate `movups`. (SSE memory operands to non-mov instructions work like movaps. AVX memory operands to non-mov instructions work like movups.) – Peter Cordes Nov 04 '15 at 18:33
  • @PeterCordes, I'm not saying aligned memory does not matter anymore. I'm saying aligned instructions don't matter since Nehalem. The compiler should not need to know the pointers are aligned to produce optimal auto-vectorized code. – Z boson Nov 04 '15 at 19:29
• @PeterCordes, as to your second comment. I thought we already went over this. You even corrected my answer. The folding does not need to be aligned. Maybe at a hardware level, but not a software level. The compiler can still fold a movups and a mulps into one instruction. Look [here](http://stackoverflow.com/questions/31089502/aligned-and-unaligned-memory-access-with-avx-avx2-intrinsics/31098686#31098686). MSVC still folded even with an unaligned load. GCC does that now as well. – Z boson Nov 04 '15 at 19:30
• Everything you say is only true for AVX. If you want optimal code for **non-AVX** targets from the same code, you do need to tell the compiler the data is aligned. The example you linked folds an unaligned load into a VEX-coded instruction, which is safe. It would *not* be safe (i.e. possible) to do the same for non-VEX `mulps`. I think some compilers will try to reach an aligned pointer before starting in on the vector loop, rather than just using whatever the input alignment is, so telling the compiler the data is always aligned can eliminate bulky dead code and a couple startup checks. – Peter Cordes Nov 04 '15 at 19:42
  • @PeterCordes, thanks, I was not aware that only applied to AVX. But I don't think any compiler tries to reach a 64 byte aligned pointer just to make it cache-line aligned but I agree they may do this to be 16 byte aligned. – Z boson Nov 04 '15 at 19:46
  • @PeterCordes, okay, I updated my answer. I think my original claim is correct since Sandy Bridge. GCC still produces very different code for AVX when it's told the arrays are 32 byte aligned. I agree 64 byte alignment could make a difference but I did not tell the compiler this. My claim is that if the compiler is only told the array is 16 byte or 32 byte aligned it should produce the same code as if the arrays were 4 byte aligned. – Z boson Nov 04 '15 at 20:35
  • [Here's an example](http://goo.gl/10u2qa) of a function being much more bloated when the compiler doesn't know the pointer is aligned. A compile-time-constant loop count is a bit of a special case, since it lets the aligned version do without a cleanup loop at all. With two args, the C99 `restrict` keyword (and supported by some C++ compilers with the `__restrict__` keyword) can also avoid the check to fall back to a scalar loop when the buffers overlap. – Peter Cordes Nov 04 '15 at 20:36
  • You never need 64B alignment, except to avoid the array using a partial cache-line at the start *and* the end. (or for AVX512). 32B avoids cache-line splits with AVX. It doesn't matter whether the start of the buffer starts half way through a cache line or not, but it does matter if every other 256b op splits a cache-line. Compilers do sometimes generate multiple versions of loops, for different relative alignments of input pointers. (e.g. `a%16 == b%16` or not). – Peter Cordes Nov 04 '15 at 20:41
• @PeterCordes, I think it's clear from the `foo` function in my answer that GCC produces a lot more bloat when the compiler does not know the pointers are aligned. But my point is that for AVX it should not produce any more bloat. The only difference should be a `loadu` vs a `loada` – Z boson Nov 04 '15 at 20:42
  • re: your update: Don't mix up implementation with instruction set. Sandybridge introduced AVX, but you still can't fold unaligned memory references into SSE instructions, even if your code will only run on SnB. It's better to describe it as an AVX instruction set feature, not a Sandybridge hardware feature. Since this question is about how to write source code that compiles to good SSE *or* AVX code, you need to do everything necessary for good SSE code, even stuff that's not needed for the compiler to make good AVX code. None of the extra alignment *hurts* AVX code. – Peter Cordes Nov 04 '15 at 20:44
  • @PeterCordes, okay updated again though by Sandy Bridge I implicitly meant AVX I see how it could be misread. – Z boson Nov 04 '15 at 20:49
  • Looks fairly good now. I would have said some of it differently, and maybe put a summary at the top. (like, "The requirements are essentially the same for auto-vectorizing into good SSE or AVX code. Telling the compiler about any alignment and non-overlap guarantees can avoid a lot of code bloat".) I think we got side-tracked by folding loads, which isn't really essential for code that's not bottlenecked on the front-end. Also, I think you downplay the importance of having your buffers actually aligned too much. I agree it's not that important that the compiler know about it, – Peter Cordes Nov 04 '15 at 20:55
  • @PeterCordes, I updated my answer yet again before I read your last comment. I see your point about having the memory aligned to avoid cache line splits so it would be reasonable for the compiler to add extra code to make the memory aligned. – Z boson Nov 04 '15 at 20:59

My original answer became too messy to edit so I am adding a new answer here and making my original answer community wiki.

I did some tests using aligned and unaligned memory on a pre-Nehalem system and on a Haswell system with GCC, Clang, and MSVC.

The assembly shows that only GCC adds code to check and fix the alignment. Because of this, with `__builtin_assume_aligned` GCC produces much simpler code. But using `__builtin_assume_aligned` with Clang only changes unaligned instructions to aligned ones (the number of instructions stays the same). MSVC just uses unaligned instructions.

The performance result is that on pre-Nehalem systems, Clang and MSVC are much slower than GCC with auto-vectorization when the memory is not aligned.

But since Nehalem the penalty for cache-line splits is small. It turns out the cost of the extra code GCC adds to check and align the memory outweighs the small penalty saved by avoiding cache-line splits. This explains why neither Clang nor MSVC worries about cache-line splits when vectorizing.

So my original claim that auto-vectorization does not need to know about the alignment is more or less correct since Nehalem. That's not the same thing as saying that aligning memory is not useful since Nehalem.
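
A minimal sketch of allocating such aligned and deliberately misaligned test buffers (illustrative only, not my exact harness; _mm_malloc/_mm_free work with GCC, Clang, ICC, and MSVC, and foo is the function from my other answer):

#include <xmmintrin.h>  // _mm_malloc, _mm_free

void foo(float * __restrict a, float * __restrict b, int n);

void run_test(int n)
{
    // One extra float so the +1 offset stays in bounds.
    float *a = (float*)_mm_malloc((n + 1) * sizeof(float), 32);
    float *b = (float*)_mm_malloc((n + 1) * sizeof(float), 32);
    foo(a, b, n);          // 32-byte-aligned case
    foo(a + 1, b + 1, n);  // misaligned case: 4 bytes past a 32-byte boundary
    _mm_free(a);
    _mm_free(b);
}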
