Why does this AVX intrinsic cause "Segmentation fault" with clang, but not GCC?

Question

It seems the two functions below can cause a segmentation fault when compiled with clang using -mavx (or any -march from sandybridge through skylake).

void _mm256_mul_double_intrin(double* a, double* b, int N)
{
    int nb_iters = N / ( sizeof(__m256d) / sizeof(double) );

    __m256d* l = (__m256d*)a;
    __m256d* r = (__m256d*)b;

    for (int i = 0; i < nb_iters; ++i, ++l, ++r)
        _mm256_store_pd((double *)l, _mm256_mul_pd(*l, *r));

}

void _mm256_mul_double(double* a, double* b, int N)
{
    int nb_iters = N / ( sizeof(__m256d) / sizeof(double) );

    __m256d* l = (__m256d*)a;
    __m256d* r = (__m256d*)b;

    for (int i = 0; i < nb_iters; ++i, ++l, ++r)
        __asm__(
            "vmulpd %[r], %[l], %[l] \t\n"
            : [l] "+x" (*l)
            : [r] "m" (*r)
            :
        );
}

When N is two or more multiples of 4 (ymm register width / double width), the clang-compiled code sometimes causes a segmentation fault (see the wandbox link below).

The GCC-compiled code seems okay.

godbolt.org/g/YPa7mU

wandbox.org/permlink/kex4e3lRCKfPAq2J

** I found the original source code here on stackoverflow.com

@BeeOnRope I don't claim anything at all. wandbox.org does : https://wandbox.org/permlink/kex4e3lRCKfPAq2J — sandthorn, Oct 30 '17 at 08:43
It's not just the intrinsics, writing `*l` is already wrong. — Marc Glisse, Oct 30 '17 at 08:52
With gcc (didn't check clang), you can do `typedef double uvec __attribute__ ((__vector_size__ (32), __may_alias__, __aligned__ (1)));` and use uvec instead of `__m256d`, this lets the compiler know that it should produce unaligned loads/stores. Note that gcc tends to align arrays more than required, for performance reasons, which might explain why it "works" in your case, although it could also be purely random. — Marc Glisse, Oct 30 '17 at 10:08

Peter Cordes · Accepted Answer · 2017-10-30T11:34:26.893

The answer is right there in the asm you linked on Godbolt:

gcc uses andq $-32, %rsp to align the stack by 32, so all the alignment-required loads and stores in your code don't fault. (Those are the dereferences of a __m256d*, and the use of _mm256_store_pd instead of _mm256_storeu_pd.) AVX instructions don't generally require alignment, but the aligned-move instructions (like vmovapd) do.

This is only possible for gcc because your test-case lets the functions using __m256d operations on double a[] and double b[] inline into the function that allocates the array on the stack.

For example:

void ext(double *);
void foo(void) {
    double tmp [1024];
    ext(tmp);
}

compiles to a simple allocation, with no over-alignment of the stack:

    subq    $8200, %rsp
    movq    %rsp, %rdi
    call    ext(double*)
    addq    $8200, %rsp
    ret

The x86-64 SysV ABI only requires 16B stack alignment. (And gcc doesn't choose to maintain more than that.) So if ext() was actually one of your functions that required 32-byte alignment of the double*, it would fault.

gcc doesn't know that 32B-alignment would be a performance boost for ext(), so it doesn't spend the instructions to align all automatic-storage arrays. If there's a correctness problem, that's your fault!

Clang doesn't do any alignment even after inlining, and just reserves space on the stack with subq $248, %rsp. So even in your test-case, stack address-space randomization will only give you a 32B-aligned stack half the time.
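Whether a given pointer happens to meet the 32-byte requirement can be checked at run time. The following is only a sketch: `is_aligned` and `require_avx_aligned` are hypothetical helpers invented for illustration, not part of the question's code or of any library, and the check assumes the usual flat address model of x86-64.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if p is a multiple of `alignment` bytes.
// `alignment` is assumed to be a power of two.
inline bool is_aligned(const void* p, std::size_t alignment)
{
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Hypothetical guard: call before handing pointers to code that uses
// alignment-required accesses (dereferencing a __m256d*, _mm256_store_pd).
inline void require_avx_aligned(const double* a, const double* b)
{
    assert(is_aligned(a, 32) && "a must be 32-byte aligned");
    assert(is_aligned(b, 32) && "b must be 32-byte aligned");
}
```

An address that is only guaranteed to be 16-byte aligned is either 0 or 16 bytes past a 32-byte boundary, which is why a randomized stack passes this check only about half the time.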

If you used alignas(32) double a[], all compilers would be required to align the array. (alignas doesn't work for dynamic storage like new or malloc, but it does work for automatic and static arrays. For dynamic, see How to solve the 32-byte-alignment issue for AVX load/store operations?).
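As a rough illustration of the alignas(32) suggestion, the sketch below declares one static and two automatic arrays and confirms their addresses are multiples of 32. The names `static_buf`, `is_32B_aligned` and `arrays_are_aligned` are made up for this example, and the function merely stands in for the question's test case; it does not call the AVX functions themselves.

```cpp
#include <cassert>
#include <cstdint>

// Static storage: alignas applies here as well.
alignas(32) static double static_buf[8];

// Hypothetical helper, named only for this sketch.
static bool is_32B_aligned(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 32 == 0;
}

// Hypothetical caller standing in for the question's test case.
bool arrays_are_aligned()
{
    alignas(32) double a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    alignas(32) double b[8] = {2, 2, 2, 2, 2, 2, 2, 2};

    // With these declarations, a and b could be passed to a function that
    // dereferences a __m256d* or uses _mm256_store_pd without faulting.
    bool ok = is_32B_aligned(a) && is_32B_aligned(b) && is_32B_aligned(static_buf);
    assert(ok);
    return ok;
}
```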

@sandthorn yes, with C++11 `alignas(32)`, see updated answer — Peter Cordes, Oct 30 '17 at 11:34
How can I put "alignas" into function parameter to give more restraints? If not, will there be one in "Concept"? — sandthorn, Oct 30 '17 at 12:04
@sandthorn: You can't. IDK if a future version of C++ would let you make it a compile-time error to pass an under-aligned array to a function that requires aligned pointers, and track minimum-alignments of pointers as they're passed around. (It is helpful for compilers to know when pointers are aligned, especially gcc with default `-mtune=generic` or Sandybridge/Ivybridge, where gcc splits unaligned 256b loads/stores into 128b halves with `vmovdqu` / `vinsertf128`, so it's actually less efficient if it turns out that the pointer is aligned at run-time.) — Peter Cordes, Oct 30 '17 at 12:21

keith · Answer 2 · 2017-10-30T10:14:50.350

This is probably down to memory alignment. However, modern processors can read/write unaligned memory as efficiently as aligned memory (or very nearly so), so use _mm256_loadu_pd((double *)r) instead of *r, _mm256_loadu_pd((double *)l) instead of *l, and _mm256_storeu_pd to store the result.
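A possible unaligned rewrite of the question's intrinsic function is sketched below. The name `mul_double_unaligned` is hypothetical, and the scalar tail loop is an addition (the original ignores leftover elements when N is not a multiple of 4). The `target("avx")` attribute is a GCC/clang extension assumed here so the snippet builds without -mavx; the code is x86-only and must run on a CPU that supports AVX.

```cpp
#include <immintrin.h>
#include <cassert>

// Hypothetical replacement for the question's function: a[i] *= b[i].
// Uses unaligned loads/stores, so a and b need no 32-byte alignment.
// target("avx") is redundant when compiling with -mavx.
__attribute__((target("avx")))
void mul_double_unaligned(double* a, const double* b, int n)
{
    assert(n >= 0);
    const int step = sizeof(__m256d) / sizeof(double);  // 4 doubles per ymm register

    int i = 0;
    for (; i + step <= n; i += step) {
        __m256d l = _mm256_loadu_pd(a + i);   // typically vmovupd, never faults on misalignment
        __m256d r = _mm256_loadu_pd(b + i);
        _mm256_storeu_pd(a + i, _mm256_mul_pd(l, r));
    }

    // Scalar tail for the last n % 4 elements (not present in the original).
    for (; i < n; ++i)
        a[i] *= b[i];
}
```

Because the function takes plain double pointers and never dereferences a __m256d*, the compiler has no grounds to assume 32-byte alignment and so should not emit alignment-required accesses.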

edited Oct 30 '17 at 10:14

answered Oct 30 '17 at 09:55

@Paul R, yes I forgot to mention making the store unaligned, however, the aligned load will still be a problem. – keith Oct 30 '17 at 10:09
@PaulR vmulpd cannot take both arguments from memory... Anyway intrinsics work at a higher level than that and don't know about memory vs registers. And `*l` is equivalent to `_mm256_load_pd`, even if in some cases it can be optimized away. – Marc Glisse Oct 30 '17 at 10:11
@MarcGlisse: yes, that's true. Perhaps keith can clarify this in the answer then. – Paul R Oct 30 '17 at 10:13
@Paul R, I won't begrudge someone writing a better answer. I've found intrinsics are a bit of a black art to learn what works and how to avoid surprises. I'm not sure I can summarise that scientifically. – keith Oct 30 '17 at 10:27
@keith: sorry for the noise - I didn't read the original question properly - down-vote changed to up-vote. – Paul R Oct 30 '17 at 10:29
