2

When I run the code on my machine, the program goes segmentation fault.

#include <immintrin.h>
#include <stdint.h>

static inline __m256i load_vector(__m256i const * addr){
    __m256i res = _mm256_load_si256(addr);
    return res;
}
void test2(){
    int32_t *src;
    src = _mm_malloc(sizeof(__m256i), 32);
    __m256i vec = load_vector((__m256i const * )src);
    _mm_free(src);
}

int main(int argc,char *argv[]){
    test2();
    return 0;
}

I tried to debug this with gdb and it goes segmentation fault when _mm256_load_si256 is called.

I run the code on cygwin gcc on AMD 2990wx CPU. How can be happen such things?

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
Kuroda
  • 103
  • 6
  • 2
    Works on my machine; I don't see anything wrong there. You might try looking more closely with gdb to see what went wrong. What instruction generated the segfault? – Jason R Mar 08 '19 at 00:02
  • Is cygwin gcc's `_mm_malloc` broken and not returning 32-byte aligned memory? – Peter Cordes Mar 08 '19 at 08:11
  • Reading uninitialized memory is Undefined Behavior: https://stackoverflow.com/a/37184840 – chtz Mar 08 '19 at 13:05
  • 2
    @chtz Technically it's UB, but we can do better than that. I don't see how that can cause the OP's segfault. @OP since you're using cygwin, that probably means Windows. What compiler flags are you using? If it's `-O0` then it's possible that `res` is being put on the stack. [And GCC has a stack alignment problem that has made AVX unusable on Windows since antiquity.](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54412) – Mysticial Mar 08 '19 at 20:23
  • @Mysticial I agree that this is unlikely the cause of the segfault. I therefore just posted it as a comment (of course, I could have made it more clear that this is likely unrelated). – chtz Mar 09 '19 at 20:38
  • @Kuroda: post your answer as an *answer*, not an edit to the question. Your debug results show that Mysticial was right, you're suffering from a cygwin-gcc bug. I have no idea why that bug isn't fixed or even exists in the first place; gcc does have to manually align the stack on Linux before it can spill a `__m256i` but has no trouble doing so. Presumably you'd be fine with clang. – Peter Cordes Mar 10 '19 at 12:55
  • @Mysticial and guys, thanks for your comments. – Kuroda Mar 10 '19 at 17:36

1 Answers1

3

I did further debug. _mm_malloc wasn't the problem, it was alignment of local variables.

At the second vmovdqa to store the vector into the caller's pointer, RAX was not 32-byte aligned. vec in test2 seems not to be aligned. (Cygwin/mingw return the __m256i vector by reference with the caller passing a hidden pointer, unlike the standard Windows x64 calling convention that return it by value).

This is the known Cygwin bug (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54412) that Mysticial linked in comments: Cygwin GCC can't safely use AVX because it doesn't properly align the stack for __m256i locals that get stored to memory. (Cygwin/MinGW gcc will properly align alignas(32) int arr[8] = {0};, but they do it by aligning a separate pointer, not RSP or RBP. Apparently there's some SEH limitation on stack frame manipulation)

Clang, MSVC, and ICC all support __m256i properly.

With optimization enabled gcc often won't make faulting code, but sometimes even optimized code will store/reload a 32-byte vector to the stack.

_ZL11load_vectorPKDv4_x:
.LFB3671:
    .file 2 "min_case.c"
    .loc 2 4 0
    .cfi_startproc
    pushq   %rbp
    .seh_pushreg    %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .seh_setframe   %rbp, 0
    .cfi_def_cfa_register 6
    subq    $16, %rsp
    .seh_stackalloc 16
    .seh_endprologue
    movq    %rcx, 16(%rbp)
    movq    %rdx, 24(%rbp)
    movq    24(%rbp), %rax
    movq    %rax, -8(%rbp)
.LBB4:
.LBB5:
    .file 3 "/usr/lib/gcc/x86_64-pc-cygwin/7.4.0/include/avxintrin.h"
    .loc 3 909 0
    movq    -8(%rbp), %rax
    vmovdqa (%rax), %ymm0
.LBE5:
.LBE4:
    .loc 2 5 0
    movq    16(%rbp), %rax
    vmovdqa %ymm0, (%rax)
    .loc 2 6 0
    movq    16(%rbp), %rax
    addq    $16, %rsp
    popq    %rbp
    .cfi_restore 6
    .cfi_def_cfa 7, 8
    ret

__m256i was not aligned in this test-case:

#include <immintrin.h>
#include <stdint.h>
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

const char* check_alignment(const void *ptr, uintptr_t alignment){
    return (((uintptr_t)ptr) & (alignment - 1)) == 0 ? "aligned" : "NOT aligned";
}

static inline __m256i load_vector(__m256i const * addr){
    printf("addr:%s\n", check_alignment(addr, 32));
    __m256i res;
    printf("&res:%s\n", check_alignment(&res, 32));
    res = _mm256_load_si256(addr);
    return res;
}
void test2(){
    int32_t *src;
    src = (int32_t *)_mm_malloc(sizeof(__m256i), 32);
    src[0] = 0; src[0] = 1; src[2] = 2; src[3] = 3;
    src[4] = 4; src[5] = 5; src[6] = 6; src[7] = 7;
    __m256i vec = load_vector((__m256i const * )src);
    _mm_free(src);
}

int main(int argc,char *argv[]){
    test2();
    return 0;
}

// results
// addr:aligned
// &res:NOT aligned
// Segmentation fault
Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
Kuroda
  • 103
  • 6