_mm256_load_ps segmentation fault

Question

I'm developing a high-throughput, low-latency real-time program that involves several matrix operations.
I have decided to use AVX2 or AVX-512 to boost the performance of the system. This is my first attempt to use the AVX instruction set, or SIMD in general.
I'm using the AVX intrinsic functions available in g++.
The problem I am facing is that when I use _mm256_load_ps I get a segmentation fault, but when I use _mm256_set_ps the program runs.
I was told _mm256_load_ps will perform better than _mm256_set_ps in my application. What am I doing wrong?
This is a program to use AVX2 to add 2 matrices.

Code

#include <immintrin.h>
#include <cstdint>   // std::uint64_t
#include <string.h>  // memcpy

const std::uint64_t MAX_COUNT = 100000;
int main()    
{
    float mat1[MAX_COUNT], mat2[MAX_COUNT], rslt[MAX_COUNT];
    for(int i = 0; i < MAX_COUNT; i++){
        mat1[i] = i;
        mat2[i] = 100-i;
    }
    
    for(int i = 0; i < MAX_COUNT; i +=8)
    {
        //Working Properly
        //auto avx_a = _mm256_set_ps(mat1[i+7], mat1[i+6], mat1[i+5], mat1[i+4], mat1[i+3], mat1[i+2], mat1[i+1], mat1[i+0]);
        //Working Properly
        //auto avx_b = _mm256_set_ps(mat2[i+7], mat2[i+6], mat2[i+5], mat2[i+4], mat2[i+3], mat2[i+2], mat2[i+1], mat2[i+0]);
        //Resulting in segmentation fault
        auto avx_a = _mm256_load_ps(&mat1[i]);
        //Resulting in segmentation fault
        auto avx_b = _mm256_load_ps(&mat2[i]);
        auto avx_c = _mm256_add_ps(avx_a, avx_b);
        float *result = (float*)&avx_c;
        memcpy(&rslt[i], result, 8*sizeof(float));
    }
    
    return 0;
}

Aligning Data

__declspec(align(32)) float mat1[MAX_COUNT]

Error

test_2.cpp: In function ‘int main()’:
test_2.cpp:11:21: error: too few arguments to function ‘void* std::align(std::size_t, std::size_t, void*&, std::size_t&)’
   11 |     __declspec(align(32)) float mat1[MAX_COUNT];
      |                ~~~~~^~~~
In file included from /usr/include/c++/11/memory:72,
                 from /usr/include/x86_64-linux-gnu/c++/11/bits/stdc++.h:82,
                 from test_2.cpp:2:
/usr/include/c++/11/bits/align.h:62:1: note: declared here
   62 | align(size_t __align, size_t __size, void*& __ptr, size_t& __space) noexcept
      | ^~~~~
test_2.cpp:11:5: error: ‘__declspec’ was not declared in this scope
   11 |     __declspec(align(32)) float mat1[MAX_COUNT];
      |     ^~~~~~~~~~

@chtz: Copying unaligned memory to aligned memory is typically _more_ expensive than just taking the unaligned-load penalty, especially now that modern CPUs will automatically use aligned loads when that's possible at runtime. IOW, `_mm256_loadu_ps(p)` is just as fast as `_mm256_load_ps(p)` for aligned `p`, without the downside of failing on unaligned `p`. This makes `_mm256_load_ps(p)` strictly inferior and de facto obsolete. — MSalters, Nov 10 '22 at 12:22
@MSalters: That's why my answer on that linked question starts by pointing out that you can and should use `_mm256_loadu_ps` if you can't easily change the rest of your program to only ever pass aligned buffers to your function. At least that's how I intended the top of it to read. Nobody suggested memcpy to aligned memory, especially not once per `_mm_load`. Although to be fair, we sometimes see that terrible strategy in SO questions from beginners who hadn't discovered `_mm_loadu`. — Peter Cordes, Nov 10 '22 at 15:11
@MSalters: `_mm256_load_ps` / `_mm256_store_ps` aren't obsolete; they're useful for *checking* alignment, like a free `assert(this is aligned)`. Especially in a debug build where the load won't fold into a memory operand which doesn't require alignment with AVX. It also communicates that alignment guarantee to the compiler, which is useful for GCC10 and earlier where `tune=generic` included `-mavx256-split-unaligned-load` ([Why doesn't gcc resolve \_mm256\_loadu\_pd as single vmovupd?](//stackoverflow.com/q/52626726)) because of Sandybridge (before Haswell) and Bulldozer-family. — Peter Cordes, Nov 10 '22 at 15:19
@PeterCordes, @MSalters, I tried to align the array using `__declspec(align(8))`, but I am getting the error shown in the updated question. I am using g++ 11 on Linux. The final build will be on VxWorks. — Dark Sorrow, Nov 11 '22 at 04:51
An aligned 32-byte load needs 32-byte alignment, not just 8. Did you read the linked duplicate? `alignas(32)` works portably in C++11. Also, `__declspec` is MSVC-only, so of course you get errors if you try to compile it with GCC. — Peter Cordes, Nov 11 '22 at 05:08
@PeterCordes I tried with __declspec(align(32)) and __declspec(align(64)) but I am getting the same error : `error: too few arguments to function ‘void* std::align` — Dark Sorrow, Nov 11 '22 at 05:10
See my edit to the previous comment, and the linked duplicate. — Peter Cordes, Nov 11 '22 at 05:11
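
The `alignas(32)` suggestion from the comments could look roughly like the sketch below. The names `COUNT`, `is_aligned32`, and `add_aligned` are made up for illustration, and the array size is an arbitrary multiple of 8. The `target("avx")` attribute is a GCC/Clang extension used here only so the function compiles without `-mavx`; it can be dropped when building with `-mavx` or `-march=native`. The asserts mirror the point made above that an aligned load doubles as an alignment check.

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t COUNT = 1024;  // hypothetical size, a multiple of 8

// Returns true when p sits on a 32-byte boundary.
inline bool is_aligned32(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 32 == 0;
}

// Adds a[i] + b[i] into out[i]. All three pointers must be 32-byte aligned
// and n must be a multiple of 8, because _mm256_load_ps / _mm256_store_ps
// fault on addresses that are not 32-byte aligned.
__attribute__((target("avx")))
void add_aligned(const float* a, const float* b, float* out, std::size_t n)
{
    assert(is_aligned32(a) && is_aligned32(b) && is_aligned32(out));
    assert(n % 8 == 0);
    for (std::size_t i = 0; i < n; i += 8) {
        __m256 va = _mm256_load_ps(a + i);
        __m256 vb = _mm256_load_ps(b + i);
        _mm256_store_ps(out + i, _mm256_add_ps(va, vb));
    }
}

// alignas(32) is standard C++11 and replaces the MSVC-only __declspec(align(32)).
alignas(32) float mat1[COUNT];
alignas(32) float mat2[COUNT];
alignas(32) float rslt[COUNT];
```

Calling `add_aligned(mat1, mat2, rslt, COUNT)` after filling the inputs does the same work as the loop in the question. `alignas(32)` can also be applied to local arrays such as the ones declared inside `main` in the question.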

score 3 · Accepted Answer · answered Nov 10 '22 at 12:16

_mm256_load_ps requires 32-byte-aligned memory. _mm256_set_ps doesn't even require contiguous addresses.

You want _mm256_loadu_ps - unaligned load, but still from a contiguous array.
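
As a rough sketch of how the loop from the question might look with unaligned loads: the function name `add_arrays` and its signature are assumptions made for illustration, not something from the question. It also uses `_mm256_storeu_ps` in place of the pointer cast plus `memcpy`, and adds a scalar loop for lengths that are not a multiple of 8. The `target("avx")` attribute is a GCC/Clang extension included only so the function compiles without `-mavx`.

```cpp
#include <immintrin.h>
#include <cstddef>

// Adds a[i] + b[i] into out[i] for i in [0, n).
// No alignment requirement on any of the pointers.
__attribute__((target("avx")))
void add_arrays(const float* a, const float* b, float* out, std::size_t n)
{
    std::size_t i = 0;
    // Main loop: 8 floats per iteration via unaligned loads and stores.
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
    // Scalar tail for the remaining 0..7 elements.
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

With the arrays from the question, the call would be `add_arrays(mat1, mat2, rslt, MAX_COUNT)`.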

MSalters

Which is the preferred function for low-latency real-time programming? Should I use `_mm256_loadu_ps` or `_mm256_set_ps`? Which has the least overhead and executes more quickly? Memory (size) considerations are secondary. – Dark Sorrow Nov 11 '22 at 04:46