_mm256_load_ps causes a segmentation fault with google/benchmark in debug mode

Question

The following code runs correctly in both release and debug mode.

#include <immintrin.h>

constexpr int n_batch = 10240;
constexpr int n = n_batch * 8;
#pragma pack(32)
float a[n];
float b[n];
float c[n];
#pragma pack()

int main() {
    for(int i = 0; i < n; ++i)
        c[i] = a[i] * b[i];

    for(int i = 0; i < n; i += 4) {
        __m128 av = _mm_load_ps(a + i);
        __m128 bv = _mm_load_ps(b + i);
        __m128 cv = _mm_mul_ps(av, bv);
        _mm_store_ps(c + i, cv);
    }

    for(int i = 0; i < n; i += 8) {
        __m256 av = _mm256_load_ps(a + i);
        __m256 bv = _mm256_load_ps(b + i);
        __m256 cv = _mm256_mul_ps(av, bv);
        _mm256_store_ps(c + i, cv);
    }
}

The following code runs only in release mode; in debug mode it crashes with a segmentation fault.

#include <immintrin.h>

#include "benchmark/benchmark.h"

constexpr int n_batch = 10240;
constexpr int n = n_batch * 8;
#pragma pack(32)
float a[n];
float b[n];
float c[n];
#pragma pack()

static void BM_Scalar(benchmark::State &state) {
    for(auto _: state)
        for(int i = 0; i < n; ++i)
            c[i] = a[i] * b[i];
}
BENCHMARK(BM_Scalar);

static void BM_Packet_4(benchmark::State &state) {
    for(auto _: state) {
        for(int i = 0; i < n; i += 4) {
            __m128 av = _mm_load_ps(a + i);
            __m128 bv = _mm_load_ps(b + i);
            __m128 cv = _mm_mul_ps(av, bv);
            _mm_store_ps(c + i, cv);
        }
    }
}
BENCHMARK(BM_Packet_4);

static void BM_Packet_8(benchmark::State &state) {
    for(auto _: state) {
        for(int i = 0; i < n; i += 8) {
            __m256 av = _mm256_load_ps(a + i); // Signal: SIGSEGV (signal SIGSEGV: invalid address (fault address: 0x0))
            __m256 bv = _mm256_load_ps(b + i);
            __m256 cv = _mm256_mul_ps(av, bv);
            _mm256_store_ps(c + i, cv);
        }
    }
}
BENCHMARK(BM_Packet_8);

BENCHMARK_MAIN();

You forgot `alignas(32) float a[n];` to match the alignment-required load intrinsic you chose. — Peter Cordes, Jun 11 '20 at 12:08
https://learn.microsoft.com/en-us/cpp/preprocessor/pack?view=vs-2019 says `#pragma pack(8)` only affects struct/union/class objects, and that would be 8-byte alignment, not chunks of 8 floats. — Peter Cordes, Jun 11 '20 at 12:20
@PeterCordes Sorry for the typo of `#pragma pack(8)`. The error still exists with `#pragma pack(32)`. — chaosink, Jun 11 '20 at 12:32
Right, because arrays aren't structs, unions, or classes. That pragma has no effect on them. Is that the real answer to your question? The current duplicate covers why there are differences with unoptimized vs. optimized builds. (I assume with GCC or clang? MSVC avoids aligned load/store instructions.) — Peter Cordes, Jun 11 '20 at 12:33
Yes, `alignas(32)` does work. But why doesn't `#pragma pack(32)` work? — chaosink, Jun 11 '20 at 12:35
Because arrays aren't structs, unions, or classes, like I already said. Arrays are a different kind of C++ object. — Peter Cordes, Jun 11 '20 at 12:35
I think my question is more clear, and shows the difference between `#pragma pack(32)` and `alignas(32)`. Could you please cancel the `[duplicate]`? — chaosink, Jun 11 '20 at 13:02

Peter Cordes · Accepted Answer · 2021-03-29T20:26:39.950

Your arrays aren't aligned by 32. You could check this with a debugger.
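If a debugger isn't handy, the same check can be done in code. The following is a minimal sketch, not part of the original post: `is_aligned` and `demo` are hypothetical names introduced here for illustration, and the demo buffer uses `alignas(32)` only so that the result is predictable.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from the original post): returns true when the
// address held in p is a multiple of `alignment` bytes.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Hypothetical demo buffer. alignas(32) guarantees that it starts on a
// 32-byte boundary, so demo + 1 (4 bytes further on) cannot be 32-byte
// aligned, while demo + 8 (32 bytes further on) is.
alignas(32) float demo[16];
```

Calling `is_aligned(a, 32)` on the arrays from the question before the AVX loop should show whether the linker happened to place them on a 32-byte boundary in a given build.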

#pragma pack(32) only affects the alignment of struct/union/class members, as documented by MS. C++ arrays are a different kind of object and aren't affected at all by that MSVC pragma. (I think you're actually using GCC's or clang's version of it, though, because MSVC generally uses vmovups, not vmovaps.)
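A small sketch of that difference, under stated assumptions: it uses `#pragma pack(push, 1)` rather than `pack(32)` because a packing value of 1 makes the effect on struct members visible in `sizeof`, and the names `Unpacked`, `Packed`, and `packed_scope_array` are hypothetical. The sizes in the comments assume a typical platform where `int` is 4 bytes with 4-byte alignment.

```cpp
#include <cstddef>

// Default layout: padding is typically inserted after `tag` so that
// `value` lands on its natural alignment.
struct Unpacked {
    char tag;
    int value;
};

#pragma pack(push, 1)
// Inside the pragma, member alignment is capped at 1 byte, so no padding
// is inserted between `tag` and `value`.
struct Packed {
    char tag;
    int value;
};

// An array declared inside the same pragma scope is not a struct, union,
// or class, so the pragma does not change its alignment requirement.
float packed_scope_array[8];
#pragma pack(pop)

static_assert(sizeof(Packed) == sizeof(char) + sizeof(int),
              "pack(1) removes the padding inside the struct");
static_assert(alignof(decltype(packed_scope_array)) == alignof(float),
              "the array type keeps the alignment of float");
```

The pragma changes how members are laid out inside `Packed`, but the array still only has the alignment requirement of `float`, which is why it cannot be relied on for `_mm256_load_ps`.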

For arrays in static or automatic storage (not dynamically allocated), the easiest way to align arrays in C++11 and later is alignas(32). That's fully portable, unlike GNU C __attribute__((aligned(32))) or MSVC's __declspec(align(32)).

alignas(32) float a[n];
alignas(32) float b[n];
alignas(32) float c[n];

The linked question "AVX: data alignment: store crash, storeu, load, loadu doesn't" explains why there's a difference depending on optimization level: optimized code will fold one load into a memory source operand for vmulps, which (unlike SSE) doesn't require alignment. (Presumably the first array happens to be sufficiently aligned.)

Un-optimized code will do the _mm256_load_ps separately with a vmovaps alignment-required load.

(_mm256_loadu_ps will always avoid using alignment-required loads, so use that if you can't guarantee your data is aligned.)
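As a rough illustration of that advice, here is a sketch of the multiply loop written with the unaligned intrinsics. `mul_unaligned` is a hypothetical helper, not code from the question; the `__AVX__` guard and the scalar tail loop are additions so that the snippet also compiles when AVX isn't enabled and handles lengths that aren't a multiple of 8.

```cpp
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical helper: out[i] = x[i] * y[i] for i in [0, count).
// Makes no assumption about the alignment of x, y, or out.
inline void mul_unaligned(const float* x, const float* y, float* out,
                          std::size_t count) {
    std::size_t i = 0;
#if defined(__AVX__)
    // Process 8 floats per iteration with loads/stores that do not
    // require 32-byte alignment.
    for (; i + 8 <= count; i += 8) {
        __m256 xv = _mm256_loadu_ps(x + i);
        __m256 yv = _mm256_loadu_ps(y + i);
        _mm256_storeu_ps(out + i, _mm256_mul_ps(xv, yv));
    }
#endif
    // Scalar loop: handles the remaining elements, or everything when the
    // compiler is not targeting AVX.
    for (; i < count; ++i)
        out[i] = x[i] * y[i];
}
```

With this version, passing a deliberately misaligned pointer such as `a + 1` should not fault, whereas `_mm256_load_ps(a + 1)` would be expected to fault in an unoptimized build.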
