The documentation of _mm256_load_ps states that the memory must be 32bit-aligned in order to load the values into the registers.

So I found a post that explained when an address is 32bit aligned.

#include <immintrin.h>
#include <iostream>
#include <vector>

int main() {
    constexpr std::size_t height = 8, width = 8; // example sizes; not shown in the original post
    std::vector<float> A(height * width, 0);
    std::cout << "&A = " << A.data() << std::endl; // 0x55e960270eb0
    __m256 a_row = _mm256_load_ps(A.data());
    return 0; // Exit Code 139 SIGSEGV
}

So I tried that code, and I expected it to work. I checked the address:
0x55e960270eb0 % 4 = 0, and floats are 4 bytes in size.
I am completely baffled by the crash. If I use a raw array allocated with _mm_malloc, suddenly everything works:

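#include <immintrin.h>
#include <iostream>
#include <vector>

int main() {
    constexpr std::size_t height = 8, width = 8; // example sizes; not shown in the original post
    std::vector<float> A(height * width, 0);
    std::cout << "&A = " << A.data() << std::endl; // &A = 0x55e960270eb0

    float* m = static_cast<float*>(_mm_malloc(A.size() * sizeof(float), 32));
    std::cout << "m* = " << m << std::endl; // m* = 0x562bbe989700

    __m256 a_row = _mm256_load_ps(m);

    _mm_free(m); // memory from _mm_malloc must be released with _mm_free, not delete

    return 0; // Returns 0
}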

What am I missing or misinterpreting?

JN98ZK
  • You misread [this](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_load_ps&expand=3333): it says 32 BYTE aligned, not BIT. For stack variables you can set the alignment in code with `alignas(32) std::array a;` – Arty Jun 09 '21 at 14:45
  • @Arty Thank you very much. I wouldn't have anticipated that – JN98ZK Jun 09 '21 at 14:47
  • Sometimes `alignas` isn't good enough. (It's been a long while, so my recollection is fuzzy.) If you have an MMU block that has to be 256-byte aligned, or a DMA buffer that has to be 4096-byte aligned, you may need special custom alignment handling that the compiler can't provide. For an AVX scenario like this one, I'd expect the compiler to be able to handle it. – Eljay Jun 09 '21 at 14:54

1 Answer

You misread the documentation: it says 32 BYTE aligned, not 32 BIT.

So you need 32-byte alignment, not 4-byte alignment.

To align any stack variable you can use alignas(32) T var;, where T can be any type, for example std::array<float, 8>.

To align std::vector's memory, or that of any other heap-based structure, alignas(...) is not enough; you have to write a special aligning allocator (see the Test() function for an example of usage):


#include <cstdlib>  // posix_memalign / _aligned_malloc, std::free
#include <memory>
#include <new>      // placement new, std::bad_alloc

// Following includes for tests only
#include <vector>
#include <iostream>
#include <cmath>

template <typename T, std::size_t N>
class AlignmentAllocator {
  public:
    typedef T value_type;
    typedef std::size_t size_type;
    typedef std::ptrdiff_t difference_type;
    typedef T * pointer;
    typedef const T * const_pointer;
    typedef T & reference;
    typedef const T & const_reference;

  public:
    // noexcept replaces the dynamic exception specification throw(),
    // which was deprecated and then removed in C++20.
    inline AlignmentAllocator() noexcept {}
    template <typename T2> inline AlignmentAllocator(const AlignmentAllocator<T2, N> &) noexcept {}
    inline ~AlignmentAllocator() noexcept {}
    inline pointer address(reference r) { return &r; }
    inline const_pointer address(const_reference r) const { return &r; }
    inline pointer allocate(size_type n);
    inline void deallocate(pointer p, size_type);
    inline void construct(pointer p, const value_type & wert);
    inline void destroy(pointer p) { p->~value_type(); }
    inline size_type max_size() const noexcept { return size_type(-1) / sizeof(value_type); }
    template <typename T2> struct rebind { typedef AlignmentAllocator<T2, N> other; };
    bool operator!=(const AlignmentAllocator<T, N> & other) const { return !(*this == other); }
    bool operator==(const AlignmentAllocator<T, N> & other) const { return true; }
};

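template <typename T, std::size_t N>
inline typename AlignmentAllocator<T, N>::pointer AlignmentAllocator<T, N>::allocate(size_type n) {
    // An allocator must report failure by throwing, not by returning a null pointer.
    #if _MSC_VER
        void * p0 = _aligned_malloc(n * sizeof(value_type), N);
        if (p0 == nullptr) throw std::bad_alloc();
    #else
        void * p0 = nullptr;
        // posix_memalign needs N to be a power of two and a multiple of sizeof(void *).
        if (posix_memalign(&p0, N, n * sizeof(value_type)) != 0) throw std::bad_alloc();
    #endif
    return static_cast<pointer>(p0);
}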
template <typename T, std::size_t N>
inline void AlignmentAllocator<T, N>::deallocate(pointer p, size_type) {
    #if _MSC_VER
        _aligned_free(p);
    #else
        std::free(p);
    #endif
}
template <typename T, std::size_t N>
inline void AlignmentAllocator<T, N>::construct(pointer p, const value_type & wert) {
    new (p) value_type(wert);
}

template <typename T, size_t N = 64>
using AlignedVector = std::vector<T, AlignmentAllocator<T, N>>;

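template <std::size_t Align>
void Test() {
    AlignedVector<float, Align> v(1);
    // Count the trailing zero bits of the address: 2^alignment is the largest
    // power of two that divides it, i.e. the alignment actually obtained.
    std::size_t uptr = reinterpret_cast<std::size_t>(v.data()), alignment = 0;
    while (!(uptr & 1)) {
        ++alignment;
        uptr >>= 1;
    }
    std::cout << "Requested: " << Align << ", Actual: " << (std::size_t(1) << alignment) << std::endl;
}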

int main() {
    Test<8>();
    Test<16>();
    Test<32>();
    Test<64>();
    Test<128>();
    Test<256>();
}

Output:

Requested: 8, Actual: 16
Requested: 16, Actual: 16
Requested: 32, Actual: 32
Requested: 64, Actual: 128
Requested: 128, Actual: 8192
Requested: 256, Actual: 256

As you can see in the code above, I used posix_memalign() for Clang/GCC and _aligned_malloc() for MSVC. Since C++17 there is also std::aligned_alloc(), but not all compilers implement it; MSVC, at least, does not. On Clang/GCC you can use std::aligned_alloc() instead of posix_memalign(), as commented by @Mgetz. Note that in the output the actual alignment is never smaller than the requested one; it can be larger simply because the returned address happens to be divisible by a higher power of two.

Also, as the Intel guide says, you can use _mm_malloc() and _mm_free() instead of posix_memalign()/_aligned_malloc()/_aligned_free()/std::aligned_alloc()/std::free(). Memory obtained from _mm_malloc() must be released with _mm_free().

Arty
  • This would be very nice. I want to avoid raw pointers. – JN98ZK Jun 09 '21 at 14:52
  • @JohnZakariaAbdElMesiih I just wrote an allocator for you, see my updated answer with code! Ask if you have any questions about using it. An example of usage is shown in the `Test()` function in my code. This allocator can be used for any std:: container, not just std::vector; for example it can be used for std::map, std::set, std::list, etc. – Arty Jun 09 '21 at 15:05
  • Note: since C++17, `posix_memalign` isn't necessary on non-MSVC platforms; the standard now has [`std::aligned_alloc`](https://en.cppreference.com/w/cpp/memory/c/aligned_alloc), which has pretty much identical but in theory more portable semantics (except to the MS C stdlib). – Mgetz Jun 09 '21 at 15:19
  • @Mgetz So you're saying that everywhere except MSVC I can use `std::aligned_alloc()`? – Arty Jun 09 '21 at 15:22
  • @Mgetz Added a note to the end of my answer regarding `posix_memalign()` and `std::aligned_alloc()`. – Arty Jun 09 '21 at 15:25
  • @Arty If and only if you're using C++17 assuming the implementation supports it etc.. it's part of the standard now at least. Windows allocators are generally thin wrappers around `HeapAlloc` which does not itself support alignment (why MS couldn't do a `HeapAllocEx` they won't say...) and the standard requires that it can release via `std::free` which isn't possible there. – Mgetz Jun 09 '21 at 15:37
  • 1
    C++17 does make `std::vector<__m256>` work properly now (i.e. respecting the alignment of types with more than `alignof(max_align_t)` alignment requirements). But normally you want `std::vector` and `_mm256_load_ps` (or `loadu`) on that data. ([Is it good or bad (performance-wise) to use std::vector](https://stackoverflow.com/q/66062171), and semi-related: [Why doesn't gcc resolve \_mm256\_loadu\_pd as single vmovupd?](https://stackoverflow.com/q/52626726) re: loadu efficiency with GCC) – Peter Cordes Jun 10 '21 at 01:28
  • Why haven't you inherited from std::allocator ? When I tried it everything went crashing down. Are there no Java Like interfaces that tell us what to do ? – JN98ZK Jun 11 '21 at 07:46
  • @JN98ZK I used code above in several of my programs without any crash. What operating system you have? (Windows/Linux/MacOS) Do you compile 32-bit or 64-bit binary? What compiler (MSVC/CLang/GCC)? It could be the case that you're aligning on wrong `N`, alignment `N` should be `>= 8` always and also power of 2, i.e. `N == 2^K`. If you provide wrong `N` value then standard allocation functions will just crash without error. Can you please give us a link to your code that crashes, some minimal example? – Arty Jun 11 '21 at 08:15
  • @JN98ZK Also, I didn't inherit from std::allocator so as not to pollute the class with wrong functions. Some methods of std::allocator could be reused and some not, but C++ doesn't let you inherit only some of the methods. Hence I decided to reimplement all methods from scratch. – Arty Jun 11 '21 at 08:16
  • Your code works fine for me. I was trying to get my head around C++. I am just heavily influenced by Java/Python concepts and was simply tinkering around with it. Do you have any book recommendations on such advanced C++ topics? – JN98ZK Jun 11 '21 at 09:11
  • @JN98ZK I think I didn't even read any book about C++, mostly I learned all advanced features just by Googling many topics. Also solving contest problems on many sites like [LeetCode](https://leetcode.com/problemset/all/). – Arty Jun 11 '21 at 09:18