21

The following question is related; however, its answers are old, and a comment from user Marc Glisse suggests that there are new approaches to this problem since C++17 that might not be adequately discussed there.

I'm trying to get aligned memory working properly for SIMD, while still having access to all of the data.

On Intel, if I create a vector whose element type is __m256 and reduce its size by a factor of 8, I get aligned memory.

E.g. std::vector<__m256> mvec_a((N*M)/8);

In a slightly hacky way, I can cast pointers to vector elements to float, which allows me to access individual float values.

Instead, I would prefer to have an std::vector<float> which is correctly aligned, and thus can be loaded into __m256 and other SIMD types without segfaulting.

I've been looking into aligned_alloc.

This can give me a C-style array that is correctly aligned:

auto align_sz = static_cast<std::size_t> (32);
float* marr_a = (float*)aligned_alloc(align_sz, N*M*sizeof(float));

However I'm unsure how to do this for std::vector<float>. Giving the std::vector<float> ownership of marr_a doesn't seem to be possible.
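For reference, the aligned_alloc buffer above can at least be given automatic cleanup without a vector. The following is only a sketch under assumptions: `FreeDeleter`, `AlignedFloats`, and `make_aligned_floats` are hypothetical names, and it relies on `std::aligned_alloc` from C++17, which some standard libraries (notably MSVC's) do not provide. The buffer is fixed-size and uninitialized, so it is not a substitute for a growing `std::vector<float>`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <memory>
#include <new>

// Deleter that releases memory obtained from std::aligned_alloc.
struct FreeDeleter {
    void operator()(void* p) const noexcept { std::free(p); }
};

// Hypothetical alias: an owning pointer to an aligned float array.
using AlignedFloats = std::unique_ptr<float[], FreeDeleter>;

// Hypothetical helper: allocates room for `count` floats, aligned to
// `alignment` bytes (must be a power of two, 32 suits __m256).
AlignedFloats make_aligned_floats(std::size_t count, std::size_t alignment = 32) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);

    // std::aligned_alloc expects the size to be a multiple of the alignment,
    // so round the byte count up (and never request zero bytes).
    std::size_t bytes = count * sizeof(float);
    bytes = (bytes + alignment - 1) / alignment * alignment;
    if (bytes == 0) {
        bytes = alignment;
    }

    void* raw = std::aligned_alloc(alignment, bytes);
    if (raw == nullptr) {
        throw std::bad_alloc();
    }
    return AlignedFloats(static_cast<float*>(raw));
}

// Usage sketch: the memory is freed automatically when `data` goes out of scope.
void aligned_buffer_demo() {
    AlignedFloats data = make_aligned_floats(1000);
    data[0] = 1.0f;
    assert(reinterpret_cast<std::uintptr_t>(data.get()) % 32 == 0);
}
```

This keeps ownership and alignment, but gives up everything else std::vector offers (size tracking, growth, iterators), which is why an allocator is still the usual route.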

I've seen some suggestions that I should write a custom allocator, but this seems like a lot of work, and perhaps with modern C++ there is a better way?

Paul R
Prunus Persica
  • 1
    *without segfaulting*... or without potential slowdowns from cache-line splits when you use `_mm256_loadu_ps(&vec[i])`. (Although note that with default tuning options, GCC [splits not-guaranteed-aligned 256-bit loads/stores](https://stackoverflow.com/questions/52626726/why-doesnt-gcc-resolve-mm256-loadu-pd-as-single-vmovupd) into vmovups xmm / vinsertf128. So there *is* an advantage to using `_mm256_load` over `loadu` if you care about how your code compiles on GCC if someone forgets to use `-mtune=...` or `-march=` options.) – Peter Cordes Feb 11 '20 at 13:29
  • @PrunusPersica Did you end up getting this to work ? I have the same problem. We can work together if you wish ? – gansub Aug 25 '20 at 15:39
  • 1
    @gansub I ended up using the code of `boost::alignment::aligned_allocator`. Then I could allocate the vector with `std::vector<T, boost::alignment::aligned_allocator<T, Alignment>>`. It does make normal `std::vector`s not directly compatible with this type of aligned vector, but you can always write ways around that. – Prunus Persica Aug 28 '20 at 17:28

2 Answers

7

STL containers take an allocator template argument which can be used to align their internal buffers. The specified allocator type has to implement at least allocate, deallocate, and value_type.

In contrast to these answers, this implementation of such an allocator avoids platform-dependent aligned malloc calls. Instead, it uses the C++17 aligned new operator.

Here is the full example on godbolt.

#include <cstddef>
#include <limits>
#include <new>

/**
 * Returns aligned pointers when allocations are requested. Default alignment
 * is 64B = 512b, sufficient for AVX-512 and most cache line sizes.
 *
 * @tparam ALIGNMENT_IN_BYTES Must be a positive power of 2.
 */
template<typename    ElementType,
         std::size_t ALIGNMENT_IN_BYTES = 64>
class AlignedAllocator
{
private:
    static_assert(
        ALIGNMENT_IN_BYTES >= alignof( ElementType ),
        "Beware that types like int have minimum alignment requirements "
        "or access will result in crashes."
    );

public:
    using value_type = ElementType;
    static std::align_val_t constexpr ALIGNMENT{ ALIGNMENT_IN_BYTES };

    /**
     * This is only necessary because AlignedAllocator has a second template
     * argument for the alignment that will make the default
     * std::allocator_traits implementation fail during compilation.
     * @see https://stackoverflow.com/a/48062758/2191065
     */
    template<class OtherElementType>
    struct rebind
    {
        using other = AlignedAllocator<OtherElementType, ALIGNMENT_IN_BYTES>;
    };

public:
    constexpr AlignedAllocator() noexcept = default;

    constexpr AlignedAllocator( const AlignedAllocator& ) noexcept = default;

    template<typename U>
    constexpr AlignedAllocator( AlignedAllocator<U, ALIGNMENT_IN_BYTES> const& ) noexcept
    {}

    [[nodiscard]] ElementType*
    allocate( std::size_t nElementsToAllocate )
    {
        if ( nElementsToAllocate
             > std::numeric_limits<std::size_t>::max() / sizeof( ElementType ) ) {
            throw std::bad_array_new_length();
        }

        auto const nBytesToAllocate = nElementsToAllocate * sizeof( ElementType );
        return reinterpret_cast<ElementType*>(
            ::operator new[]( nBytesToAllocate, ALIGNMENT ) );
    }

    void
    deallocate(                  ElementType* allocatedPointer,
                [[maybe_unused]] std::size_t  nElementsAllocated )
    {
        /* According to the C++20 draft n4868 § 17.6.3.3, the delete operator
         * must be called with the same alignment argument as the new expression.
         * The size argument can be omitted but if present must also be equal to
         * the one used in new. */
        ::operator delete[]( allocatedPointer, ALIGNMENT );
    }
};

This allocator can then be used like this:

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <stdexcept>
#include <vector>

template<typename T, std::size_t ALIGNMENT_IN_BYTES = 64>
using AlignedVector = std::vector<T, AlignedAllocator<T, ALIGNMENT_IN_BYTES> >;

int
main()
{
    AlignedVector<int, 1024> buffer( 3333 );
    if ( reinterpret_cast<std::uintptr_t>( buffer.data() ) % 1024 != 0 ) {
        std::cerr << "Vector buffer is not aligned!\n";
        throw std::logic_error( "Faulty implementation!" );
    }

    std::cout << "Successfully allocated an aligned std::vector.\n";
    return 0;
}
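For the float case from the question, the same pattern can be instantiated with 32-byte alignment. The sketch below is an illustration under assumptions rather than the code from the godbolt link: `CompactAlignedAllocator` is a hypothetical, trimmed copy of the allocator above so that the example compiles on its own, with the equality operators that the allocator requirements ask for added. It checks `data()` after every `push_back`, so the alignment is verified across reallocations and not only for the initial buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

// Hypothetical trimmed copy of AlignedAllocator (same design, fewer comments).
template<typename T, std::size_t ALIGNMENT_IN_BYTES = 64>
struct CompactAlignedAllocator
{
    static_assert( ALIGNMENT_IN_BYTES >= alignof( T ),
                   "The alignment must satisfy the element type's own requirement." );

    using value_type = T;
    static constexpr std::align_val_t ALIGNMENT{ ALIGNMENT_IN_BYTES };

    template<class U>
    struct rebind
    {
        using other = CompactAlignedAllocator<U, ALIGNMENT_IN_BYTES>;
    };

    constexpr CompactAlignedAllocator() noexcept = default;

    template<typename U>
    constexpr CompactAlignedAllocator( CompactAlignedAllocator<U, ALIGNMENT_IN_BYTES> const& ) noexcept
    {}

    [[nodiscard]] T*
    allocate( std::size_t nElements )
    {
        if ( nElements > static_cast<std::size_t>( -1 ) / sizeof( T ) ) {
            throw std::bad_array_new_length();
        }
        return static_cast<T*>( ::operator new[]( nElements * sizeof( T ), ALIGNMENT ) );
    }

    void
    deallocate( T* pointer, std::size_t /* nElements */ ) noexcept
    {
        ::operator delete[]( pointer, ALIGNMENT );
    }
};

/* The allocator is stateless, so any two instances with the same alignment
 * can free each other's memory and therefore compare equal. */
template<typename T, typename U, std::size_t A>
constexpr bool
operator==( CompactAlignedAllocator<T, A> const&, CompactAlignedAllocator<U, A> const& ) noexcept
{
    return true;
}

template<typename T, typename U, std::size_t A>
constexpr bool
operator!=( CompactAlignedAllocator<T, A> const&, CompactAlignedAllocator<U, A> const& ) noexcept
{
    return false;
}

/* Returns true if the buffer is 32-byte aligned after every single push_back,
 * i.e., also after each reallocation triggered by growth. */
bool
staysAlignedWhileGrowing( std::size_t nElements )
{
    std::vector<float, CompactAlignedAllocator<float, 32> > values;
    for ( std::size_t i = 0; i < nElements; ++i ) {
        values.push_back( static_cast<float>( i ) );
        if ( reinterpret_cast<std::uintptr_t>( values.data() ) % 32 != 0 ) {
            return false;
        }
    }
    return true;
}

void
growthDemo()
{
    assert( staysAlignedWhileGrowing( 1000 ) );
}
```

A check like this only shows what a given standard library does in practice. It does not turn the placement of elements at the start of the allocation into a guarantee.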
mxmlnkn
  • 2
    C++17 supports over-aligned dynamic allocations, e.g. `std::vector<__m256i>` should Just Work. Is there no way to take advantage of that, instead of using ugly hacks that over-allocate and then leave part of the allocation unused? – Peter Cordes Feb 05 '22 at 00:24
  • @PeterCordes I think this is more a code-style than a performance issue because the overhead, e.g. 511 B, will be smaller than 1% in most cases. Of course, you can simply use something like `reinterpret_cast<ElementType*>( new __m256i[ nBytesToAllocate / sizeof( __m256i ) ] )` as long as the 256-bit alignment is what you want. Using a dummy struct might be more portable though: `struct DummyAligned{ alignas( 512 ) char dummy[512]; };`. But note that this will also result in overallocation if your vector size is not a multiple of the alignment... – mxmlnkn Feb 05 '22 at 10:28
  • It's also extra bookkeeping to keep track of the address to free, separately from the address you're using. That's the main reason I don't like it. – Peter Cordes Feb 05 '22 at 10:49
  • 1
    @PeterCordes Ok, that is totally understandable. After further experimentation and reading, I changed my answer to use the C++17 aligned new/delete operators instead. – mxmlnkn Feb 05 '22 at 10:58
  • @user17732522 I'm already checking for trivial types with the static asserts. I got the new/delete from [here](https://developercommunity.visualstudio.com/t/using-c17-new-stdalign-val-tn-syntax-results-in-er/528320). I'm pretty sure it should fit? The new expression also just calls the new operator under the hood (and additionally calls the constructor), afaik. I didn't wanna use operator new because then I would have to convert the number of elements into number of bytes again with all the required overflow checking for that. – mxmlnkn Feb 05 '22 at 11:58
  • @user17732522 Thanks for the suggestion. I'm now using the new operator instead and can therefore remove the `static_assert`s and reduce the code a bit more. The godbolt link also contains an example with an object with a custom constructor and destructor. – mxmlnkn Feb 05 '22 at 13:00
  • I think the allocator is now correct, but unfortunately as mentioned in the other answer's comments, I don't think that it is guaranteed that `std::vector` will actually place its elements at the beginning of the allocation, which could mess up the alignment. (But I don't think any implementation implements vector that way.) – user17732522 Feb 05 '22 at 13:17
  • @user17732522 Interesting edge case. I guess you would have to write your own container to be 100% sure. Or, add automated tests / asserts like I did in godbolt on the `data()` return value. For my usecase, if it works on all known systems, it works well enough. – mxmlnkn Feb 05 '22 at 13:27
  • 1
    MSVC doesn't like this one in debug builds; see https://stackoverflow.com/q/72238649/15416. – MSalters Oct 11 '22 at 14:27
  • @MSalters Thank you for bringing this problem to my attention and also for pointing to the solution. I could reproduce the problem on godbolt by adding `/MTd` and fixed it by adding all three required constructors, especially the templated conversion constructor for an allocator with a different value type. – mxmlnkn Nov 05 '22 at 09:40
0

All containers in the standard C++ library, including vectors, have an optional template parameter that specifies the container's allocator, and it is not really a lot of work to implement your own one:

class my_awesome_allocator {
};

std::vector<float, my_awesome_allocator> awesomely_allocated_vector;

You will have to write a little bit of code that implements your allocator, but it wouldn't be much more code than you have already written. If you don't need pre-C++17 support, you only need to implement the allocate() and deallocate() methods and declare a value_type member, that's it.
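To make the skeleton above concrete, here is a minimal sketch under stated assumptions. It turns `my_awesome_allocator` into a class template (an assumption, so that the default rebinding in `std::allocator_traits` works), fixes the alignment at 32 bytes for `__m256`, and uses the C++17 aligned `operator new`. The helpers `to_aligned` and `is_aligned` are hypothetical names. Because the aligned vector is a different type from `std::vector<float>`, the sketch also shows copying the elements across.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

// Hypothetical fleshed-out my_awesome_allocator: value_type, allocate(),
// deallocate(), a converting constructor, and equality comparison.
template<typename T>
struct my_awesome_allocator {
    using value_type = T;
    static constexpr std::size_t alignment = 32;  // assumption: enough for __m256
    static_assert(alignment >= alignof(T), "alignment too small for T");

    my_awesome_allocator() noexcept = default;

    template<typename U>
    my_awesome_allocator(const my_awesome_allocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        if (n > static_cast<std::size_t>(-1) / sizeof(T)) {
            throw std::bad_array_new_length();
        }
        return static_cast<T*>(
            ::operator new(n * sizeof(T), std::align_val_t{alignment}));
    }

    void deallocate(T* p, std::size_t) noexcept {
        // Must pass the same alignment that was used for the allocation.
        ::operator delete(p, std::align_val_t{alignment});
    }
};

// Stateless allocator: all instances are interchangeable.
template<typename T, typename U>
bool operator==(const my_awesome_allocator<T>&, const my_awesome_allocator<U>&) noexcept {
    return true;
}

template<typename T, typename U>
bool operator!=(const my_awesome_allocator<T>&, const my_awesome_allocator<U>&) noexcept {
    return false;
}

using aligned_float_vector = std::vector<float, my_awesome_allocator<float>>;

// The two vector types are distinct, so data crosses over by copying elements.
aligned_float_vector to_aligned(const std::vector<float>& plain) {
    return aligned_float_vector(plain.begin(), plain.end());
}

bool is_aligned(const void* p, std::size_t bytes) {
    return reinterpret_cast<std::uintptr_t>(p) % bytes == 0;
}

// Usage sketch.
void aligned_copy_demo() {
    std::vector<float> plain(100, 1.0f);
    aligned_float_vector aligned = to_aligned(plain);
    assert(aligned.size() == plain.size());
    assert(is_aligned(aligned.data(), 32));
}
```

Functions that should accept either kind of vector can take a pointer and a length (or be templated on the allocator) instead of naming `std::vector<float>` directly.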

Sam Varshavchik
  • They also need to specialize [`allocator_traits`](https://en.cppreference.com/w/cpp/memory/allocator_traits) – NathanOliver Feb 11 '20 at 13:43
  • 2
    This might be a good place for a canonical answer with an example that people can copy/paste to jump through C++'s annoying hoops. (Bonus points if there's a way to let std::vector try to realloc in-place instead of the usual braindead C++ always alloc+copy.) Also of course note that this `vector<float, my_awesome_allocator>` is not type-compatible with `vector<float>` (and can't be, because anything that does `.push_back` on a plain `std::vector` compiled without this allocator could do a new allocation and copy into minimally-aligned memory, and new/delete isn't compatible with aligned_alloc/free). – Peter Cordes Feb 11 '20 at 13:43
  • 2
    I don't think there is any guarantee that the pointer returned from the allocator is directly used as the base address of the `std::vector`'s array. For example, I could imagine an implementation of `std::vector` using just one pointer to the allocated memory which stores the end/capacity/allocator in the memory prior to the range of values. That could easily foil the alignment done by the allocator. – Dietmar Kühl Feb 11 '20 at 14:10
  • 2
    Except that `std::vector` guarantees it. That's what it uses it for. Perhaps you should review what the C++ standard specifies here. – Sam Varshavchik Feb 11 '20 at 14:23
  • 1
    > They also need to specialize `allocator_traits` -- No, they don't. All that's needed is to implement a compliant allocator. – Andrey Semashev Feb 11 '20 at 14:26
  • > Bonus points if there's a way to let std::vector try to realloc in-place instead of the usual braindead C++ always alloc+copy. -- There is no way, except to reserve the required capacity first and then insert elements as needed. There are good reasons why `realloc` is not an option. `realloc` does not call constructors and copying raw bytes is not valid for most types. Also, `realloc` usefulness is over-estimated, as most of the time increasing allocation size for any considerable amount is still equivalent to `malloc`+`memcpy`+`free`. – Andrey Semashev Feb 11 '20 at 14:31
  • I can add that there is a good implementation of aligned allocator in Boost.Align: https://www.boost.org/doc/libs/1_72_0/doc/html/align/reference.html#align.reference.classes – Andrey Semashev Feb 11 '20 at 14:37
  • And, on topic of `realloc`, it doesn't necessarily preserve alignment. – Andrey Semashev Feb 11 '20 at 14:40
  • The existence and use of Boost's aligned allocator is sufficient for many needs, though the dependency is unfortunate. `vector<float>` and `vector<float, aligned_allocator<float>>` not being type-compatible is also unfortunate, but as long as the underlying data is still floats it's okay. Writing allocators is new to me. I made [an attempt here](https://gist.github.com/Wheest/beb7849c237c5e90094dd7d060c9b279), but had a return type error – Prunus Persica Feb 12 '20 at 12:13
  • @AndreySemashev "`realloc` usefulness is over-estimated" - Not true. `realloc` on Linux calls `mremap`, which for large buffers is more efficient than copying. – Nemo Jun 27 '23 at 00:54
  • @Nemo This is only possible if the allocated memory is large enough - one or more contiguous pages - and was allocated using raw `mmap` in the first place. Large allocations are pretty rare. And besides, you have to hope that your `realloc` implementation actually does this `mremap` trick, which is not guaranteed. If your performance depends on large reallocations being performed efficiently, you're better off directly using `mmap`/`mremap` instead of relying on `realloc` possibly doing the right thing. You might eliminate the need to allocate memory entirely and e.g. map the data from a file. – Andrey Semashev Jun 27 '23 at 11:59