If `alignas(32) double` compiled, it would require that each element separately have 32-byte alignment, i.e. pad each `double` out to 32 bytes, completely defeating SIMD. (I don't think it will compile, but similar things with GNU C `typedef double da __attribute__((aligned(32)))` do compile that way, with `sizeof(da) == 32`.)
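You can see the padding effect in portable C++ by wrapping a `double` in an over-aligned struct; `padded_double` below is just an illustrative name:

```c++
#include <cstddef>

// A struct whose alignment exceeds its payload gets padded out to that alignment,
// which is exactly what per-element over-alignment would do to an array of doubles.
struct alignas(32) padded_double { double d; };

static_assert(alignof(padded_double) == 32, "requested alignment");
static_assert(sizeof(padded_double) == 32, "8 payload bytes + 24 bytes of padding");
// An array of padded_double wastes 3/4 of its space and isn't contiguous doubles.
```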
See *Modern approach to making std::vector allocate aligned memory* for working code.
As of C++17, `std::vector<__m256d>` would work, but is usually not what you want because it makes scalar access a pain.
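As a sketch of why (this assumes x86 with AVX, compiled with e.g. `-mavx`; the per-lane `v[0][2]` indexing is a GNU extension, and MSVC spells it differently):

```c++
#include <immintrin.h>
#include <cstdio>
#include <vector>

int main() {
    // C++17's aligned operator new respects alignof(__m256d) == 32,
    // so the vector's buffer is properly aligned for AVX.
    std::vector<__m256d> v(4, _mm256_set1_pd(1.0));   // 16 doubles total

    v[0] = _mm256_add_pd(v[0], v[1]);     // whole-vector access is natural

    // Scalar access is the pain point: index the std::vector, then a lane.
    double x = v[0][2];                   // GNU extension; MSVC needs v[0].m256d_f64[2]
    std::printf("%f\n", x);
}
```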
C++ sucks for this in my experience, although there might be a standard (or Boost) allocator that takes an over-alignment you can use as the second (usually defaulted) template param. `std::vector<double, some_aligned_allocator<32>>` still isn't type-compatible with a normal `std::vector`, which makes sense because any function that might reallocate it has to maintain the alignment. But unfortunately that makes it not type-compatible even for passing to functions that only want read-only access to a `std::vector` of `double` elements.
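If you don't have such an allocator handy, here's a minimal sketch of one built on C++17's aligned `operator new`; `aligned_allocator` is my own name for it, not a standard or Boost component:

```c++
#include <cstddef>
#include <new>
#include <vector>

template <class T, std::size_t Align>
struct aligned_allocator {
    using value_type = T;
    static_assert(Align >= alignof(T), "must not under-align T");

    aligned_allocator() = default;
    template <class U>
    aligned_allocator(const aligned_allocator<U, Align>&) noexcept {}

    T* allocate(std::size_t n) {
        // C++17 over-aligned allocation; throws std::bad_alloc on failure.
        return static_cast<T*>(::operator new(n * sizeof(T), std::align_val_t{Align}));
    }
    void deallocate(T* p, std::size_t) noexcept {
        ::operator delete(p, std::align_val_t{Align});
    }

    template <class U>
    struct rebind { using other = aligned_allocator<U, Align>; };
};

// All instances are interchangeable: the allocator is stateless.
template <class T, class U, std::size_t A>
bool operator==(const aligned_allocator<T, A>&, const aligned_allocator<U, A>&) { return true; }
template <class T, class U, std::size_t A>
bool operator!=(const aligned_allocator<T, A>&, const aligned_allocator<U, A>&) { return false; }

// Usage: 32-byte-aligned storage for AVX, but note it's a distinct type from std::vector<double>.
using avx_vector = std::vector<double, aligned_allocator<double, 32>>;
```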
**Cost of misalignment**
In a lot of cases, misaligned data is only a couple percent slower than aligned for AVX/AVX2 loops over an array, if the data is coming from L3 cache or RAM (on recent Intel CPUs); only with 64-byte vectors do you get a significantly bigger penalty (something like 15%, even when memory bandwidth is still the bottleneck). You'd hope the CPU core would have time to absorb the misalignment and keep the same number of outstanding off-core transactions in flight, but it doesn't.
For data hot in L1d, misalignment could hurt more even with 32-byte vectors.
In x86-64 code, `alignof(max_align_t)` is 16 on mainstream C++ implementations, so in practice even a `vector<double>` will end up aligned by at least 16, because the underlying allocator used by `new` always aligns at least that much. But that's very often an odd multiple of 16, at least on GNU/Linux: glibc's allocator (also used by `malloc`) uses `mmap` for large allocations to get a whole range of pages, but it reserves the first 16 bytes for bookkeeping info. This is unfortunate for AVX and AVX-512 because it means your arrays are always misaligned unless you use aligned allocation. (See *How to solve the 32-byte-alignment issue for AVX load/store operations?*)
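It's easy to probe what you actually got; a quick check like the following (illustrative only; results depend on the allocator) typically shows those odd-multiple-of-16 addresses on glibc for large buffers:

```c++
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    std::vector<double> v(1 << 20);   // 8 MiB: big enough that glibc malloc uses mmap
    auto addr = reinterpret_cast<std::uintptr_t>(v.data());
    std::printf("addr %% 16 = %ju, addr %% 32 = %ju\n",
                static_cast<std::uintmax_t>(addr % 16),
                static_cast<std::uintmax_t>(addr % 32));
    // On glibc you'll typically see % 16 == 0 but % 32 == 16, i.e. misaligned for AVX.
}
```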
Mainstream `std::vector` implementations are also inefficient when they have to grow: C++ doesn't provide a `realloc` equivalent that's compatible with `new`/`delete`, so they always have to allocate a new, larger buffer and copy everything to the start of it. They never even try to allocate more space contiguous with the existing mapping (which would be safe even for non-trivially-copyable types), and they don't use implementation-specific tricks like Linux `mremap` to map the same physical pages at a different virtual address without having to copy all those mega/gigabytes. The fact that C++ allows code to redefine `operator new` means library implementations of `std::vector` can't just use a better allocator, either. All of this is a non-problem if you `.reserve` the size you're going to need up front, but it's still pretty dumb.
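For completeness, the `.reserve` workaround in context (nothing here is alignment-specific; it just avoids the regrow-and-copy path):

```c++
#include <cstddef>
#include <vector>

std::vector<double> make_sequence(std::size_t n) {
    std::vector<double> v;
    v.reserve(n);                              // one allocation up front...
    for (std::size_t i = 0; i < n; ++i)
        v.push_back(static_cast<double>(i));   // ...so no push_back ever reallocates
    return v;
}
```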