
It is known that GCC/Clang auto-vectorize loops well using SIMD instructions.

It is also known that the standard C++ alignas() specifier, among other uses, allows aligning stack variables, as in the following code:

Try it online!

#include <cstdint>
#include <iostream>

int main() {
    alignas(1024) int x[3] = {1, 2, 3};
    alignas(1024) int (&y)[3] = *(&x);

    std::cout << uint64_t(&x) % 1024 << " "
        << uint64_t(&x) % 16384 << std::endl;
    std::cout << uint64_t(&y) % 1024 << " "
        << uint64_t(&y) % 16384 << std::endl;
}

Outputs:

0 9216
0 9216

which means that both x and y are aligned on the stack to 1024 bytes but not to 16384 bytes.

Let's now look at another piece of code:

Try it online!

#include <cstdint>

void f(uint64_t * x, uint64_t * y) {
    for (int i = 0; i < 16; ++i)
        x[i] ^= y[i];
}

If compiled with the -std=c++20 -O3 -mavx512f flags, GCC produces the following asm code (excerpt):

        vmovdqu64       zmm1, ZMMWORD PTR [rdi]
        vpxorq  zmm0, zmm1, ZMMWORD PTR [rsi]
        vmovdqu64       ZMMWORD PTR [rdi], zmm0
        vmovdqu64       zmm0, ZMMWORD PTR [rsi+64]
        vpxorq  zmm0, zmm0, ZMMWORD PTR [rdi+64]
        vmovdqu64       ZMMWORD PTR [rdi+64], zmm0

which performs an AVX-512 unaligned load + xor + unaligned store twice. So our 64-bit array-xor operation was auto-vectorized by GCC using AVX-512 registers, and the loop was unrolled too.

My question is how to tell GCC that the pointers x and y passed to the function are both aligned to 64 bytes, so that instead of the unaligned load (vmovdqu64) in the code above, I can force GCC to use an aligned load (vmovdqa64). It is said that aligned loads/stores can be considerably faster.

My first attempt to force GCC to use aligned loads/stores was the following code:

Try it online!

#include <cstdint>

void  g(uint64_t (&x_)[16],
        uint64_t const (&y_)[16]) {

    alignas(64) uint64_t (&x)[16] = x_;
    alignas(64) uint64_t const (&y)[16] = y_;

    for (int i = 0; i < 16; ++i)
        x[i] ^= y[i];
}

but this code still produces an unaligned load (vmovdqu64), same as in the asm of the previous snippet. Hence this alignas(64) hint does nothing to improve the assembly GCC generates.

My question is how do I force GCC to auto-vectorize with aligned loads/stores, short of manually writing SIMD intrinsics such as _mm512_load_epi64() for every operation?

If possible, I need solutions for all of GCC/Clang/MSVC.

Arty
  • The aligned load *instruction* is not required to make use of aligned loads: if the address is aligned, the load is aligned. See e.g. [choice between aligned vs. unaligned x86 SIMD instructions](https://stackoverflow.com/q/52147378/555045) – harold Nov 20 '21 at 12:19
  • @harold Do you mean that if the assembly contains the unaligned `vmovdqu64` instruction and my pointer is aligned, then this instruction will be handled inside the CPU as an aligned access and run at the same speed as the aligned one? Does that mean manually using the aligned `vmovdqa64` will not speed up anything at all, not a bit? Why then was an aligned instruction introduced in the CPU, if it gives not even a bit of speedup? – Arty Nov 20 '21 at 12:33
  • `vmovdqa64` has a modest role as guarding against accidental misalignment. Back in the day (Core 2 era and earlier) `movdqu` with an aligned address used to be significantly less efficient than `movdqa`, so back then it made more sense that they were separate instructions. – harold Nov 20 '21 at 12:49
  • @Arty It appears it was introduced to be faster for older processors but it is not really useful anymore. The instructions are kept for backward compatibility. So yes, there should be no speed-up as long as you do not target old architectures AND you enable AVX so as to use the VEX prefix (AVX is not enabled by default in GCC/Clang/VS). The benefit of the VEX prefix should only appear if your code is bounded by instruction decoding, which is not very frequent for good SIMD code on newer processors (unless the loops are aggressively unrolled with a lot of loads/stores). – Jérôme Richard Nov 20 '21 at 12:59

3 Answers


@MarcStevens has just suggested a working solution to my question, using __builtin_assume_aligned:

Try it online!

#include <cstdint>

void f(uint64_t * x_, uint64_t * y_) {
    uint64_t * x = (uint64_t *)__builtin_assume_aligned(x_, 64);
    uint64_t * y = (uint64_t *)__builtin_assume_aligned(y_, 64);

    for (int i = 0; i < 16; ++i)
        x[i] ^= y[i];
}

It actually produces code with the aligned vmovdqa64 instruction.

But only GCC produces the aligned instruction. Clang still uses the unaligned one (see here); also, Clang uses AVX-512 registers only with more than 16 elements.

So Clang and MSVC solutions are still welcome.

Arty
  • Clang does "understand" `__builtin_assume_aligned`; for `-march=icelake-client` (which for now implies `-mprefer-vector-width=256`) it uses `vmovaps`. https://godbolt.org/z/shq9fr6GT. Why are you worried about the asm not using `vmovdqa64`? Do you want to detect accidental misalignment? `__builtin_assume_aligned` makes sure future compiler versions won't, for example, make asm that goes scalar until an alignment boundary, regardless of whether or not it chooses to bother with different instructions for the aligned case. (Because there's no perf difference at all.) – Peter Cordes Nov 20 '21 at 19:11
  • @PeterCordes The reason why I bother about using strictly aligned load/store is due to initial Question that I asked. And initially I asked that question only because I thought that on modern CPUs aligned load/store is faster. But as you and other people said, aligned load/store instructions are exactly same in speed as unaligned, so then it closes my initial Question, because I wanted to answer it only because keeping in mind extra possiblity in speed, which is not the case. But just to make clean Question/Answer, even if it is silly, I still made an Answer, now only just out of curiosity. – Arty Nov 20 '21 at 19:18
  • It's not silly to post about `__builtin_assume_aligned`, that's actually important for GCC 10 and earlier: https://godbolt.org/z/hTeqaxa8v (assume_aligned(16)) / [Why doesn't gcc resolve \_mm256\_loadu\_pd as single vmovupd?](https://stackoverflow.com/q/52626726). So yes it is useful for the compiler to know alignment, just not for the exact reason you thought. `vmovdqu64` only slows down if the data happens to be misaligned at runtime, instead of faulting. – Peter Cordes Nov 20 '21 at 19:31

Though not portable across all compilers, __builtin_assume_aligned will tell GCC to assume the pointers are aligned.

I often use a different, more portable strategy based on a helper struct:

#include <array>    // std::array
#include <cstddef>  // size_t
#include <cstdint>  // uint64_t

template<size_t Bits>
struct alignas(Bits/8) uint64_block_t
{
    static const size_t bits = Bits;
    static const size_t size = bits/64;
    
    std::array<uint64_t,size> v;
    
    uint64_block_t& operator&=(const uint64_block_t& v2) { for (size_t i = 0; i < size; ++i) v[i] &= v2.v[i]; return *this; }
    uint64_block_t& operator^=(const uint64_block_t& v2) { for (size_t i = 0; i < size; ++i) v[i] ^= v2.v[i]; return *this; }
    uint64_block_t& operator|=(const uint64_block_t& v2) { for (size_t i = 0; i < size; ++i) v[i] |= v2.v[i]; return *this; }
    uint64_block_t operator&(const uint64_block_t& v2) const { uint64_block_t tmp(*this); return tmp &= v2; }
    uint64_block_t operator^(const uint64_block_t& v2) const { uint64_block_t tmp(*this); return tmp ^= v2; }
    uint64_block_t operator|(const uint64_block_t& v2) const { uint64_block_t tmp(*this); return tmp |= v2; }
    uint64_block_t operator~() const { uint64_block_t tmp; for (size_t i = 0; i < size; ++i) tmp.v[i] = ~v[i]; return tmp; }
    bool operator==(const uint64_block_t& v2) const { for (size_t i = 0; i < size; ++i) if (v[i] != v2.v[i]) return false; return true; }
    bool operator!=(const uint64_block_t& v2) const { for (size_t i = 0; i < size; ++i) if (v[i] != v2.v[i]) return true; return false; }
    
    bool get_bit(size_t c) const   { return (v[c/64]>>(c%64))&1; }
    void set_bit(size_t c)         { v[c/64] |= uint64_t(1)<<(c%64); }
    void flip_bit(size_t c)        { v[c/64] ^= uint64_t(1)<<(c%64); }
    void clear_bit(size_t c)       { v[c/64] &= ~(uint64_t(1)<<(c%64)); }
    void set_bit(size_t c, bool b) { v[c/64] &= ~(uint64_t(1)<<(c%64)); v[c/64] |= uint64_t(b ? 1 : 0)<<(c%64); }
    size_t hammingweight() const   { size_t w = 0; for (size_t i = 0; i < size; ++i) w += mccl::hammingweight(v[i]); return w; }
    bool parity() const            { uint64_t x = 0; for (size_t i = 0; i < size; ++i) x ^= v[i]; return mccl::hammingweight(x)%2; }
};

and then converting the pointer-to-uint64_t into a pointer to this struct using reinterpret_cast.

Converting a loop over uint64_t into a loop over these blocks typically auto-vectorizes very well.
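For illustration, here is a trimmed-down, self-contained sketch of the idea (the popcount-dependent members are omitted), with the blocks used by value so that no cast is needed at all:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Minimal version of the helper struct above: alignas(Bits/8) is what
// lets the compiler assume aligned SIMD loads/stores for the loop.
template<std::size_t Bits>
struct alignas(Bits / 8) uint64_block_t
{
    static constexpr std::size_t size = Bits / 64;

    std::array<uint64_t, size> v;

    uint64_block_t& operator^=(const uint64_block_t& v2)
    {
        for (std::size_t i = 0; i < size; ++i) v[i] ^= v2.v[i];
        return *this;
    }
};

// XOR over an array of 512-bit blocks; each block is 64-byte aligned.
void xor_blocks(uint64_block_t<512>* x,
                const uint64_block_t<512>* y, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) x[i] ^= y[i];
}
```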

Marc Stevens
  • `std::assume_aligned` is the portable way to access `__builtin_assume_aligned` – Alex Guteniev Nov 20 '21 at 14:16
  • `reinterpret_cast`ing like that is UB though. – yuri kilochek Nov 20 '21 at 14:23
  • @Yuri Q: is it really UB for a pointer to a contiguous array of uint64_t that is guaranteed to have that alignment? – Marc Stevens Nov 20 '21 at 14:26
  • @Marc the alignment is irrelevant. There is no object of type `uint64_block_t` at the location pointed to by that pointer, so you aren't allowed to dereference it. – yuri kilochek Nov 20 '21 at 14:34
  • @Yuri, would the converse be allowed, i.e. not UB? Meaning, having a pointer to uint64_block_t<512> and recast it to a pointer to uint64_t? – Marc Stevens Nov 20 '21 at 15:49
  • @Marc yes, but then you're only allowed to access the first `Bits / 64` elements, i.e. the ones that are within the array within `std::array` within the first block. – yuri kilochek Nov 20 '21 at 16:22
  • @Yuri, I think I would disagree with that if the block is guaranteed to have no padding and be exactly an array of uint64_t. If you're given a pointer to an array of blocks and you're allowed to reinterpret_cast to uint64_t* and access the uint64_t within each then it stands to reason you can chain those accesses to a contiguous region of uint64_t. At each accessed memory location there is then precisely an uint64_t (as a member of a block). – Marc Stevens Nov 20 '21 at 16:33
  • The problem is likely the [strict aliasing rule](https://en.cppreference.com/w/cpp/language/reinterpret_cast) here, and more specifically the fact that the types are likely not "similar". The specification is not very clear on this point, but it appears this case is not explicitly accepted, so it theoretically results in UB. In practice, I think `std::array` *could* have a stronger alignment requirement than its content (although I am not aware of any compiler doing that). AFAIK, GCC uses a `may_alias` tag to ensure that there is no problem with x86/x86-64 SIMD types. – Jérôme Richard Nov 20 '21 at 17:38
  • Reading up on the strict aliasing rule, there is the notion of pointer-interconvertibility, which allows converting a pointer to a (standard-layout) object into a pointer to its first member. This is actually treated on cppreference under static_cast, so reinterpret_cast might not be needed to go from a block to a uint64_t pointer. But a reinterpret_cast would be needed to go back and recover a pointer to a block. – Marc Stevens Nov 20 '21 at 18:04
  • 1
    @Marc The fact that memory is laid out exactly the same way as a single big array of ints doesn't matter. C++ abstract machine pointers are not simply addressed into linear memory (though in practice they are of course implemented like that almost universally), but a distinct concept with a specified behavior. There are specific situations when a pointer can be incremented, and this is not one of them. See https://stackoverflow.com/questions/42420116/stdcomplextn-and-tn2-type-aliasing (actually I was wrong about accessing the ints in the first block, that's not allowed either). – yuri kilochek Nov 20 '21 at 18:16

As I infer from your own answer, you're interested in an MSVC solution too.

MSVC understands the proper use of alignas as well as its own __declspec(align); it also understands __builtin_assume_aligned, but it intentionally does nothing with known alignment.

My report closed as "Duplicate":

The related reports closed as "Not a bug":

MSVC does still take advantage of the alignment of global variables, if it can observe that the pointer points to the global variable. Even this does not work in every case.
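A minimal sketch of that global-variable case (identifiers are mine, for illustration): when the compiler sees the globals directly, rather than through a pointer parameter, their alignas(64) is known at the use site.

```cpp
#include <cstdint>

// 64-byte-aligned globals: when the compiler accesses these directly
// (not through an opaque pointer parameter), the alignment is known
// and aligned vector loads/stores may be used.
alignas(64) uint64_t g_x[16];
alignas(64) uint64_t g_y[16];

void xor_globals() {
    for (int i = 0; i < 16; ++i)
        g_x[i] ^= g_y[i];
}
```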

Alex Guteniev
  • Thanks for the MSVC info, up-voted. Do you know, then, whether Clang/MSVC solutions for my question exist? Clang also ignores `__builtin_assume_aligned()`, as you can see by the link in my answer. – Arty Nov 20 '21 at 14:23
  • @Arty, global variables work on Clang, still not on MSVC: https://godbolt.org/z/8YGjboMYq – Alex Guteniev Nov 20 '21 at 14:36
  • Local variables also work on Clang, but only starting from a loop of 1024 elements instead of 16, [see here](https://godbolt.org/z/x9jGPzaq1) for an example. – Arty Nov 20 '21 at 14:43