Trying to understand alignment in relation to SIMD load operations, I am slightly confused by the behavior of the following example code:

double vec[4] = { 2.6, 0.0, 0.0, 0.0 };
auto reg = _mm256_load_pd(&vec[0]);

Here I define and initialize an array vec of four doubles. It gets the default alignment of 8 bytes, which makes sense, since the size of double is 8 bytes. Printing the addresses &vec[i] of the elements confirms that they are 8-byte aligned:

0x000000E05D4FF968
0x000000E05D4FF970
0x000000E05D4FF978
0x000000E05D4FF980

So far, so good. However, the intrinsic function _mm256_load_pd(), which I call next, expects an address that is 256-bit (32-byte) aligned. The address of the first array element, 0x000000E05D4FF968, is not divisible by 32 (0x000000E05D4FF968 % 32 = 8), but the code runs with no problems.
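The modulo check above can also be done in code rather than by hand. This is a minimal sketch; `misalignment` and `print_misalignment` are hypothetical helper names introduced here for illustration and are not part of the original example or of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iostream>

// Hypothetical helper: how many bytes `address` lies past the previous
// multiple of `alignment`. A result of 0 means the address is aligned.
std::size_t misalignment(std::uintptr_t address, std::size_t alignment)
{
    assert(alignment != 0); // guard against division by zero
    return static_cast<std::size_t>(address % alignment);
}

// Hypothetical helper: print each element's address and its offset
// from a 32-byte boundary.
void print_misalignment(const double* data, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i)
    {
        const auto address = reinterpret_cast<std::uintptr_t>(&data[i]);
        std::cout << static_cast<const void*>(&data[i])
                  << " % 32 = " << misalignment(address, 32) << '\n';
    }
}
```

Calling `print_misalignment(vec, 4)` on the array from the question prints one line per element. The first element is suitable for an alignment-requiring 32-byte load only if its line ends in 0.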

So my main question is: how is this possible?

UPDATE:

Here is a minimal reproducible example. Of course, the addresses will differ from run to run, but it is not difficult to catch a case where the first address is not 32-byte aligned.

#include <iostream>
#include <immintrin.h>

int main()
{
    double vec[4] = { 2.6, 0.0, 0.0, 0.0 };
    auto reg = _mm256_load_pd(&vec[0]);

    for (int i = 0; i < 4; ++i)
    {
        std::cout << &vec[i] << std::endl;
    }

    return 0;
}
nickname
  • @user17732522, I am sorry, I just noticed that the address of the 1st element is actually not 32 byte aligned and had to edit the question. Now it's updated. – nickname Aug 29 '22 at 18:29
  • I use MSVC compiler (v143) on Windows 10. The address values were printed with `std::cout << &vec[i]`. – nickname Aug 29 '22 at 18:32
  • OK, then what I mentioned in my previous comments does not apply. I suggest you add a [mre] demonstrating how it is working. Unfortunately I can't answer your question, but it may help someone else see what is going on. – user17732522 Aug 29 '22 at 18:35
  • MSVC doesn't use alignment-required vector-`mov` instructions in asm, even if you use the corresponding intrinsics. (Also, your code doesn't use the result so it would normally get optimized away, although in a debug build it would happen.) – Peter Cordes Aug 29 '22 at 18:41
  • Yeah, it is fairly random. Might depend on architecture. Just use `alignas`. – ALX23z Aug 29 '22 at 18:50
  • BTW, [Aligned and unaligned memory access with AVX/AVX2 intrinsics](https://stackoverflow.com/q/31089502) is another way for alignment-required load intrinsics to compile to asm that doesn't require alignment. (At least with optimization enabled. In your MCVE, enabling optimization would hopefully optimize away the `auto reg` variable entirely, and the load to init it, since it's not `volatile`.) – Peter Cordes Aug 29 '22 at 19:06
  • I appreciate your help! I have compared two compilers (MSVC and gcc) and indeed, MSVC generates `vmovupd` in both cases: with `_mm256_load_pd()` as well as with `_mm256_loadu_pd()`. Compiling my code with gcc produces the `vmovapd` instruction, and the program does not run (as expected). To make it run, I used `alignas(32) double vec[4]` as suggested above and it worked. – nickname Aug 29 '22 at 20:24
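The two portable options mentioned in the comments can be sketched as follows: either align the array with `alignas(32)` and keep `_mm256_load_pd()`, or leave the array as it is and use `_mm256_loadu_pd()`. The function names `first_lane_aligned` and `first_lane_unaligned` are hypothetical and exist only for this illustration. The intrinsics are guarded by `__AVX__`, on the assumption that the compiler defines it when AVX code generation is enabled (for example `-mavx` with gcc or `/arch:AVX` with MSVC); without it, the sketch falls back to a plain scalar read.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Option 1: request 32-byte alignment, so the alignment-requiring load
// is valid no matter which instruction the compiler emits for it.
double first_lane_aligned()
{
    alignas(32) double vec[4] = { 2.6, 0.0, 0.0, 0.0 };
    assert(reinterpret_cast<std::uintptr_t>(vec) % 32 == 0);
#if defined(__AVX__)
    const __m256d reg = _mm256_load_pd(vec);
    alignas(32) double out[4];
    _mm256_store_pd(out, reg);   // aligned store back to memory
    return out[0];
#else
    return vec[0];               // scalar fallback when AVX is not enabled
#endif
}

// Option 2: make no alignment assumption and use the unaligned load.
// `p` must point to at least four readable doubles.
double first_lane_unaligned(const double* p)
{
#if defined(__AVX__)
    const __m256d reg = _mm256_loadu_pd(p);
    double out[4];
    _mm256_storeu_pd(out, reg);  // unaligned store back to memory
    return out[0];
#else
    return p[0];                 // scalar fallback when AVX is not enabled
#endif
}
```

Both functions return the first element of the loaded vector, so either can stand in for the load in the example above. The difference is only in who guarantees the alignment: the declaration in the first case, nobody in the second.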
