
Why does the following code result in unaligned AVX instructions (MOVUPD instead of MOVAPD)? I compiled it with Visual Studio 2015. How can I tell the compiler that my data is indeed aligned?

    const size_t ALIGN_SIZE = 64;
    const size_t ARRAY_SIZE = 1024;

    double __declspec(align(ALIGN_SIZE)) a[ARRAY_SIZE];
    double __declspec(align(ALIGN_SIZE)) b[ARRAY_SIZE];

    // Calculate the dot product
    __m256d ymm0 = _mm256_set1_pd(0.0);
    for (int i = 0; i < ARRAY_SIZE; i += 8)
    {
        __m256d ymm1 = _mm256_load_pd(a + i); 
        __m256d ymm2 = _mm256_load_pd(b + i);
        __m256d ymm3 = _mm256_mul_pd(ymm1, ymm2);
        ymm0 = _mm256_add_pd(ymm3, ymm0);

        __m256d ymm4 = _mm256_load_pd(a + i + 4);
        __m256d ymm5 = _mm256_load_pd(b + i + 4);
        __m256d ymm6 = _mm256_mul_pd(ymm4, ymm5);
        ymm0 = _mm256_add_pd(ymm6, ymm0);
    }



Assembly of the loop: 
00007FF7AC7A1400  vmovupd     ymm1,ymmword ptr [rbp+rax*8+2020h]  
00007FF7AC7A1409  vmulpd      ymm3,ymm1,ymmword ptr [rbp+rax*8+20h]  
00007FF7AC7A140F  vmovupd     ymm2,ymmword ptr [rbp+rax*8]  
00007FF7AC7A1415  vmulpd      ymm0,ymm2,ymmword ptr b[rax*8]  
00007FF7AC7A141E  add         r8d,8  
00007FF7AC7A1422  movsxd      rax,r8d  
00007FF7AC7A1425  vaddpd      ymm1,ymm0,ymm4  
00007FF7AC7A1429  vaddpd      ymm4,ymm1,ymm3  
00007FF7AC7A142D  cmp         rax,400h  
00007FF7AC7A1433  jb          main+70h (07FF7AC7A1400h)  
    It doesn't really matter - there is virtually no penalty for using unaligned loads with aligned data in modern CPUs - the compiler writers probably just decided to always use unaligned loads rather than having additional logic to decide when to use aligned versus unaligned loads. – Paul R Apr 19 '16 at 06:49
  • FWIW gcc *et al* do the right thing, so it looks like this is just a Microsoft-specific quirk. – Paul R Apr 19 '16 at 07:05
  • 4
    @PaulR, why use the word virtual? There is no penalty at all that I am aware of. `vmovapd` is obsolete. `mvovapd` is still useful on nehalem because `movupd` cannot fold with other operations but I doubt this makes much of a difference in practice. Maybe that's what you meant by virtual but in that case it only applies to Nehalem and this answer is clearly not compiled for Nehalem. – Z boson Apr 19 '16 at 07:09
  • @Zboson: well spotted - I was actually just hedging my bets in case I'd forgotten some corner case or other where it might make a difference! – Paul R Apr 19 '16 at 07:11
  • Thank you. I was expecting the difference to be minimal, but not zero. – Laci Apr 19 '16 at 07:16
  • 2
    @PaulR, Clang is like MSVC in this case. GCC adds a bunch of code if it can't assume the pointer is aligned so informing the compiler a pointer is aligned is only useful for GCC since AVX. GCC gives much better results on pre-Nehalam processors. However, if you tell Clang a pointer is aligned it does well also on pre-Nahalem procesors. MSVC is just bad on pre-Nehalem processors. – Z boson Apr 19 '16 at 07:16
  • 2
    You can see a lot more details [here](http://stackoverflow.com/q/33504003/2542702). – Z boson Apr 19 '16 at 07:18
  • 1
    Thanks - I just revisited the question you linked to and I see you've now added a second, summary answer - very useful. – Paul R Apr 19 '16 at 07:20
  • 1
    @PaulR, thanks paul. The only information I did not add in that answer is that Clang does well on pre-Nehalm if you tell it that the pointer is aligned. In fact, if I recall correctly, Clang does even better than GCC because it unrolls the loop four times whereas GCC does not unroll by default. – Z boson Apr 19 '16 at 07:25

1 Answer


There is a way to solve this problem: it makes the compiler use the VMOVDQA instruction (the integer analogue of MOVAPD) instead of MOVUPD:

inline __m256d Load(const double * p)
{
#ifdef _MSC_VER
    return _mm256_castsi256_pd(_mm256_load_si256((__m256i*)p));
#else
    return _mm256_load_pd(p);
#endif
}

The analogous solution for the float type:

inline __m256 Load(const float * p)
{
#ifdef _MSC_VER
    return _mm256_castsi256_ps(_mm256_load_si256((__m256i*)p));
#else
    return _mm256_load_ps(p);
#endif
}

But in order to trick the Visual Studio compiler you have to use dynamically allocated pointers; otherwise the compiler does not emit the VMOVDQA instruction.

#include <immintrin.h>

int main()
{
    float * ps = (float*)_mm_malloc(40, 32);
    double * pd = (double*)_mm_malloc(40, 32);

    __m256 s = Load(ps);
//00007FF79FF81325  vmovdqa     ymm1,ymmword ptr [rdi]  
    __m256d d = Load(pd);
//00007FF79FF8132F  vmovdqa     ymm0,ymmword ptr [rax]

    _mm256_storeu_ps(ps, s);
    _mm256_storeu_pd(pd, d);

    _mm_free(ps);
    _mm_free(pd);
}
ErmIg