Is _mm_load_ps a requirement for 128bit aligned structure?

Question

I have a vector structure set up similar to the one below. It is 128-bit aligned, just like the __m128 type.

struct Vector3
{
    union
    {
        float v[4];
        struct { float x, y, z, w; };
    };
};

I am using the SSE4.1 dot product intrinsic _mm_dp_ps. Is it required to use _mm_load_ps for my structure above, which is already 128-bit aligned, or can I cast my structure directly? Is that safe practice?
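Whether the structure really is 128-bit aligned can be checked at compile time. The following is a minimal sketch, assuming a C++11 compiler that supports alignas (on older MSVC versions, __declspec(align(16)) plays the same role). AlignedVec, load_aligned and load_any are hypothetical names used only for illustration. A struct that contains nothing but floats normally has 4-byte alignment unless it is declared otherwise, which is why the sketch requests 16 bytes explicitly.

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h>  // SSE: _mm_load_ps, _mm_loadu_ps

// Hypothetical stand-in for the question's Vector3.
// alignas(16) is what requests the 16-byte alignment.
struct alignas(16) AlignedVec
{
    float v[4];
};

// Fail the build if the layout assumptions do not hold.
static_assert(alignof(AlignedVec) == 16, "AlignedVec must be 16-byte aligned");
static_assert(sizeof(AlignedVec) == 16, "AlignedVec must be 16 bytes wide");

// Aligned load: the address must be a multiple of 16.
inline __m128 load_aligned(const AlignedVec& a)
{
    assert(reinterpret_cast<std::uintptr_t>(&a.v[0]) % 16 == 0);
    return _mm_load_ps(&a.v[0]);
}

// Unaligned load: accepts any float pointer with four readable elements.
inline __m128 load_any(const float* p)
{
    return _mm_loadu_ps(p);
}
```

If the alignment of the data cannot be guaranteed, _mm_loadu_ps is the variant that does not depend on it.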

NOTE: Using VS2013 and including smmintrin.h.

Current Code using _mm_load_ps:

float Vector3::Dot(const Vector3 & v) const
{
    __m128 a = _mm_load_ps(&this->v[0]);
    __m128 b = _mm_load_ps(&v.v[0]);

    __m128 res = _mm_dp_ps(a, b, 0x71);
    return res.m128_f32[0];
}

Question Code:

float Vector3::Dot(const Vector3 & v) const
{
    __m128 res = _mm_dp_ps(*(__m128*)&this->v[0], *(__m128*)&v.v[0], 0x71);
    return res.m128_f32[0];
}
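For comparison, here is a sketch of the same dot product that keeps the explicit loads and reads the result with _mm_cvtss_f32 instead of the MSVC-specific m128_f32 member. Vec3Sketch, dot3 and the SKETCH_SSE41 macro are hypothetical names. The macro is an assumption that only matters on GCC or Clang, where it enables SSE4.1 for this one function; on MSVC it expands to nothing.

```cpp
#include <smmintrin.h>  // SSE4.1: _mm_dp_ps (also pulls in the SSE headers)

// Assumption: on GCC/Clang, enable SSE4.1 for the function below without
// requiring a command-line flag. MSVC needs no such annotation.
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_SSE41 __attribute__((target("sse4.1")))
#else
#define SKETCH_SSE41
#endif

// Hypothetical 16-byte-aligned stand-in for the question's Vector3.
struct alignas(16) Vec3Sketch
{
    float v[4];
};

SKETCH_SSE41 inline float dot3(const Vec3Sketch& a, const Vec3Sketch& b)
{
    // Aligned loads are valid here because Vec3Sketch is alignas(16).
    const __m128 va = _mm_load_ps(a.v);
    const __m128 vb = _mm_load_ps(b.v);

    // Mask 0x71: multiply lanes 0..2 (x, y, z), write the sum to lane 0.
    const __m128 r = _mm_dp_ps(va, vb, 0x71);

    // Extract lane 0 as a scalar float.
    return _mm_cvtss_f32(r);
}
```

With a = (1, 2, 3, 4) and b = (10, 20, 30, 40), the w lane is masked out and dot3 returns 1*10 + 2*20 + 3*30 = 140.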

Edit: I did some testing

Using this simple console application I ran three different tests: the first using _mm_load_ps, the second casting the structure to the __m128 type, and the last using a __m128 member inside a union.

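#include <windows.h>    // GetTickCount64
#include <tchar.h>      // _tmain, _TCHAR
#include <smmintrin.h>  // __m128 and the SSE intrinsics

union Vector4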
{
    Vector4(float x, float y, float z, float w) { a = x; b = y; c = z; d = w; }
    struct {float a, b, c, d;};
    __m128 m;
};

int _tmain(int argc, _TCHAR* argv[])
{
    const Vector4 vector_a = Vector4(1.0f, 2.0f, 3.0f, 4.0f);
    const Vector4 vector_b = Vector4(10.0f, 20.0f, 30.0f, 40.0f);

    unsigned long long start;

    // : Test Using _mm_load_ps :
    start = GetTickCount64();
    for (unsigned long long i = 0; i < 10000000000U; i++)
    {
        __m128 mx = _mm_load_ps((float*)&vector_a);
        __m128 my = _mm_load_ps((float*)&vector_b);

        __m128 res_a = _mm_add_ps(mx, my);
    }
    unsigned long long end_a = GetTickCount64() - start;

    // : Test Using Direct Cast to __m128 type :
    start = GetTickCount64();
    for (unsigned long long i = 0; i < 10000000000U; i++)
    {
        __m128 res_b = _mm_add_ps(*(__m128*)&vector_a, *(__m128*)&vector_b);
    }
    unsigned long long end_b = GetTickCount64() - start;

    // : Test Using __m128 type in Union :
    start = GetTickCount64();
    for (unsigned long long i = 0; i < 10000000000U; i++)
    {
        __m128 res_c = _mm_add_ps(vector_a.m, vector_b.m);
    }
    unsigned long long end_c = GetTickCount64() - start;

    return 0;
}

The results were:

end_a : 26489 ticks
end_b : 19375 ticks
end_c : 18767 ticks

I stepped through the code as well, and all of the results res_a to res_c were correct. So this test suggests that using the union is the fastest of the three.

My understanding is that __m128 is, by default, a reference to the registers used rather than an ordinary type, but when smmintrin.h is included, __m128 becomes a union, which is defined in xmmintrin.h as:

typedef union __declspec(intrin_type) _CRT_ALIGN(16) __m128 {
     float               m128_f32[4];
     unsigned __int64    m128_u64[2];
     __int8              m128_i8[16];
     __int16             m128_i16[8];
     __int32             m128_i32[4];
     __int64             m128_i64[2];
     unsigned __int8     m128_u8[16];
     unsigned __int16    m128_u16[8];
     unsigned __int32    m128_u32[4];
} __m128;

So my belief is that the intrinsics are not referencing the registers directly but rather the __m128 type defined in xmmintrin.h.

So, to restate my question after this test: is it safe to put the __m128 type defined in xmmintrin.h inside a structure and use it with the intrinsic functions available in Visual Studio 2013?
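One arrangement that avoids both the cast and the union is to hold the __m128 as a plain member and go through intrinsics for every access. This is only a sketch of that idea, not a statement about what VS2013 guarantees. Vec4Sketch and add are hypothetical names, and the sketch assumes instances live on the stack or in static storage, where the compiler applies the alignment of __m128 itself. Heap allocation is a separate concern that it does not address.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_set_ps, _mm_add_ps, _mm_storeu_ps

// Hypothetical wrapper that stores the SIMD value directly.
struct Vec4Sketch
{
    __m128 m;

    // _mm_set_ps takes its arguments from the highest lane to the lowest,
    // so x ends up in lane 0.
    Vec4Sketch(float x, float y, float z, float w)
        : m(_mm_set_ps(w, z, y, x)) {}

    explicit Vec4Sketch(__m128 v) : m(v) {}

    // Copy the four lanes out to a plain float array (no alignment needed).
    void store(float out[4]) const { _mm_storeu_ps(out, m); }
};

// Lane-wise addition, passing the members straight to the intrinsic.
inline Vec4Sketch add(const Vec4Sketch& a, const Vec4Sketch& b)
{
    return Vec4Sketch(_mm_add_ps(a.m, b.m));
}
```

Adding (1, 2, 3, 4) and (10, 20, 30, 40) this way and calling store yields 11, 22, 33, 44, matching the values used in the test program above.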

The cast is ugly, slow, and undefined behavior. Why wouldn't you use the load intrinsic? — Ben Voigt, Feb 02 '15 at 05:44
Check the update. The thread in which you marked this as having an answer did not quite help. — Haydn Trigg, Feb 02 '15 at 06:39
