
I was reading this on MSDN, and it says

You should not access the __m128i fields directly. You can, however, see these types in the debugger. A variable of type __m128i maps to the XMM[0-7] registers.

However, it doesn't explain why. Why is that? For example, is the following "bad"?

#include <emmintrin.h>

void func(unsigned short x, unsigned short y)
{
    __m128i a;
    a.m128i_i64[0] = x;  // MS-specific field access; upper 64 bits left uninitialized

    __m128i b;
    b.m128i_i64[0] = y;

    // Now do something with a and b ...
}

Instead of doing the assignments like in the example above, should one use some sort of load function?

Gideon
    The fields are Microsoft-specific. Of course they don't care about that since they'll love to lock you into their compiler. The real reason is for performance. There's no efficient way to access the individual elements of an SSE register. SSE4.1 has instructions to do it, but the index must be a compile-time constant. – Mysticial Apr 04 '14 at 17:49

1 Answer


The m128i_i64 field and its relatives are Microsoft compiler-specific extensions. They don't exist in most other compilers.

Nevertheless, they are useful for testing purposes.


The real reason for avoiding their use is performance. The hardware cannot efficiently access individual elements of a SIMD vector.

  • Prior to SSE4.1, there are no instructions that let you directly access individual elements. SSE4.1 adds such instructions, but the index must be a compile-time constant.
  • Going through memory may incur a very large penalty due to failure of store forwarding.

AVX and AVX2 don't extend the SSE4.1 instructions to allow accessing elements in a 256-bit vector. And as far as I can tell, AVX512 will not have it for 512-bit vectors either.

Likewise, the set intrinsics (such as _mm256_set_pd()) suffer the same issue. They are implemented either as a series of data-shuffling operations, or by going through memory and taking the store-forwarding stalls.


Which raises the question: Is there an efficient way to populate a SIMD vector from scalar components? (Or to separate a SIMD vector into scalar components?)

Short Answer: Not really. When you use SIMD, you're expected to do a lot of the work in the vectorized form. So the initialization overhead should not matter.

Mysticial
  • It's good to see an answer by you again Mystical on SIMD. The wiki link on store forwarding is interesting. – Z boson Apr 04 '14 at 19:14
  • Yeah. Store-forwarding is a pretty big deal on modern processors. Without it you pay 20+ cycle penalties for reads after writes. Unfortunately, it tends to fail when you try to read memory with a different size than it was written with. Newer processors are better in that the read can be forwarded as long as it's completely contained within a pending write. But set intrinsics go the other way. And the store units are currently not capable of coalescing smaller stores into a large one so that it can be forwarded to a larger load. – Mysticial Apr 04 '14 at 19:33
  • Thanks! So in my code example, how should one load the arguments into __m128i types? From some other questions, I can see how to do it with arrays. However, loading just a simple integer seems to give me an access violation. This is probably an alignment issue, but I'm not sure how to fix it in a non-MS-specific way... – Gideon Apr 05 '14 at 09:12
  • @user3475799 If you need to load a SIMD type from different scalar sources, the set intrinsics are usually the best way to do it. The compiler will (usually) pick the lesser of the evils and generate the fastest code. Also, it shouldn't be crashing. The compiler should automatically be aligning `__m128i` if it's on the stack. If you're allocating it on the heap, then that's a different story. – Mysticial Apr 06 '14 at 04:41