I'm learning about SIMD intrinsics in C++ and I am a bit confused. Say I have a __m128 and I want to access the first element of it with __m128.m128_f32[0] (I know this is not implemented for all compilers), why is doing that, supposedly, very slow. Isn't it just a memory read, like any other? I've read some other pages, where things like Load-Hit-Store were mentioned, but I didn't really get it within the context of my question. I know doing something like this is ill-advised, and I don't intend to do it, but I am curious as to what actually causes this to be so slow.
1 Answer
SIMD vector variables are normally in XMM registers, not memory. Vector store / scalar reload is one strategy a compiler can use to implement a read of an integer element of a vector, but definitely not the only one, and usually not a good choice.
The point of this advice is that if you want a horizontal sum, write it with shuffle / add intrinsics, instead of extracting the elements one at a time and making the compiler produce worse asm than you'd get from well-chosen shuffles. See Fastest way to do horizontal float vector sum on x86 for C implementations, with the compiler-generated asm.
Writing to an element of a vector via memory would be even worse, because a vector store / overlapping scalar store / vector reload sequence causes a store-forwarding stall. But compilers aren't that dumb: they can use a `movd xmm0, eax` and a vector shuffle to merge a new element into a vector.
Your specific example of reading `__m128.m128_f32[0]` is not a good one: it's literally free, because a scalar `float` is normally kept in the low element of an XMM register (unless you're compiling 32-bit code with legacy x87 floating point for scalar). So the low element of a `__m128` vector in an XMM register already *is* a scalar `float` that the compiler can use with `addss` instructions. Calling conventions pass `float` in XMM registers, and don't require zeroing the upper elements, so there's no extra cost there.
On x86 it's not catastrophically expensive, but you definitely want to avoid it inside inner loops. For `float`, a good compiler will turn element access into shuffles, which you could also write yourself with intrinsics, ending with `float _mm_cvtss_f32(__m128 a)` (which compiles to zero instructions, as explained above).
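To make that concrete, here is a sketch of both cases (function names are mine): reading element 0 is free, and reading a higher element costs just one shuffle to bring it down to the low lane.

```cpp
#include <immintrin.h>

// Element 0: compiles to zero instructions; the low lane already
// is the scalar float.
float get0(__m128 v) {
    return _mm_cvtss_f32(v);
}

// Element 2: one shufps broadcasts element 2 to the low lane,
// then the read is free again.
float get2(__m128 v) {
    return _mm_cvtss_f32(_mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2)));
}
```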
For integer, with SSE4.1 you will hopefully get a `pextrd eax, xmm0, 3` or whatever (or a cheaper `movd eax, xmm0` for the low element).
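The intrinsics that map to those instructions look like this (a sketch; function names are mine, and note the `_mm_extract_epi32` index must be a compile-time constant):

```cpp
#include <immintrin.h>

// SSE4.1: compiles to pextrd eax, xmm0, 3
int extract_elem3(__m128i v) {
    return _mm_extract_epi32(v, 3);
}

// SSE2: compiles to the cheaper movd eax, xmm0 for the low element
int extract_elem0(__m128i v) {
    return _mm_cvtsi128_si32(v);
}
```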
On ARM, transfers between integer and vector registers are much more expensive than on x86. At least higher latency, if not bad throughput. On some ARM CPUs, the integer and vector parts of the CPU are not tightly coupled at all, and there are stalls when one side has to wait for a result from the other. (I think I've read that recent ARM, like CPUs that support AArch64, typically have much lower latency int<->SIMD.)
(You didn't tag x86 or SSE, but you did mention `__m128` for MSVC, so I mostly answered for x86.)

- Ok, so I must say I don't totally understand. Specifically, could you reiterate what you mean by vector store / scalar reload being a bad strategy, and how exactly store forwarding applies here? – Blu342 Aug 30 '18 at 02:09
- I now get that accessing the 0th f32 element shouldn't be costly, but why would accessing a higher element be costly? Shouldn't it be pretty much the same? Sorry if these are trivial questions, but I'm really trying to understand. Why would you need to shuffle to the 0th element, why not just read from the 1st, 2nd, etc. directly? As a separate question, if you can just shuffle to the 0th element very cheaply, is reading an individual element from the `__m128` without modifying it an ok thing to do? – Blu342 Aug 30 '18 at 02:18
- @Blu342: It's the vector *reload* of the scalar store where you get a store-forwarding stall. Like if you implement `_mm_set_epi32` with 4 scalar stores and a vector load. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80820 and https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833 for detailed discussion of gcc's sub-optimal code-gen strategies, with examples of how `_mm_set` compiles and should compile. – Peter Cordes Aug 30 '18 at 02:24
- @Blu342: accessing a higher element isn't very costly; it just costs a `movaps` / `shufps`. But it also means you're doing all that just for one scalar element, vs. if you can keep your code vectorized then you can be operating on more elements at once in the same or fewer instructions / uops / latency / other asm cost metric. That's why I mentioned horizontal sums as an example: for an N element vector, SIMD shuffle/add has O(log2(N)) cost vs. O(N) cost for extracting each element to scalar and adding. – Peter Cordes Aug 30 '18 at 02:27
- Right. So basically if you access a high-order element, it has to do a `movaps` to copy the `__m128`, then a mask to get the relevant bits? Or shuffle it to the 0th index, so it can be accessed directly. Is that right? Thanks for all the help btw, I really appreciate it. – Blu342 Aug 30 '18 at 02:33
- @Blu342: yes, exactly. That's why the compiler would use `shufps` to implement `float get2(__m128 x) { return x[2]; }` (GNU C syntax.) Try it on http://godbolt.org/ to see the asm output from gcc vs. MSVC (with optimization enabled), and how even writing a store+reload to an array will still optimize to a shuffle. – Peter Cordes Aug 30 '18 at 02:37
- I see. I'm still a bit confused, but I understand a lot more now thanks to you! Thanks for the help and patience. I really appreciate it. – Blu342 Aug 30 '18 at 02:40