Bypass delays when switching execution unit domains

Question

I'm trying to understand possible bypass delays when switching between execution unit domains.

For example, the following two lines of code give exactly the same result.

_mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8)));
_mm_add_ps(x, _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40));
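As a quick sanity check of the equivalence claim, here is a minimal sketch that wraps each expression in a function and compares the results bit for bit. The names shift_add_int, shift_add_float and variants_agree are hypothetical helpers introduced only for this illustration, and an x86 target with SSE2 is assumed.

```c
#include <assert.h>
#include <string.h>
#include <emmintrin.h>  /* SSE2: _mm_slli_si128 and the cast intrinsics */

/* Integer-domain byte shift: moves x left by 8 bytes (two floats),
   filling the low two lanes with zero, then adds the result to x. */
static __m128 shift_add_int(__m128 x) {
    return _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8)));
}

/* Float-domain shuffle: immediate 0x40 selects {0, 0, x[0], x[1]},
   which is then added to x. */
static __m128 shift_add_float(__m128 x) {
    return _mm_add_ps(x, _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40));
}

/* Hypothetical helper: returns 1 if both variants agree bit for bit
   on the input vector {a, b, c, d} (a is lane 0). */
static int variants_agree(float a, float b, float c, float d) {
    float r1[4], r2[4];
    __m128 x = _mm_setr_ps(a, b, c, d);
    _mm_storeu_ps(r1, shift_add_int(x));
    _mm_storeu_ps(r2, shift_add_float(x));
    return memcmp(r1, r2, sizeof r1) == 0;
}
```

For the input {1, 2, 3, 4} both functions should yield {1, 2, 4, 6}, so assert(variants_agree(1.0f, 2.0f, 3.0f, 4.0f)) is expected to pass.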

Which line of code is better to use?

The assembly output for the first line gives:

vpslldq xmm1, xmm0, 8
vaddps  xmm0, xmm1, xmm0

The assembly output for the second line gives:

vshufps xmm1, xmm0, XMMWORD PTR [rcx], 64   ; 00000040H
vaddps  xmm2, xmm1, XMMWORD PTR [rcx]

Now if I look at Agner Fog's microarchitecture manual, he gives an example on page 112 of using an integer shuffle (pshufd) on float values versus using a float shuffle (shufps) on float values. Switching domains adds 4 extra clock cycles, so the solution using shufps is better.

The first line of code I listed using _mm_slli_si128 has to switch domains between integer and float vectors. The second using _mm_shuffle_ps stays in the same domain. Doesn't this imply that the second line of code is the better solution?

No, not yet. But I have some code to do it with. If you want to see why I'm interested see the answer [here](http://stackoverflow.com/questions/19494114/parallel-prefix-cumulative-sum-with-sse) and the prefix_sum_SSE function. — Z boson, Oct 24 '13 at 15:25

score 8 · Accepted Answer · edited Dec 24 '16 at 20:38

Section 2.1.4 in the Intel optimization guide indicates that you (and Agner) are quite right on this matter:

When a source of a micro-op executed in one stack comes from a micro-op executed in another stack, a one- or two-cycle delay can occur. The delay occurs also for transitions between Intel SSE integer and Intel SSE floating-point operation.

So in general it seems you'd be better off keeping within the same stack/domain as much as possible.

Of course, benchmarking is always preferred, and all of this is worth handling only if it is actually a bottleneck in your code.
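A rough way to benchmark this is to put each variant in a loop-carried dependency chain, so that latency (including any bypass delay) dominates the run time. The following is only a sketch under stated assumptions: run_chain and time_chain are hypothetical helper names, clock() is a coarse timer, and a compiler is free to pick a different shuffle instruction than the intrinsic suggests, so the generated assembly should be checked before trusting the numbers.

```c
#include <assert.h>
#include <string.h>
#include <time.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Runs a dependency chain: each iteration needs the previous result,
   so the loop is bound by instruction latency rather than throughput.
   use_int_shift != 0 selects the integer-domain byte shift,
   otherwise the float-domain shuffle is used. */
static __m128 run_chain(__m128 x, long iters, int use_int_shift) {
    if (use_int_shift) {
        for (long i = 0; i < iters; ++i)
            x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8)));
    } else {
        for (long i = 0; i < iters; ++i)
            x = _mm_add_ps(x, _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40));
    }
    return x;
}

/* Hypothetical helper: returns elapsed CPU seconds for one variant.
   The final vector is stored to out so the result stays live. */
static double time_chain(long iters, int use_int_shift, float out[4]) {
    __m128 x = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
    clock_t t0 = clock();
    x = run_chain(x, iters, use_int_shift);
    clock_t t1 = clock();
    _mm_storeu_ps(out, x);
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Calling time_chain with a large iteration count (say 100000000) once with use_int_shift set to 1 and once with 0 gives two timings to compare. Any difference depends on the microarchitecture and on which instructions the compiler actually emitted.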

edited Dec 24 '16 at 20:38 by Margaret Bloom

answered Oct 24 '13 at 19:12 by Leeor

Thank you for the answer. In my current case it does not make a difference in performance, but I was mostly interested in having a discussion on the subject since I'm still learning about it. – Z boson Oct 28 '13 at 07:05
