I'm trying to understand the possible bypass delays incurred when switching between execution-unit domains.

For example, the following two lines of code give exactly the same result.

_mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8)));
_mm_add_ps(x, _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40));
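
To check that the two expressions really agree, here is a minimal sketch in C. The helper names (`add_shifted_int`, `add_shifted_fp`, `variants_agree`, `check_example`) are made up for illustration, and it assumes an x86 target with SSE2. Both variants add {0, 0, x0, x1} to x, so the input {1, 2, 3, 4} should give {1, 2, 4, 6}.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics, including the integer byte shift */
#include <string.h>

/* Variant 1: byte shift in the integer domain, then a float add. */
static __m128 add_shifted_int(__m128 x) {
    return _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8)));
}

/* Variant 2: float shuffle against a zero vector, then a float add. */
static __m128 add_shifted_fp(__m128 x) {
    return _mm_add_ps(x, _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40));
}

/* Returns 1 if both variants give bitwise-identical results for {a, b, c, d}. */
static int variants_agree(float a, float b, float c, float d) {
    __m128 x = _mm_setr_ps(a, b, c, d);
    float r1[4], r2[4];
    _mm_storeu_ps(r1, add_shifted_int(x));
    _mm_storeu_ps(r2, add_shifted_fp(x));
    return memcmp(r1, r2, sizeof r1) == 0;
}

/* Hypothetical self-check on one hand-computed input. */
static void check_example(void) {
    float r[4];
    _mm_storeu_ps(r, add_shifted_fp(_mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f)));
    assert(r[0] == 1.0f && r[1] == 2.0f);  /* low lanes: x + 0 */
    assert(r[2] == 4.0f && r[3] == 6.0f);  /* high lanes: x2 + x0, x3 + x1 */
    assert(variants_agree(1.0f, 2.0f, 3.0f, 4.0f));
}
```

This only confirms that the results match. It says nothing about which variant is faster.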

Which line of code is better to use?

The assembly output for the first line gives:

vpslldq xmm1, xmm0, 8
vaddps  xmm0, xmm1, xmm0

The assembly output for the second line gives:

vshufps xmm1, xmm0, XMMWORD PTR [rcx], 64   ; 00000040H
vaddps  xmm2, xmm1, XMMWORD PTR [rcx]

Now, if I look at Agner Fog's microarchitecture manual, he gives an example on page 112 of using an integer shuffle (pshufd) on float values versus using a float shuffle (shufps) on float values. Switching domains adds 4 extra clock cycles, so the solution using shufps is better.

The first line of code I listed using _mm_slli_si128 has to switch domains between integer and float vectors. The second using _mm_shuffle_ps stays in the same domain. Doesn't this imply that the second line of code is the better solution?

Z boson
  • Have you tried benchmarking this? – Leeor Oct 24 '13 at 15:07
  • No, not yet. But I have some code to do it with. If you want to see why I'm interested see the answer [here](http://stackoverflow.com/questions/19494114/parallel-prefix-cumulative-sum-with-sse) and the prefix_sum_SSE function. – Z boson Oct 24 '13 at 15:25

1 Answer

Section 2.1.4 in the Intel optimization guide indicates that you (and Agner) are quite right on this matter -

When a source of a micro-op executed in one stack comes from a micro-op executed in another stack, a one- or two-cycle delay can occur. The delay occurs also for transitions between Intel SSE integer and Intel SSE floating-point operation.

So in general it seems you'd be better off keeping within the same stack/domain as much as possible.

Of course, benchmarking is always preferable, and all of this is worth addressing only if it is actually a bottleneck in your code.
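
If you do want to measure it, a rough latency test could look like the sketch below. It assumes an x86 target with SSE2; the function names `time_chain_int` and `time_chain_fp` are hypothetical, and `clock()` is only a coarse timer. Each loop iteration depends on the previous one, so any bypass delay should show up in the total time. Inspect the generated assembly first, since a compiler may rewrite either shuffle into a different instruction.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <time.h>

/* Dependent chain using the integer-domain byte shift.
   Stores the final vector to sink[0..3] so the loop is not optimized away,
   and returns the elapsed CPU time in seconds. */
static double time_chain_int(long iters, float *sink) {
    __m128 x = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
    clock_t t0 = clock();
    for (long i = 0; i < iters; ++i)
        x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8)));
    clock_t t1 = clock();
    _mm_storeu_ps(sink, x);
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}

/* Same dependent chain, staying in the floating-point domain. */
static double time_chain_fp(long iters, float *sink) {
    __m128 x = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
    clock_t t0 = clock();
    for (long i = 0; i < iters; ++i)
        x = _mm_add_ps(x, _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40));
    clock_t t1 = clock();
    _mm_storeu_ps(sink, x);
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Call both with a large iteration count (hundreds of millions, say), several times each, and compare the times. A throughput-bound loop with independent operations may show no difference at all, because the bypass delay adds latency rather than occupying an execution port.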

Margaret Bloom
Leeor
  • Thank you for the answer. In my current case it does not make a difference in performance, but I was mostly interested in having a discussion on the subject since I'm still learning about it. – Z boson Oct 28 '13 at 07:05