
If I am not mistaken,

_mm_shuffle_pd(x, y, _MM_SHUFFLE2(0, 1));

and

_mm_move_sd(x, y);

and also _mm_blend_pd, from a later instruction set, should all do the same thing.

But clang and gcc generate different instructions for SSE2 (godbolt), and they replace movps with a blend if SSE4.2 is available (godbolt).

Is there a reason I should choose one over the other?

Denis Yaroshevskiy
  • 2
    Those two statements don't do the same thing. The low half of `_mm_move_sd`'s output comes from its second operand, high half from the first, opposite of `_mm_shuffle_pd(x,y, int)`. (I made test callers that passed constants to check what, if any, difference there might be, since I didn't spot that from just looking at them: https://godbolt.org/z/3nqjdnroK clang helpfully prints comments showing the values after constant-propagation.) If they were identical shuffles, then yeah, clang's shuffle optimizer would almost certainly have made the same asm, so I suspected they weren't. – Peter Cordes Jun 26 '23 at 00:24
  • But they are the same, you just might need to change the order of operands. Look, just movsd twice: https://godbolt.org/z/P8MK38vEs – Denis Yaroshevskiy Jun 26 '23 at 00:27
  • 2
    To make them identical (https://godbolt.org/z/qh9Ex67oa), I also needed to change the shuffle constant. So `movsd x,y` and `_mm_shuffle_pd(y, x, _MM_SHUFFLE2(1, 0));` Then clang uses its preferred `shufps` in both cases, and GCC uses `movsd` in both cases. Both ways are the same machine-code size, and `movsd` would be faster on ancient CPUs like Conroe where the shuffle unit is only 64 bits wide (but 64-bit granularity shuffles have no problem.) Did you check https://uops.info/ and check their throughputs on various CPUs that are relevant for `-mtune=generic`? – Peter Cordes Jun 26 '23 at 00:29
  • 2
    Oh cool, clang's shuffle optimizer takes into account avoiding a `movaps` register-copy, using `movsd` or `shufps` so XMM0 (first arg and return value) can be the destination. – Peter Cordes Jun 26 '23 at 00:30
  • 3
    Immediate blends are very cheap; IDK why `movsd x,x` still decodes to a shuffle uop for port 5 only, instead of a `blendpd` uop, on CPUs like Nehalem or Skylake. Compilers are generally correct to favour them. It looks like Ice Lake finally got around to optimizing `movsd x,x` into 1 uop for any of p015. Zen runs it on any vector ALU port. – Peter Cordes Jun 26 '23 at 00:35
  • 1
    Why did they need a new instruction instead of making `_mm_shuffle_pd` better? – Denis Yaroshevskiy Jun 26 '23 at 01:17
  • 1
    `movsd` and `shufpd` were both introduced in SSE2. IDK why `movsd xmm,xmm` is a merge instead of a zero-extending move like SSE2 `movq`, since `shufpd` can already do that rarely-needed merging. Probably for consistency with the SSE1 design choice for `movss x,x` being a merge, but a load zero-extending. On PIII and Pentium-M with 64-bit SIMD execution units, `movss` was 1 uop but `movaps` was 2, since it writes the high half. AMD K8 is the same. – Peter Cordes Jun 26 '23 at 01:27
  • 1
    (IDK if compilers at the time ever generated scalar code using `movss` instead of `movaps`, suffering a false dependency which even future CPUs wouldn't avoid, in exchange for better throughput. Probably not since P6 family had register-read stalls for "cold" registers, the ones you'd want to pick for a move destination to make the output dependency not a problem.) – Peter Cordes Jun 26 '23 at 01:28
  • 1
    As for why `shufpd` doesn't use two 2-bit fields that select from all 4 elements across 2 registers, instead of its current design of two 1-bit fields where the low element selects from within the first reg and the high element selects from within the second: that's the same pattern of data movement as `shufps`. That's not a very good argument, but Intel's ISA design at the time often seemed poorly thought out, or aimed just at simplifying decoding, like `movhpd` being the same as `movhps` with just a prefix in front of the same opcode. – Peter Cordes Jun 26 '23 at 01:37
  • 1
    Also, an actual reason: it takes fewer wires and transistors if each output bit only has to pick from 2 possible input bits, instead of from any of 4 bits across 4 elements of 2x 128-bit inputs. Making `shufpd` more powerful would have cost more transistors, and probably more importantly, more wires crossing each other in a tight space. – Peter Cordes Jun 26 '23 at 01:39
  • 1
    I was trying to remember where I'd recently written about shuffles that can only pick elements from one vector for part of the output. Just found it: [AV512: Best way to combine horizontal sum and broadcast](https://stackoverflow.com/a/76402515) - `vshuff64x2 z,z,z,imm8` (128-bit shuffle granularity) is like `shufps` in how the immediate is used. And there are cases where you'd expect CPUs to run a shuffle like `vpermilpd` at least as efficiently as `vshufpd` (e.g. duplicating the input and feeding the same immediate), but no, on Ice Lake it runs on fewer ports. – Peter Cordes Jun 26 '23 at 01:52
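
The operand-order point from the comments can be checked with a small sketch. It assumes an x86 target with SSE2 and a C99 compiler. The names `lanes2`, `to_lanes`, `model_shufpd`, `model_movsd` and the three wrapper functions are hypothetical helpers made up for this illustration; only the intrinsics and `_MM_SHUFFLE2` come from the compiler headers. The scalar models follow the documented behaviour of `shufpd` (bit 0 of the immediate picks the low output from the first source, bit 1 picks the high output from the second) and `movsd` (low half from the second operand, high half from the first).

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2: _mm_move_sd, _mm_shuffle_pd, _mm_set_pd */
#ifdef __SSE4_1__
#include <smmintrin.h> /* SSE4.1: _mm_blend_pd, only if the build enables it */
#endif

/* Hypothetical helper type: the two double lanes of a vector, low lane first. */
typedef struct { double lo, hi; } lanes2;

static lanes2 to_lanes(__m128d v) {
    double out[2];
    _mm_storeu_pd(out, v); /* out[0] is the low lane */
    lanes2 r = { out[0], out[1] };
    return r;
}

/* Scalar model of shufpd: imm bit 0 selects within a, imm bit 1 within b. */
static lanes2 model_shufpd(lanes2 a, lanes2 b, int imm) {
    lanes2 r;
    r.lo = (imm & 1) ? a.hi : a.lo;
    r.hi = (imm & 2) ? b.hi : b.lo;
    return r;
}

/* Scalar model of movsd reg,reg: low lane from b, high lane kept from a. */
static lanes2 model_movsd(lanes2 a, lanes2 b) {
    lanes2 r = { b.lo, a.hi };
    return r;
}

/* The shuffle exactly as written in the question. */
static lanes2 shuffle_as_asked(__m128d x, __m128d y) {
    return to_lanes(_mm_shuffle_pd(x, y, _MM_SHUFFLE2(0, 1)));
}

/* The move exactly as written in the question. */
static lanes2 move_as_asked(__m128d x, __m128d y) {
    return to_lanes(_mm_move_sd(x, y));
}

/* Shuffle with swapped operands and immediate 2, intended to match _mm_move_sd(x, y). */
static lanes2 shuffle_matching_move(__m128d x, __m128d y) {
    return to_lanes(_mm_shuffle_pd(y, x, _MM_SHUFFLE2(1, 0)));
}

#ifdef __SSE4_1__
/* Immediate blend: bit 0 set takes the low lane from y, high lane stays from x. */
static lanes2 blend_matching_move(__m128d x, __m128d y) {
    return to_lanes(_mm_blend_pd(x, y, 1));
}
#endif

static int same(lanes2 p, lanes2 q) {
    return p.lo == q.lo && p.hi == q.hi;
}
```

With `x = _mm_set_pd(2.0, 1.0)` and `y = _mm_set_pd(4.0, 3.0)` (low lanes 1.0 and 3.0), the two calls from the question should give different lane orders, while `shuffle_matching_move` should agree with `move_as_asked`. Whether a compiler then emits `movsd`, `shufps`, or a blend for either form is a tuning choice and is not something this sketch measures.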
