12

Along with AVX, Intel introduced the VEX encoding scheme into the Intel 64 and IA-32 architectures. This encoding scheme is used mostly with AVX instructions. I was wondering whether it's okay to intermix VEX-encoded instructions with what are now called "legacy SSE" instructions.

The main reason I'm asking is code size. Consider these two instructions:

shufps xmm0, xmm0, 0
vshufps xmm0, xmm0, xmm0, 0

I commonly use the first one to "broadcast" a scalar value to all elements of an XMM register. The instruction set reference says that the only difference between these two (in this case) is that the VEX-encoded one clears the upper bits (bit 128 and above) of the corresponding YMM register. Supposing that I don't need that, what's the advantage of using the VEX-encoded version in this case? The first instruction takes 4 bytes (0F C6 C0 00), the second takes 5 (C5 F8 C6 C0 00).
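To make the one-byte difference concrete, here is a minimal C sketch that holds both encodings quoted above as byte arrays and picks apart the 2-byte VEX prefix (`C5 F8`). The names `vex2_fields`, `decode_vex2`, `legacy_shufps` and `vex_vshufps` are made up for illustration. It assumes the standard 2-byte VEX layout (inverted R bit, inverted vvvv, L, pp) and ignores the 32-bit-mode case where `C5` can also be the `LDS` opcode.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fields of a 2-byte VEX prefix: 0xC5 followed by one payload byte. */
typedef struct {
    unsigned r;    /* extension bit for ModRM.reg (stored inverted in the prefix) */
    unsigned vvvv; /* extra source register number (stored inverted in the prefix) */
    unsigned l;    /* vector length: 0 = 128-bit, 1 = 256-bit */
    unsigned pp;   /* implied legacy prefix: 0 = none, 1 = 66, 2 = F3, 3 = F2 */
} vex2_fields;

/* Encodings taken from the question. */
const uint8_t legacy_shufps[] = { 0x0F, 0xC6, 0xC0, 0x00 };       /* shufps xmm0, xmm0, 0 */
const uint8_t vex_vshufps[]   = { 0xC5, 0xF8, 0xC6, 0xC0, 0x00 }; /* vshufps xmm0, xmm0, xmm0, 0 */

/* Decode a 2-byte VEX prefix at the start of `code`.
 * Returns 0 on success, -1 if the bytes do not start with 0xC5. */
int decode_vex2(const uint8_t *code, size_t len, vex2_fields *out)
{
    assert(code != NULL && out != NULL);
    if (len < 2 || code[0] != 0xC5)
        return -1;

    uint8_t p = code[1];
    out->r    = ((p >> 7) & 1u) ^ 1u;     /* bit 7, un-inverted */
    out->vvvv = ((p >> 3) & 0xFu) ^ 0xFu; /* bits 6..3, un-inverted */
    out->l    = (p >> 2) & 1u;            /* bit 2 */
    out->pp   = p & 3u;                   /* bits 1..0 */
    return 0;
}
```

Decoding `vex_vshufps` gives `vvvv = 0` (xmm0 as the extra source operand), `l = 0` (128-bit) and `pp = 0` (no implied prefix). In other words, the 2-byte VEX prefix replaces the one-byte `0F` escape, which is where the extra byte comes from.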

Thanks for all the answers in advance.

Daniel Kamil Kozar

2 Answers

12

On current implementations, if (at least) the upper halves have been reset (VZEROUPPER or VZEROALL) there is no penalty for using legacy SSE instructions.

As detailed on page 128 in Agner Fog: optimizing subroutines in assembly, using legacy SSE instructions while (some) upper halves are in use carries a performance penalty. This penalty is incurred once when entering the state where YMM registers are split in the middle, and once again when leaving that state.

Mixing VEX-encoded 128-bit instructions and legacy SSE instructions is not a problem.

harold
  • 1
    See also [Why is this SSE code 6 times slower without VZEROUPPER on Skylake?](https://stackoverflow.com/q/41303780) for the different penalty for having dirty uppers while using legacy SSE instructions, compared to Haswell and Ice Lake mechanism where it's only on transitions. – Peter Cordes Nov 04 '20 at 04:16
  • 1
    According to Intel, in Golden Cove CPUs (Alder Lake-P, Raptor Lake-P, Sapphire Rapids) this penalty has been increased compared to previous CPUs. "The Golden Cove CPU microarchitecture has increased the cost of mixing Legacy SSE and VEX without clearing the state of upper registers for power efficiency reasons. When possible, use VEX-encoded instructions for all the SIMD instructions when possible". [Section 3.11.5](https://www.intel.com/content/www/us/en/content-details/671488/intel-64-and-ia-32-architectures-optimization-reference-manual.html) of the Intel software optimization manual. – Vladislav Kogan Mar 28 '23 at 02:42
-1

It's not safe. According to Intel's Software Developer's Manual, the VEX.128 version zeroes the upper half of the YMM register, while the legacy SSE version doesn't. Worse, some assemblers (like gas) may convert SHUFPS into VSHUFPS when creating the object file (when the -mavx flag is applied). I ran into exactly this problem while working with an assembly file.

phuclv
Sujon
  • 1
  • 2
  • 1
    `gcc -mavx -c foo.s` still assembles the instructions as written. There is a GAS option `-msse2avx` ([`as(1)` man page](https://man7.org/linux/man-pages/man1/as.1.html), *not* a `gcc` option unless you use `-Wa,-msse2avx`). GCC doesn't pass it on my system with `-mavx`, and it's not on by default. I'd be surprised if any distro ever had gcc pass it when you didn't explicitly use it, but I guess it's plausible. – Peter Cordes Nov 04 '20 at 04:29