10

Our 64-bit application has lots of code (inter alia, in standard libraries) that uses the xmm0-xmm7 registers in SSE mode.

I would like to implement fast memory copy using ymm registers. I cannot modify all the code that uses xmm registers to add the VEX prefix, and I also think that this is not practical, since it would increase the size of the code and could make it run slower because the CPU would need to decode larger instructions.

I just want to use two ymm registers (and possibly zmm - the affordable processors supporting zmm are promised to be available this year) for fast memory copy.

The question is: how can I use the ymm registers while avoiding the transition penalties?

Will the penalty occur if I use just the ymm8-ymm15 registers (not ymm0-ymm7)? SSE originally had eight 128-bit registers (xmm0-xmm7), but in 64-bit mode xmm8-xmm15 are also available to non-VEX-prefixed instructions. However, I have reviewed our 64-bit application and it only uses xmm0-xmm7, since it also has a 32-bit version with almost the same code.

Does the penalty only occur when the CPU actually uses an xmm register that had previously been used as a ymm register and has any of its upper 128 bits non-zero? Isn't it better to just zeroize the ymm registers I have used after the fast memory copy?

For example, suppose I have used a ymm register once to copy 32 bytes of memory - what is the fastest way to zeroize it? Is `vpxor ymm15, ymm15, ymm15` fast enough? (AFAIK, vpxor can be executed on any of the three ALU execution ports, p0/p1/p5, while vxorpd can only be executed on p5.) Wouldn't the time to zeroize it exceed the gain of using it to copy just 32 bytes of memory?
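For concreteness, the pattern I have in mind looks something like this (a sketch; the registers and addressing are illustrative):

```asm
; Copy 32 bytes with one ymm register, then try to return to SSE code.
vmovdqu ymm15, [rsi]           ; 32-byte load
vmovdqu [rdi], ymm15           ; 32-byte store
vpxor   ymm15, ymm15, ymm15    ; zeroize the register I used - is this
                               ; enough, or is vzeroupper still required?
```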

Peter Cordes
Maxim Masiutin
  • What kind of memory copies are you looking at? Are you sure this is your bottleneck, and that the AVX-SSE difference would be the solution? Are you doing lots of tight L1 loads/stores? http://stackoverflow.com/questions/18314523/sse-copy-avx-copy-and-stdcopy-performance incidentally also includes a discussion of this issue, together with answers to the main question asked there. – cnettel May 09 '17 at 21:20
  • cnettel - our application uses a compiler whose standard library has a very good implementation of memory copy for the 32-bit target, but for 64-bit it just has a for loop in a high-level language, and we want to improve it. For fewer than 8 bytes, it has a case statement with 8 different copies depending on the data size, and for 8 bytes and above it just has a loop. If the size is not a multiple of 8, the last block overlaps. That's it. So we want to improve it a little bit. – Maxim Masiutin May 09 '17 at 21:30
  • @cnettel (see above) – Maxim Masiutin May 09 '17 at 23:13
  • Also related: [Why is this SSE code 6 times slower without VZEROUPPER on Skylake?](//stackoverflow.com/q/41303780) for more details about what the transition penalties actually are. – Peter Cordes Aug 25 '19 at 00:56
  • Also [Is it useful to use VZEROUPPER if your program+libraries contain no SSE instructions?](https://stackoverflow.com/q/49019614) / [Does Skylake need vzeroupper for turbo clocks to recover after a 512-bit instruction that only reads a ZMM register, writing a k mask?](https://stackoverflow.com/q/58568514) – Peter Cordes Jun 29 '20 at 07:47

5 Answers

18

Another possibility is to use registers zmm16 - zmm31. These registers have no non-VEX counterpart. There is no state transition and no penalty for mixing zmm16 - zmm31 with non-VEX SSE code. These 512-bit registers are only available in 64-bit mode and only on processors with AVX-512.
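A sketch of what such a copy could look like (these registers are EVEX-only, so legacy-SSE code elsewhere is unaffected; the 256-bit form requires AVX-512VL):

```asm
; ymm16 has no legacy xmm alias reachable by non-VEX code, so there is
; no dirty-upper state to clean up afterwards.
vmovdqu64 ymm16, [rsi]         ; EVEX-encoded 32-byte load
vmovdqu64 [rdi], ymm16         ; EVEX-encoded 32-byte store
; no vzeroupper required before returning to non-VEX SSE code
```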

A Fog
14

The optimal solution is probably to recompile all the code with VEX prefixes. The VEX coded instructions are mostly the same size as the non-VEX versions of the same instructions because the non-VEX instructions carry a legacy of a lot of prefixes and escape codes (due to a long history of short-sighted patches in the instruction coding scheme). The VEX prefix combines all the old prefixes and escape codes into a single prefix of two or three bytes (four bytes for AVX512).
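For example, the VEX encoding of a typical SSE instruction is no larger than the legacy encoding (byte sequences shown for illustration):

```asm
addpd  xmm0, xmm1        ; legacy SSE: 66 0F 58 C1  (4 bytes)
vaddpd xmm0, xmm0, xmm1  ; VEX:        C5 F9 58 C1  (4 bytes)
```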

A VEX/non-VEX transition works in different ways on different processors (see [Why is this SSE code 6 times slower without VZEROUPPER on Skylake?](https://stackoverflow.com/q/41303780)):

Older Intel processors: The VZEROUPPER instruction is needed for a clean transition between different internal states in the processor.

On Intel Skylake or later processors: The VZEROUPPER is needed to avoid a false dependence of a non-VEX instruction on the upper part of the register.

On current AMD processors: A 256-bit register is treated as two 128-bit registers. The VZEROUPPER is not needed, except for compatibility with Intel processors. The cost of VZEROUPPER is approximately 6 clock cycles.

The advantage of using VEX prefixes on all your instructions is that you avoid these transition costs on all processors. Your legacy code can probably benefit from some 256-bit operations here and there in the hot innermost loop.

The disadvantage of VEX prefixes is that the code is incompatible with old processors, so you might need to preserve your old version for running on old processors.

A Fog
  • Do you mean that the cost of VZEROUPPER is approximately 6 clock cycles on AMD? Or on Intel? Or on any of the processors? We cannot currently recompile all the code for VEX because we are using lots of code: both the standard libraries of Delphi and third-party libraries that we use contain lots of inline assembler code that uses MMX/SSE, and we are not willing to modify all this code. – Maxim Masiutin May 11 '17 at 07:04
7

To avoid the penalties on all architectures, you just need to issue vzeroall or vzeroupper after the part of your code that uses VEX-encoded instructions, before returning to the rest of the code that uses non-VEX instructions.

Issuing those instructions is considered good practice for all AVX-using routines anyway, and it is cheap - except perhaps on Knights Landing, but I doubt you are using that architecture. Even if you are, the performance characteristics are quite different from those of the desktop/Xeon family, so you'll probably want a separate compile for it anyway.

These are the only instructions that move the CPU from the dirty-upper to the clean-upper state. You can't simply zero out specific registers that you've used, because the chip doesn't track the dirty state on a register-by-register basis.

The cost of these vzero* instructions is a few cycles, so if whatever you are doing in AVX is worth doing at all, it will generally be worth paying this small cost.
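Sketched as code, the recommended placement is once per routine, not once per use of a ymm register (labels and registers are illustrative):

```asm
copy256:                        ; bulk copy loop using ymm registers
    vmovdqu ymm0, [rsi]
    vmovdqu ymm1, [rsi + 32]
    vmovdqu [rdi], ymm0
    vmovdqu [rdi + 32], ymm1
    add     rsi, 64
    add     rdi, 64
    sub     rcx, 64
    jnz     copy256
    vzeroupper                  ; one transition back to the clean state
    ret                         ; legacy-SSE callers see no penalty
```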

BeeOnRope
6

In my experience, the best way to avoid AVX-SSE (VEX) transition penalties is to let the compiler generate code native to the target micro-architecture. For example, you can use SSE intrinsics alongside AVX intrinsics and compile with -march=native. My GCC 6.2 compiles the program using VEX-encoded instructions. If you look at the generated assembly, you will find an extra v prefix on all the translated SSE instructions. On the other hand, if you are in doubt, you can insert __asm__ __volatile__ ( "vzeroupper" : : : ); at any point of your program after using ymm registers, but you should be careful with it.
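As an illustration (not actual compiler output), the same 128-bit intrinsics are emitted differently depending on the target:

```asm
; Compiled without AVX enabled - legacy SSE encodings:
movaps  xmm0, [rsi]
addps   xmm0, xmm1

; The same intrinsics compiled with -mavx / -march=native - VEX
; encodings (note the extra "v"), so no SSE/AVX transitions can occur:
vmovaps xmm0, [rsi]
vaddps  xmm0, xmm0, xmm1
```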

Amiri
3

I have found an interesting note by Agner Fog on an Intel forum at https://software.intel.com/en-us/forums/intel-isa-extensions/topic/704023

It answers the question of what happens if I just use ymm8-ymm15 while the application uses xmm0-xmm7, so that we use different registers.

Here is the quote.

I just made a few more experiments on a Haswell. It treats all vector registers as having a dirty upper half if just one ymm register has been touched. In other words, if you modify ymm1, then a non-VEX instruction writing to xmm2 will have a false dependence on the previous value of xmm2. Knights Landing has no such false dependence. Perhaps it is remembering the state of each register separately?

Hopefully, future Intel processors will either remember the state of each register separately, or at least treat zmm16-zmm31 separately so that they don't pollute xmm0-xmm15. Can you reveal something about this?

This post from 12/28/2016 was left unanswered.

There is also some interesting information about VZEROUPPER on Agner's blog at http://www.agner.org/optimize/blog/read.php?i=761

Maxim Masiutin
  • I'm not sure why you are mentioning `vpxor` - you can't use that instruction, or any single-register zeroing idiom, to reset the "upper dirty" state. It is very likely that the `vzeroupper` and `vzeroall` will continue to be quick on future desktop/server chips as it is heavily used in high performance and compiler-generated code today. It is nothing like `pushad` and I'm not sure how you drew that conclusion. – BeeOnRope May 10 '17 at 01:45
  • @BeeOnRope - Do you mean that the processor decides that the registers are dirty just when you touch them (their actual value doesn’t matter), so the only way to clear the dirty status is to execute vzeroupper or vzeroall? Is there any official documentation on this? – Maxim Masiutin May 10 '17 at 04:43
  • @BeeOnRope - I will try to measure the real speed of vzeroupper, the following way - two vmovdqa and then one vzeroupper - in a loop; and then the same loop but without vzeroupper - just to figure out whether vmovdqa+vzeroupper to copy just 32 bytes of data is faster than two 16-byte xmm-based movs that do not need vzeroupper. – Maxim Masiutin May 10 '17 at 04:55
  • Yes, based on various tests and information (including in the Intel thread I linked) - there is only a single core-wide flag: "dirty/clean" for the upper state, and any instruction using `ymm` regs sets it. So the value doesn't matter and you can't undo it with instructions that happen to zero out the high bits of some reg: you need the `vzero*` guys. The Phi is not a "normal" CPU in that people don't buy it to replace a general purpose CPU - but in one configuration it can act as a CPU. It's generally for HPC and other specialized tasks. Don't worry about it for GP code. – BeeOnRope May 10 '17 at 18:38
  • If you are just copying 32 bytes, then you have nothing to worry about - just use `xmm`: it's only one extra load/store over `ymm` so you are talking like a single cycle, and any messing about with AVX and `vzeroupper` isn't going to pay off. They make sense when you have a substantial region to copy. – BeeOnRope May 10 '17 at 18:39
  • *CPUID says no MMX or SSE* No, that was just someone on the Internet being wrong. You should probably delete [your earlier comment](https://stackoverflow.com/questions/43879935/avoiding-avx-sse-vex-transition-penalties#comment74802094_43881748) because an Intel employee had already corrected that misconception with CPUID data from a KNL system (https://github.com/xianyi/OpenBLAS/issues/991#issuecomment-273352067) before you posted a link to the thread. KNL is binary-compatible with Haswell, except for TSX. Of course it *performs* very differently from HSW for stuff like byte shuffles... – Peter Cordes May 01 '19 at 23:35
  • @PeterCordes - thank you, I have deleted the earlier comment. – Maxim Masiutin May 02 '19 at 06:39