
The documentation for vzeroall appears inconsistent. The prose says:

The instruction zeros contents of all XMM or YMM registers.

The pseudocode below that, however, indicates that in 64-bit mode only registers ymm0 through ymm15 are affected:

IF (64-bit mode)
    limit ← 15
ELSE
    limit ← 7
FOR i in 0 .. limit:
    simd_reg_file[i][MAXVL-1:0] ← 0

On machines supporting AVX-512, clearing up to ymm15 is not the same as clearing "all", because ymm16 through ymm31 exist.
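To make this concrete, here is a quick test sketch (my own, not from the SDM; it assumes AVX-512F hardware and GNU C, compiled with gcc -mavx512f). It parks nonzero data in zmm16, runs vzeroall, and checks whether the data survives:

#include <stdio.h>
#include <string.h>

int main(void) {
    float in[16], out[16];
    for (int i = 0; i < 16; i++)
        in[i] = (float)(i + 1);              /* nonzero pattern */

    /* Load zmm16, execute vzeroall, read zmm16 back. The clobber
       list covers the registers vzeroall itself zeroes. */
    __asm__ volatile(
        "vmovups %1, %%zmm16\n\t"
        "vzeroall\n\t"
        "vmovups %%zmm16, %0"
        : "=m"(*(float (*)[16])out)
        : "m"(*(const float (*)[16])in)
        : "xmm16",
          "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "xmm6", "xmm7",
          "xmm8", "xmm9", "xmm10", "xmm11", "xmm12", "xmm13", "xmm14", "xmm15");

    printf("zmm16 %s vzeroall\n",
           memcmp(in, out, sizeof in) == 0 ? "survived" : "was zeroed by");
    return 0;
}

If the pseudocode is right, zmm16 survives; if the prose is right, it gets zeroed.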

Is the prose or pseudocode correct?

BeeOnRope
  • According to Google, the pseudocode is correct. Only 0-15 are affected. The bochs implementation also says: `// clear only 16 registers even if AVX-512 is present` – Jester Jan 24 '20 at 19:32
  • @Jester, the AMD manual says the same. Probably related to processors with AVX512 support no longer requiring zeroing of the upper half of registers for performance reasons. After Broadwell, vzeroupper wasn't needed (and all the AVX512 processors came after Broadwell). I'm assuming they decided not to modify the behaviour of vzeroall and vzeroupper because these instructions weren't needed on those processors anymore, so they are there mostly for legacy reasons. – Michael Petch Jan 24 '20 at 19:45
  • @MichaelPetch: vzeroupper is still sometimes needed on Skylake; failure to use it can make SSE instructions slow (false dependency): [Why is this SSE code 6 times slower without VZEROUPPER on Skylake?](//stackoverflow.com/q/41303780). But dirtying ymm/zmm16..31 can't cause that problem because they're inaccessible with legacy SSE. (And I think they don't participate in the saved-upper state transitions which apparently Ice Lake reintroduced.) Also, SKX has a turbo effect for a dirty zmm: [Dynamically determining where a rogue AVX-512 instruction is executing](//stackoverflow.com/q/52008788) – Peter Cordes Jan 24 '20 at 20:12
  • In some ways the effect of not using `vzeroupper` on newer CPUs can be _much worse_ due to the effect of merging uops and [implicit widening](https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html#fn:vz) (that's the thing that was alluded to in the comments that Peter linked; see the usage sketch after these comments). – BeeOnRope Jan 24 '20 at 20:22
  • @BeeOnRope: The mechanism for turbo reduction was widening 128-bit SSE ops to 512 bit for merging? Not just from the dirty upper just sitting there in the register file while running pure integer code? I think I forgot that detail at some point after that, but that makes more sense given that zmm16..31 are safe to leave dirty, and xmm/ymm16..31 can be used via AVX512VL without hurting turbo. That's all there in an explanation I quoted from you on [Is it useful to use VZEROUPPER if your program+libraries contain no SSE instructions?](//stackoverflow.com/q/49019614) :P – Peter Cordes Jan 24 '20 at 22:59
  • @PeterCordes - correct, anything using the SIMD registers gets widened to the "dirty bit width" (could be 256 or 512, depending on the type of dirtying instruction). This includes scalar SSE FP. If you just run integer code (and avoid things like `rep movsb` - a whole other topic), you won't suffer the effect and will eventually get back to the L0 license. – BeeOnRope Jan 25 '20 at 02:40
  • The difference between the "high" 16-31 and "low" 0-15 registers seems to be this: dirtying only occurs through the low registers: the CPU does not enter the dirty-upper state if you only write the upper registers. However, once you are in the dirty state, all registers are affected, including the upper registers. This is a little bit inconsistent with my original theory. My original theory was that the implicit widening wasn't (just?) a merging effect, because it happened for VEX-encoded AVX instructions that don't do any merging. – BeeOnRope Jan 25 '20 at 02:46
  • So I thought it was just the "zero extending" (distinct from merging) that those instructions do: effectively _every_ 128-bit or 256-bit instruction is _really_ a 512-bit instruction since it sets the upper bits to zero: simpler than merging, but still affecting all 512 bits. If that was true, however, why wouldn't you get this effect for dirtying upper registers? They suffer the same problem. – BeeOnRope Jan 25 '20 at 02:48
  • Since they don't reduce turbo, I'm back to assuming that this is handled efficiently in the register file or something, e.g., a 128-bit or 256-bit VEX op just uses the 256-bit paths to the ALU and then sets a bit in the result indicating the size, the rest assumed to be zero. There seems to be stuff like this already to support similar stuff "for free" in the scalar regs. So I guess the problem with the lower registers really is related to merging (perhaps the merging happens in the ALU and every uop gets an additional hidden input for the destination). – BeeOnRope Jan 25 '20 at 02:52
  • Somehow that also slows down the VEX-encoded ops, maybe because the upper lanes are powered up in this scenario (ready to handle merging for non-VEX?), and it also affects the upper registers. – BeeOnRope Jan 25 '20 at 02:54
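Since several of the comments above turn on how `vzeroupper` is actually used, here is a small usage sketch (the function name is mine; compile with -mavx2). The pattern: do the 256-bit work, then execute `vzeroupper` before returning to code that may run legacy-SSE instructions:

#include <immintrin.h>

/* scale_by_2 is a hypothetical example. After the 256-bit loop, the
   upper halves of ymm0..ymm15 are "dirty"; _mm256_zeroupper() marks
   them clean, so later legacy-SSE code avoids the false-dependency /
   merging penalties discussed above. */
void scale_by_2(float *p, int n) {
    __m256 k = _mm256_set1_ps(2.0f);
    for (int i = 0; i + 8 <= n; i += 8)      /* tail elements ignored */
        _mm256_storeu_ps(p + i, _mm256_mul_ps(_mm256_loadu_ps(p + i), k));
    _mm256_zeroupper();   /* uppers of ymm0..ymm15 now known-clean */
}

(Compilers targeting AVX normally insert vzeroupper at function exits on their own; the explicit intrinsic is just to show where it belongs. And per the comments, dirtying only ymm/zmm16..31 doesn't create this particular problem, since legacy SSE can't touch those registers.)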

1 Answer


It seems it was a description issue: if you look at the latest SDM, you will see that the description was recently changed and now says that VZEROALL does not change YMM16..YMM31.

Intel's latest SDM (Oct 2019)

Matt. Stroh
  • Thanks! I did check my SDM copy, which I usually keep pretty up to date, but in this case not up to date enough. – BeeOnRope Jan 27 '20 at 21:50
  • I googled a bit, and I think that thanks to your Q I found a bug in LLVM where they implemented VZEROALL to zero all YMM registers, including YMM16..YMM31 - http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20170130/426045.html – Matt. Stroh Jan 27 '20 at 21:54
  • @Matt.Stroh: that wrong change either never made it in, or has since been reverted. Current clang 9.0 will use `ymm16` to save a `__m256` around `_mm256_zeroall()`: https://godbolt.org/z/HK7_Xy. That only makes sense if it knows that zeroall doesn't touch ymm16. clang 3.9.1 does spill to memory, so maybe the change was in for that version, or maybe it just doesn't optimize as efficiently. Hmm, clang (3.9 and current) doesn't know that a `__m128` can be left in xmm0 across `_mm256_zeroupper()`: https://godbolt.org/z/DwMyMV – Peter Cordes Jan 28 '20 at 03:18
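To illustrate the godbolt observation in the comment above, here is a minimal reconstruction (the function name is mine, not the exact code from the links). Built with AVX512VL enabled (e.g. -march=skylake-avx512), clang 9 can park `v` in ymm16 across the zeroall instead of spilling to memory - which is only legal because vzeroall leaves ymm16..ymm31 alone:

#include <immintrin.h>

/* Hypothetical example: clang 9 moves v into ymm16 before the
   vzeroall and back into ymm0 afterwards; a compiler that believed
   vzeroall clears *all* registers would have to spill v to the
   stack instead. */
__m256 keep_across_zeroall(__m256 v) {
    _mm256_zeroall();   /* zeroes only ymm0..ymm15 */
    return v;
}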