Writing a ZMM register can leave a Skylake-X (or similar) CPU in a state of reduced max-turbo indefinitely (see *SIMD instructions lowering CPU frequency* and *Dynamically determining where a rogue AVX-512 instruction is executing*). Presumably Ice Lake is similar.
(Workaround: not a problem for zmm16..31, according to @BeeOnRope's comments which I quoted in *Is it useful to use VZEROUPPER if your program+libraries contain no SSE instructions?* So this strlen could just use `vpxord xmm16,xmm16,xmm16` and `vpcmpeqb` with zmm16; see the sketch below.)
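A minimal sketch of that variant (my own code, not from those comments; it assumes AVX512VL for the EVEX 128-bit zeroing idiom, and only zmm16 is ever touched):

vpxord   xmm16, xmm16, xmm16   ; zmm16 = 0 (xmm16..31 are EVEX-only anyway)
vpcmpeqb k0, zmm16, [rdi]      ; 512-bit load+compare; only zmm16 read
kmovq    rax, k0
tzcnt    rax, rax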
How to test this if you have hardware:
@BeeOnRope posted test code in an RWT thread: replace `vbroadcastsd zmm15, [zero_dp]` with `vpcmpeqb k0, zmm0, [rdi]` as the "dirtying" instruction and see if the loop after that runs slow or fast.
I assume executing any 512-bit uop will trigger reduced turbo temporarily (along with shutting down port 1 for vector ALU uops while the 512-bit uop is actually in the back-end), but the question is: will the CPU recover on its own if you never use `vzeroupper` after just reading a ZMM register?
(And/or will later SSE or AVX instructions have transition penalties or false dependencies?)
Specifically, does a `strlen` using insns like this need a `vzeroupper` before returning? (In practice on any real CPU, and/or as documented by Intel for future-proof best practices.) Assume that later instructions may include non-VEX SSE and/or VEX-encoded AVX1/2, not just GP integer, in case that's relevant to a dirty-upper-256 situation keeping turbo reduced.
; check 64 bytes for zero, strlen building block.
vpxor    xmm0, xmm0, xmm0   ; zmm0 = 0 using AVX1 implicit zero-extension
vpcmpeqb k0, zmm0, [rdi]    ; 512-bit load + ALU, not micro-fused
;kortestq k0,k0 / jnz or whatever
kmovq    rax, k0            ; mask of zero-byte positions -> integer reg
tzcnt    rax, rax           ; index of first zero byte in this 64B block
;vzeroupper before lots of code that goes a long time before another 512-bit uop?
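For context, a hedged sketch of how this block might sit in a full strlen (my framing, not from the linked question; it assumes rdi is 64-byte aligned so a 64-byte load can't cross into an unmapped page past the terminator, which real code would guarantee with an alignment prologue):

strlen64:                          ; hypothetical name
    vpxor    xmm0, xmm0, xmm0      ; zmm0 = 0; VEX encoding, no ZMM write
    mov      rax, rdi
.loop:
    vpcmpeqb k0, zmm0, [rax]       ; mask of zero bytes in this 64B block
    kortestq k0, k0
    jnz      .found
    add      rax, 64
    jmp      .loop
.found:
    kmovq    rcx, k0
    tzcnt    rcx, rcx              ; offset of first zero within the block
    sub      rax, rdi              ; start of block, relative to string
    add      rax, rcx              ; total length
    ; vzeroupper here?  That's exactly the question.
    ret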
(Inspired by the strlen in *AVX512BW: handle 64-bit mask in 32-bit code with bsf / tzcnt?*, which would look like this if zeroing its vector reg were properly optimized to use a shorter VEX-encoded instruction instead of an EVEX one.)
The key instruction is the `vpcmpeqb k0, zmm0, [rdi]`, which decodes on SKX or CNL to 2 separate uops (not micro-fused: retire-slots = 2.0): a 512-bit load (into a 512-bit physical register?) and an ALU compare into a mask register.
But no architectural ZMM register is ever written explicitly, only read. So presumably at least an `xsave`/`xrstor` would clear any "dirty upper" condition, if one exists after this. (That won't happen on Linux unless there's an actual context switch to a different user-space process on that core, or the thread migrates; merely entering the kernel for interrupts won't cause it. So this is still testable under a mainstream OS, if you have the hardware; I don't.)
Possibilities I can imagine for SKX/CNL, and/or Ice Lake:
- No long-term effect: max turbo recovers just as quickly as with `vzeroupper`.
- Max turbo limited to 512-bit speed until a context switch (`xrstor` or equivalent clears any dirty-upper state flag, because the architectural regs are clean).
- Max turbo limited to 512-bit speed even across context switches, just like if you'd run `vaddps zmm0,zmm0,zmm0`. (The dirty-upper flag is saved and restored with the architectural state.) Plausible because `xsaveopt` does skip saving the upper 128 or 256 bits of vector regs if it's known they're clean.
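One way to separate the last two cases (a sketch under my own assumptions: Linux x86-64, where nanosleep is syscall 35, and the sleep actually lets another process's state get restored on this core, which isn't guaranteed):

vpcmpeqb k0, zmm0, [rdi]      ; the 512-bit read under test
lea      rdi, [rel ts]        ; sleep 1 sec so the kernel likely runs
xor      esi, esi             ; something else here, forcing xsave/xrstor
mov      eax, 35              ; __NR_nanosleep on x86-64 Linux
syscall
; ... then re-run the timed loop from earlier and compare speeds

section .data
ts: dq 1, 0                   ; struct timespec { 1 sec, 0 nsec }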
I assume `kmovq` won't reduce max turbo or trigger any of the other 512-bit-uop effects. The upper 32 bits of mask registers normally only come into play with AVX512BW for 64-byte vectors, but presumably the top 32 bits of mask regs aren't power-gated separately, only the top 32 bytes of vector regs. There are use-cases like using `kshift` or `kunpck` to deal with 64-bit chunks of masks (for load/store or transfer to integer regs) even if you only ever generate or use them 32 bits at a time with AVX512VL on YMM or XMM regs; see the sketch below.
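For example (my illustration, not code from the question), two AVX512VL+BW compares on YMM regs can still feed a full 64-bit mask, with no 512-bit uops involved:

vpxor    xmm0, xmm0, xmm0
vpcmpeqb k1, ymm0, [rdi]       ; low 32 bytes  -> 32-bit mask in k1
vpcmpeqb k2, ymm0, [rdi+32]    ; high 32 bytes -> 32-bit mask in k2
kunpckdq k0, k2, k1            ; k0 = k2[31:0]:k1[31:0], a 64-bit mask
kmovq    rax, k0               ; all 64 mask bits at once
tzcnt    rax, rax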
PS: Xeon Phi is not subject to these effects; it isn't built to clock higher for non-AVX512 code, because it's made to run AVX-512. And in fact `vzeroupper` is very slow and not recommended on KNL/KNM.
The fact that my example uses AVX512BW isn't really relevant to the question: all mainstream (non-Xeon-Phi) CPUs with AVX-512 have AVX512BW, and it makes a nice real use-case. That AVX512BW excludes KNL doesn't matter here.