Writing a ZMM register can leave a Skylake-X (or similar) CPU in a state of reduced max-turbo indefinitely (see *SIMD instructions lowering CPU frequency* and *Dynamically determining where a rogue AVX-512 instruction is executing*). Presumably Ice Lake is similar.
(Workaround: not a problem for zmm16..31, according to @BeeOnRope's comments which I quoted in *Is it useful to use VZEROUPPER if your program+libraries contain no SSE instructions?* So this strlen could just use `vpxord xmm16,xmm16,xmm16` and `vpcmpeqb` with zmm16; see the sketch below.)
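A minimal sketch of that variant (my own code, not from those comments; it assumes AVX512VL for the EVEX 128-bit zeroing idiom, and only zmm16 is ever touched):

vpxord   xmm16, xmm16, xmm16   ; zmm16 = 0 (xmm16..31 are EVEX-only anyway)
vpcmpeqb k0, zmm16, [rdi]      ; 512-bit load+compare; only zmm16 read
kmovq    rax, k0
tzcnt    rax, rax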
How to test this if you have hardware:
@BeeOnRope posted test code in an RWT thread: replace `vbroadcastsd zmm15, [zero_dp]` with `vpcmpeqb k0, zmm0, [rdi]` as the "dirtying" instruction and see if the loop after that runs slow or fast.
I assume executing any 512-bit uop will trigger reduced turbo temporarily (along with shutting down port 1 for vector ALU uops while the 512-bit uop is actually in the back-end), but the question is: will the CPU recover on its own if you never use `vzeroupper` after just reading a ZMM register?
(And/or will later SSE or AVX instructions have transition penalties or false dependencies?)
Specifically, does a `strlen` using insns like this need a `vzeroupper` before returning? (In practice on any real CPU, and/or as documented by Intel for future-proof best practices.) Assume that later instructions may include non-VEX SSE and/or VEX-encoded AVX1/2, not just GP integer, in case that's relevant to a dirty-upper-256 situation keeping turbo reduced.
; check 64 bytes for zero, strlen building block.
vpxor    xmm0, xmm0, xmm0   ; zmm0 = 0 using AVX1 implicit zero-extension
vpcmpeqb k0, zmm0, [rdi]    ; 512-bit load + ALU, not micro-fused
;kortestq k0,k0 / jnz or whatever
kmovq    rax, k0            ; mask of zero-byte positions -> integer reg
tzcnt    rax, rax           ; index of first zero byte in this 64B block
;vzeroupper before lots of code that goes a long time before another 512-bit uop?
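For context, a hedged sketch of how this block might sit in a full strlen (my framing, not from the linked question; it assumes rdi is 64-byte aligned so a 64-byte load can't cross into an unmapped page past the terminator, which real code would guarantee with an alignment prologue):

strlen64:                          ; hypothetical name
    vpxor    xmm0, xmm0, xmm0      ; zmm0 = 0; VEX encoding, no ZMM write
    mov      rax, rdi
.loop:
    vpcmpeqb k0, zmm0, [rax]       ; mask of zero bytes in this 64B block
    kortestq k0, k0
    jnz      .found
    add      rax, 64
    jmp      .loop
.found:
    kmovq    rcx, k0
    tzcnt    rcx, rcx              ; offset of first zero within the block
    sub      rax, rdi              ; start of block, relative to string
    add      rax, rcx              ; total length
    ; vzeroupper here?  That's exactly the question.
    ret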
(Inspired by the strlen in *AVX512BW: handle 64-bit mask in 32-bit code with bsf / tzcnt?*, which would look like this if zeroing its vector reg were properly optimized to use a shorter VEX-encoded instruction instead of an EVEX one.)
The key instruction is the `vpcmpeqb k0, zmm0, [rdi]`, which decodes on SKX or CNL to 2 separate uops (not micro-fused: retire-slots = 2.0): a 512-bit load (into a 512-bit physical register?) and an ALU compare into a mask register.
But no architectural ZMM register is ever written explicitly, only read. So presumably at least an `xsave`/`xrstor` would clear any "dirty upper" condition, if one exists after this. (That won't happen on Linux unless there's an actual context switch to a different user-space process on that core, or the thread migrates; merely entering the kernel for interrupts won't cause it. So this is still testable under a mainstream OS, if you have the hardware; I don't.)
Possibilities I can imagine for SKX/CNL, and/or Ice Lake:
- No long-term effect: max turbo recovers just as quickly as with `vzeroupper`.
- Max turbo limited to 512-bit speed until a context switch (`xrstor` or equivalent clears any dirty-upper state flag, because the architectural regs are clean).
- Max turbo limited to 512-bit speed even across context switches, just like if you'd run `vaddps zmm0,zmm0,zmm0`. (The dirty-upper flag is saved and restored with the architectural state.) Plausible because `xsaveopt` does skip saving the upper 128 or 256 bits of vector regs if it's known they're clean.
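One way to separate the last two cases (a sketch under my own assumptions: Linux x86-64, where nanosleep is syscall 35, and the sleep actually lets another process's state get restored on this core, which isn't guaranteed):

vpcmpeqb k0, zmm0, [rdi]      ; the 512-bit read under test
lea      rdi, [rel ts]        ; sleep 1 sec so the kernel likely runs
xor      esi, esi             ; something else here, forcing xsave/xrstor
mov      eax, 35              ; __NR_nanosleep on x86-64 Linux
syscall
; ... then re-run the timed loop from earlier and compare speeds

section .data
ts: dq 1, 0                   ; struct timespec { 1 sec, 0 nsec }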
I assume `kmovq` won't reduce max turbo or trigger any of the other 512-bit-uop effects. The upper 32 bits of mask registers normally only come into play with AVX512BW for 64-byte vectors, but presumably the top 32 bits of mask regs aren't power-gated separately, only the top 32 bytes of vector regs. There are use-cases like using `kshift` or `kunpck` to deal with 64-bit chunks of masks (for load/store or transfer to integer regs) even if you only ever generate or use them 32 bits at a time with AVX512VL on YMM or XMM regs; see the sketch below.
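For example (my illustration, not code from the question), two AVX512VL+BW compares on YMM regs can still feed a full 64-bit mask, with no 512-bit uops involved:

vpxor    xmm0, xmm0, xmm0
vpcmpeqb k1, ymm0, [rdi]       ; low 32 bytes  -> 32-bit mask in k1
vpcmpeqb k2, ymm0, [rdi+32]    ; high 32 bytes -> 32-bit mask in k2
kunpckdq k0, k2, k1            ; k0 = k2[31:0]:k1[31:0], a 64-bit mask
kmovq    rax, k0               ; all 64 mask bits at once
tzcnt    rax, rax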
PS: Xeon Phi is not subject to these effects; it isn't built to clock higher for non-AVX512 code, because it's made to run AVX-512. And in fact `vzeroupper` is very slow and not recommended on KNL/KNM.
The fact that my example uses AVX512BW isn't really relevant to the question: all mainstream (non-Xeon-Phi) CPUs with AVX-512 have AVX512BW, and it makes a nice real use-case. That AVX512BW excludes KNL doesn't matter here.