
I read this article, which talks about why AVX-512 instructions can slow things down:

Intel’s latest processors have advanced instructions (AVX-512) that may cause the core, or maybe the rest of the CPU to run slower because of how much power they use.

I think Agner's blog also mentioned something similar (but I can't find the exact post).

I wonder what other instructions supported by Skylake have a similar effect, lowering the clock frequency to stay within the power budget. All the v-prefixed instructions (such as vmovapd, vmulpd, vaddpd, vsubpd, vfmadd213pd)?

I am trying to compile a list of instructions to avoid when compiling my C++ application for Xeon Skylake.

BeeOnRope
HCSF
  • `instructions to avoid` in order to accomplish what exactly? – 500 - Internal Server Error Jul 02 '19 at 12:49
  • @500-InternalServerError in order to avoid jitter in the system. Think of a laser arm getting jitter. – HCSF Jul 02 '19 at 12:51
  • 2
    Trevis Down (aka Beeonrope on OS) wrote about this in the comments in this [post](https://lemire.me/blog/2018/08/13/the-dangers-of-avx-512-throttling-myth-or-reality/) and continued the discussion [here](https://www.realworldtech.com/forum/?threadid=179654&curpostid=179727). He found that each ties (scalar, AVX/AVX2, AVX-512) has "cheap" (no FP, simple operations) instructions and "heavy" instruction. Cheap instructions drop the frequency to the one of the next higher tier (e.g. cheap AVX-512 inst use the AVX/AVX2 tier) even if used sparsely. Heavy inst must be used more than 1 every ... – Margaret Bloom Jul 02 '19 at 13:06
  • 2
    ... two cycles and drop the frequency according to their tier (e.g. AVX-512 heavy instrs drop the frequency to the AV-512 base). Travis also shared the code he used to test [here](https://github.com/travisdowns/avx-turbo). You can find the behaviour of each instruction with a bit of patience or by his rule of thumb. Finally note that this frequency scaling is a problem iif the ratio of vector to scalar instruction is low enough so that the drop in frequency is not balanced by the bigger width at which data is processed. Check the final binary to see if you really gained anything. – Margaret Bloom Jul 02 '19 at 13:10
  • @MargaretBloom thanks for sharing your thoughts and all the links. I also read BeeOnRope's [post](https://stackoverflow.com/questions/41303780/why-is-this-sse-code-6-times-slower-without-vzeroupper-on-skylake/41349852#41349852) about the penalty in ld. Given my ld is very old, I think it is best for me to avoid AVX and AVX-512 related instructions. And as you pointed out, the ratio of vector to scalar is also important. Given I write high-level C++ code, it is hard to figure out the ratio unless I check the assembly output each time, slowing down development... – HCSF Jul 02 '19 at 14:33
  • 1
    @HCSF You can make three builds, one without AVX, one with AVX/AVX2 and one with AVX-512 (if applicable) and profile them. Then take the fastest one. – Margaret Bloom Jul 02 '19 at 14:52
  • @HCSF - you can avoid the ld-related penalty by issuing a `vzeroupper` at the start of your program. – BeeOnRope Jul 03 '19 at 00:03
  • @BeeOnRope Based on your answer, is there any way to tell GCC not to generate any AVX-512 and AVX-256-heavy instructions, while all other instructions are okay? – HCSF Jul 03 '19 at 06:54
  • 1
    Peter mentioned the `-mpreferred-vector-width=256` option. I don't know if it prevents gcc from _ever_ producing AVX-512 instructions (outside of direct intrinsic use), but it is certainly possible. I am not aware of any option which distinguishes between "heavy" and "light" instructions however. Usually this isn't a problem, since if you turn off AVX-512 and don't have a bunch of FP ops, you are probably targeting L0 anyways, and AVX-512 light is still L1. – BeeOnRope Jul 03 '19 at 06:57
  • Try those options and then check if any L1/L2 instructions pop up using the performance counter events for L1 and L2 licenses. – BeeOnRope Jul 03 '19 at 06:57
  • I can try it now. But is there a way to check whether L1/L2 instructions are in the binary? – HCSF Jul 03 '19 at 06:59
  • I tried to compile with `-march=skylake-avx512 -mtune=skylake-avx512 -mprefer-vector-width=128`, then decompiled with `objdump -d mybinary > binary.asm`, and then ran `grep -i ymm binary.asm`. I guess it is safe to conclude that it doesn't use any 256- and 512-bit registers and so no AVX-256 and AVX-512 instructions are emitted? @BeeOnRope Though, I still see many `vzeroupper` instructions. I thought they were only used with ymm registers. No? – HCSF Jul 03 '19 at 07:34
  • Yeah, that's a reasonable way to check the binary. Keep in mind that at runtime you'll likely use other libraries, at a minimum libc - and these have 256-bit instructions, e.g. in their memcpy implementation. So you really have to do a runtime check to be sure you aren't executing any "forbidden" instructions. I don't think the 256b instructions in libc are likely to be a problem wrt the licenses, since they are light. – BeeOnRope Jul 03 '19 at 14:30
  • Yeah, vzeroupper makes more sense after using ymm registers, to avoid transition penalties for a "dirty upper", and probably isn't needed for xmm-only code. I think there is a flag to turn its emission off. – BeeOnRope Jul 03 '19 at 14:31
  • @BeeOnRope you brought up an interesting point -- "other libraries, at a minimum libc - and these have 256-bit instructions". I thought most libraries that come with Linux distros were not compiled for a specific x86 CPU, and some x86 CPUs don't have AVX-256 support, so a library like libc shouldn't have any 256-bit instructions. No? – HCSF Jul 03 '19 at 15:06
  • 1
    @HCSF important routines in libc are generally compiled multiple times for different ISAs and then the version appropriate for the current CPU is selected at runtime using the dynamic loader's IFUNC capability. So you'll usually get a version optimized for your CPU (unless your libc is quite old and your CPU quite new). – BeeOnRope Jul 04 '19 at 00:16

2 Answers


On Intel chips, the frequency impact and the specific frequency transition behavior depends on both the width of the operation and the specific instruction used.

As far as instruction-related frequency limits go, there are three frequency levels – so-called licenses – from fastest to slowest: L0, L1 and L2. L0 is the "nominal" speed you'll see written on the box: when the chip says "3.5 GHz turbo", they are referring to the single-core L0 turbo. L1 is a lower speed sometimes called AVX turbo or AVX2 turbo [5], originally associated with AVX and AVX2 instructions [1]. L2 is a lower speed than L1, sometimes called "AVX-512 turbo".

The exact speeds for each license also depend on the number of active cores. For up-to-date tables, you can usually consult WikiChip. For example, the table for the Xeon Gold 5120 is here:

[Table: Xeon Gold 5120 turbo frequencies per license and active-core count]

The Normal, AVX2 and AVX512 rows correspond to the L0, L1 and L2 licenses respectively. Note that the relative slowdown for the L1 and L2 licenses generally gets worse as the number of active cores increases: for 1 or 2 active cores the L1 and L2 speeds are 97% and 91% of L0, but for 13 or 14 cores they are 85% and 62% respectively. This varies by chip, but the general trend is usually the same.

Those preliminaries out of the way, let's get to what I think you are asking: which instructions cause which licenses to be activated?

Here's a table, showing the implied license for instructions based on their width and their categorization as light or heavy:

   Width    Light   Heavy  
 --------- ------- ------- 
  Scalar    L0      N/A
  128-bit   L0      L0     
  256-bit   L0      L1*    
  512-bit   L1      L2*

*soft transition (see below)

So we immediately see that all scalar (non-SIMD) instructions and all 128-bit wide instructions [2] always run at full speed in the L0 license.

256-bit instructions will run in L0 or L1, depending on whether they are light or heavy, and 512-bit instructions will run in L1 or L2 on the same basis.

So what is this light and heavy thing?

Light vs Heavy

It's easiest to start by explaining heavy instructions.

Heavy instructions are all SIMD instructions that need to run on the FP/FMA unit. That's the majority of the FP instructions (those usually ending in ps or pd, like addpd), as well as the integer multiplication instructions (which largely start with vpmul or vpmadd), since SIMD integer multiplication actually runs on the FMA unit, and also vplzcnt(q|d), which apparently runs on the FMA unit too.

Given that, light instructions are everything else. In particular, integer arithmetic other than multiplication, logical instructions, shuffles/blends (including FP) and SIMD loads and stores are light.

Transitions

The L1 and L2 entries in the Heavy column are marked with an asterisk, like L1*. That's because these instructions cause a soft transition when they occur. The other L1 entry (for 512-bit light instructions) causes a hard transition. Here we'll discuss the two transition types.

Hard Transition

A hard transition occurs immediately, as soon as any instruction with the given license executes [4]. The CPU stops, takes some halt cycles and enters the new mode.

Soft Transition

Unlike hard transitions, a soft transition doesn't occur immediately as soon as any instruction is executed. Rather, the instructions initially execute with a reduced throughput (as slow as 1/4 their normal rate), without changing the frequency. If the CPU decides that "enough" heavy instructions are executing per unit time, and a specific threshold is reached, a transition to the higher-numbered license occurs.

That is, the CPU understands that if only a few heavy instructions arrive, or even if many arrive but they aren't dense relative to the surrounding non-heavy instructions, it may not be worth reducing the frequency.

Guidelines

Given the above, we can establish some reasonable guidelines. You never have to be scared of 128-bit instructions, since they never cause license-related [3] downclocking.

Furthermore, you never have to worry about light 256-bit wide instructions either, since they also don't cause downclocking. If you aren't using a lot of vectorized FP math, you aren't likely to be using heavy instructions, so this applies to you. Indeed, compilers already liberally insert 256-bit instructions when you use the appropriate -march option, especially for data movement and auto-vectorized loops.

Using heavy AVX/AVX2 instructions and light AVX-512 instructions is trickier, because you will run in the L1 license. If only a small part of your process (say 10%) can take advantage, it probably isn't worth slowing down the rest of your application. The penalties associated with L1 are generally moderate - but check the details for your chip.

Using heavy AVX-512 instructions is even trickier, because the L2 license comes with serious frequency penalties on most chips. On the other hand, it is important to note that only FP and integer multiply instructions fall into the heavy category, so as a practical matter a lot of integer 512-bit wide use will only incur the L1 license.


[1] Although, as we'll see, this is a bit of a misnomer, because AVX-512 instructions can also set the speed to this license, and some AVX/AVX2 instructions don't.

[2] 128-bit wide means using xmm registers, regardless of what instruction set the instructions were introduced in - mainstream AVX-512 contains 128-bit variants for most/all new instructions.

[3] Note the weasel clause license-related - you may certainly suffer other causes of downclocking, such as thermal, power or current limits, and it is possible that 128-bit instructions could trigger these, but I think it is fairly unlikely on a desktop or server system (low-power, small-form-factor devices are another matter).

[4] Evidently, we are talking only about transitions to a higher-numbered license, e.g., from L0 to L1 when a hard-transition L1 instruction executes. If you are already in L1 or L2, nothing happens: there is no transition if you are already at the same level, and you don't transition to a lower-numbered level because of any specific instruction, but rather by running for a certain time without any instructions of the higher-numbered level.

[5] Of the two names, AVX2 turbo is the more common, which I never really understood: 256-bit instructions are as much associated with AVX as with AVX2, and most of the heavy instructions which actually trigger AVX turbo (the L1 license) are FP instructions from AVX, not AVX2. The main exception is AVX2 integer multiplies.

BeeOnRope
  • Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackoverflow.com/rooms/195899/discussion-on-answer-by-beeonrope-instructions-lowering-the-power). – Samuel Liew Jul 03 '19 at 05:35
  • Interesting. `vplzcntd/q` on the FMA unit makes sense, though: it needs bit-scan hardware to renormalize the results of FP math by finding the MSB of the significand result. – Peter Cordes Jul 13 '19 at 17:17
  • @PeterCordes - yeah I saw this [here](https://twitter.com/InstLatX64/status/1150083791484063745) which links a comprehensive test for all AVX-512 instructions. There is something weird about it though, as described in the comments on that tweet: although the 256-bit version is clearly "heavy", the 512-bit version seems to be mostly light according to this test. However, the test may simply not be triggering L2 because the instructions aren't dense enough. – BeeOnRope Jul 13 '19 at 17:27
  • Interestingly, the dumps pointed by the Twitter post seem to suggest that *all* integer multiplies are actually 'light', except for `VPMULLD` - am I reading it right? – zinga Jul 26 '19 at 11:32
  • When did these licenses first appear? I don't remember this issue with Sandy Bridge or Ivy Bridge. Did it exist with Haswell? Maybe "AVX2 turbo" is used because no AVX-only system had separate SSE and AVX frequencies? – Z boson Aug 02 '19 at 07:32
  • I don't think the kind of instruction alone is a sufficient metric for determining the license (frequency level). I wrote code to indirectly measure the frequency. I noticed that for AVX and AVX-512 the frequency only scaled down if the ports had sufficient load. For example, if you have a dependency chain which is latency bound and therefore only does one AVX-512 FMA every 5 clock cycles (or whatever the latency of FMA is), then the frequency does not scale down, i.e. it stays in license L0. See the update to my answer here https://stackoverflow.com/a/25400230/2542702 – Z boson Aug 02 '19 at 07:50
  • @Zboson: That's the difference between "hard" and "soft" transitions described above. Your results seem to show the AVX512 running at L1 or L2 depending on load, not L0. – zinga Aug 02 '19 at 09:47
  • @zinga, you're right, I should have read the answer more carefully. I got confused by heavy vs. hard and light vs. soft. "the CPU understands that if only a few heavy instructions arrive, or even if many arrive but they aren't dense when considering other non-heavy instructions, it may not be worth reducing the frequency." – Z boson Aug 02 '19 at 10:47
  • @Zboson - yeah, it would be interesting to explore exactly when the transition takes place. I think @ Mysticial has said that the transition function isn't *that* smart, i.e., that it decides to make a transition to a slower speed even when the steady-state code will objectively run slower after the transition (e.g., code with a 50/50 mix of FMA and non-FMA would be better off not transitioning - since you only need ~1 FMA, you could stay in the faster license - but instead it transitions). – BeeOnRope Aug 02 '19 at 16:10
  • 1
    @Zboson - I think it first showed up in the Haswell server chips, i.e., Haswell-EP or whatever it was called. The name AVX2 turbo speed never made much sense to me: it mostly affects FP instructions from the AVX set, not AVX2 which was mostly integer (integer mul is an exception). Intel themselves use AVX, not AVX2 in [early documents](https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e5-v3-spec-update.pdf). People seem to like to call it AVX2 though, maybe because it came out in Haswell where AVX2 was the new ISA? – BeeOnRope Aug 02 '19 at 16:17
  • 1
    @BeeOnRope Might be changing for Saphire Rapids. There don't appear to be any license transition [events](https://download.01.org/perfmon/SPR/sapphirerapids_core_v1.00.json) anymore. – Noah Apr 24 '22 at 00:26
  • @Noah these licences don't exist anymore? – user997112 Jul 04 '23 at 00:58
  • @user997112 not AFAICT (the old link is dead, they moved the files), but I still don't see any license transition events for SPR [on their github page](https://github.com/intel/perfmon/blob/main/SPR/events/sapphirerapids_core.json). I don't have an SPR machine on hand so can't test. – Noah Jul 04 '23 at 01:08
  • @user997112 also FWIW, I know we plan to enable AVX-512 by default for SPR in glibc because the frequency throttling isn't a concern, although that's only for the "light" instructions. – Noah Jul 04 '23 at 02:58
  • @Noah - that's pretty interesting, though it's not 100% clear whether the events disappearing means that the licenses are truly gone. – BeeOnRope Jul 26 '23 at 07:08

It's not the instruction mnemonic that matters; it's using 512-bit vector width at all that matters.

You can use the 256-bit version of AVX-512VL instructions, e.g. vpternlogd ymm0, ymm1, ymm2 without incurring the AVX-512 turbo penalty.

Related: Dynamically determining where a rogue AVX-512 instruction is executing is about a case where one AVX-512 instruction in glibc init code or something left a dirty upper ZMM that gimped max turbo for the rest of the process lifetime. (Or until a vzeroupper maybe)

There can also be other turbo impacts from light/heavy use of 256-bit FP math instructions, some of which are due to heat. But usually 256-bit is worth it on modern CPUs.

Anyway, this is why gcc -march=skylake-avx512 defaults to -mprefer-vector-width=256. For any given workload, it's worth trying -mprefer-vector-width=512 and maybe also 128, depending on how much or how little of the work can usefully auto-vectorize.

Tell GCC to tune for your CPU (e.g. -march=native) and it will hopefully make good choices. Although on a desktop Skylake-X, the turbo penalty is smaller than a Xeon. And if your code does actually benefit from 512-bit vectorization, it can be worth it to pay the penalty.

(Also beware the other major effect of Skylake-family CPUs going into 512-bit vector mode: the vector ALUs on port 1 shut down, so only scalar instructions like popcnt or add can use port 1. So vpand and vpaddb etc. throughput drops from 3 to 2 per clock. And if you're on an SKX with two 512-bit FMA units, the extra one on port 5 powers up, so then FMAs compete with shuffles.)

Peter Cordes
  • I have been using `-march=generic` for a long time for my binary. So I think even `-march=skylake-avx512 -mprefer-vector-width=128` would make some optimizations kick in without the heavy penalty from using AVX-256 (as I ask for 128). Thoughts? – HCSF Jul 03 '19 at 07:19
  • @HCSF: Well sure, skylake + width=128 should be strictly better than generic for running on SKX. GCC *could* do worse if it bloats the code-size with AVX512 EVEX-encoded instructions unnecessarily (e.g. `vmovdqu64 xmm` instead of `vmovdqu xmm`, when not using xmm16..31), and generally compare-into-mask should be good vs. the SSE/AVX way of compare-into-vector and blend. **But you should definitely test with the default width=256, too, in case the turbo penalty is worth it for your code.** Doing twice as much work per uop is very good, and the big penalties only kick in with 512-bit vectors. – Peter Cordes Jul 03 '19 at 07:34
  • 1
    I actually see what you just mentioned -- `vmovdqu64 (%rdx),%xmm0`, `vmovdqu64 0x10(%rsi),%xmm6`, etc when I compiled with `-march=skylake-avx512 -mprefer-vector-width=128`. It seems like GCC 8.2 isn't doing it right (or not what you expected)? – HCSF Jul 03 '19 at 07:39
  • @HCSF: Yes, that's a missed optimization in GCC that hurts code size, but otherwise isn't a problem. If GCC isn't getting any benefit from AVX512 features like more registers or masking, or new instructions like `vpternlogd xmm`, then try `-mno-avx512f` as well to see if the code-size effect makes a difference. But most instructions have a SIMD element size, so there's no separate mnemonic for the EVEX version that allows per-element masking. Thus the assembler can assemble `vpaddd %xmm` to the VEX version, and GCC can't shoot itself in the foot. (except by using xmm16..31) – Peter Cordes Jul 03 '19 at 07:43
  • I tried `-march=skylake-avx512 -mprefer-vector-width=128 -mno-avx512f` and it doesn't even change the size of my binary by 1 byte (I used the `strip` command to remove text stuff first) – HCSF Jul 03 '19 at 08:03
  • @HCSF: It might slightly change code layout inside some functions; function entry points are still padded to 16 bytes. But yeah if you don't have a lot of vector `mov` instructions or other cases for this missed optimization, GCC's `.p2align` directives are going to pad that space back out unless you happen to shrink across an alignment boundary. So no large-scale L1i cache pressure effect, and probably no uop-cache or other front-end effect either. Actually it might be just `vmovdq[au]64` where this happens: gcc still uses `vpand` not `vpandq` https://gcc.godbolt.org/z/j_qysC – Peter Cordes Jul 03 '19 at 08:10
  • I actually `diff` between the assembly code of the two binaries. And you are right that with `-mno-avx512f`, `vmovdqu64` isn't used anymore. I guess it is better to set `-mno-avx512f -mno-avx512pf -mno-avx512er -mno-avx512cd -mno-avx512vl -mno-avx512bw -mno-avx512dq -mno-avx512ifma -mno-avx512vbmi -mno-avx512vbmi2 -mno-avx512bf16 -mno-avx512bitalg -mno-avx512vpopcntdq -mno-avx512vp2intersect -mno-avx5124fmaps -mno-avx512vnni -mno-avx5124vnniw` to avoid all unnecessary avx512 related instructions? Thought? – HCSF Jul 03 '19 at 08:59
  • @HCSF: lol. All of those other extensions depend on AVX512F, so the way GCC works is that `-mno-avx512f` will disable them all. Just like `-mno-avx` disables AVX2 and FMA instructions. But anyway, depending on your code having AVX512VL available can *help*, e.g. for `vpternlogd` or for masked instructions. It would be a mistake to *always* use `-mno-avx512f` along with `-mprefer-vector-width=128`, without letting the compiler take a stab at using AVX512VL. (AVX512 Vector Length is the extension that provides 128 and 256-bit versions of instructions. It's separate because Xeon Phi lacks it) – Peter Cordes Jul 03 '19 at 09:09
  • It seems that if I allow AVX512VL, it might use some 256-bit registers? Tho, it sounds like even if I use `-mprefer-vector-width=128`, gcc might still use 256-bit registers? I just checked that I don't see any `ymm` register in my decompiled code. – HCSF Jul 03 '19 at 09:37
  • @HCSF: huh? AVX2 allows gcc to use 256-bit registers if it wants to. But `-mprefer-vector-width=128` makes it not want to. You almost always want AVX + AVX2 enabled even if only using 128-bit vectors, for unaligned memory operands and for 3-operand non-destructive stuff. Disabling AVX512F is just to stop gcc from using longer EVEX instructions sometimes, not to stop it from using 256 or 512-bit instructions. – Peter Cordes Jul 03 '19 at 09:40
  • oh, you mean I shouldn't always use `-mno-avx512f` along with `-mprefer-vector-width=128` because that would disable `AVX512VL` completely. However, with `-mprefer-vector-width=128` alone, `AVX512VL` is still possible but that it won't use 256 bits? – HCSF Jul 03 '19 at 09:45
  • @HCSF: Right. You usually want to give gcc the option of using any 128-bit SIMD instructions your CPU supports, even if they're only available with AVX512 encodings. Again `vpternlogd` is really really good if you ever have boolean functions, and unsigned or 64-bit <-> float or double are much more efficient with AVX512. And vpermt2d is a really powerful 2-input shuffle, better than shufps. The only reason for disabling AVX512 is in case gcc shoots itself in the foot with mask regs or code-size, not because of 256-bit vectors. – Peter Cordes Jul 03 '19 at 09:50
  • Does your answer imply that running AVX2 instructions with `xmm` only has no penalty that could be with `ymm`-operand instructions? – xiver77 May 18 '22 at 19:43
  • If so, would compiling manually vectorized code optimized for using `xmm` registers with the `-mavx2` flag have no penalty compared to being compiled with `-msse4.2`? – xiver77 May 18 '22 at 19:46
  • @xiver77: Yes, same for AVX-512VL instructions like `vpternlogd xmm`. AFAIK, it's only ever the width (and light vs. heavy) that matters, not what ISA extension introduced it. If you're worried about turbo penalties even from YMM, use `-march=native -mprefer-vector-width=128` so the compiler won't use YMM for copying or initializing structs, either. – Peter Cordes May 18 '22 at 19:46
  • @PeterCordes Given Intel have recently removed AVX512 from the consumer lineup, do you think they will remove these power/downclock constraints? Or they can't for technical reasons? – user997112 Feb 17 '23 at 15:12
  • @user997112: There will probably still be cases where the L0 license is higher than the L1 license (see BeeOnRope's answer to this question), so there will be CPUs that need to downclock slightly for "heavy" 256-bit instructions, even before reaching thermal limits. AVX downclocking has been a thing since Haswell, IIRC, although the effect was much stronger with 512-bit instructions in some CPUs. – Peter Cordes Feb 17 '23 at 19:22