
Say, I want to clear 4 zmm registers.

Will the following code provide the fastest speed?

vpxorq  zmm0, zmm0, zmm0
vpxorq  zmm1, zmm1, zmm1
vpxorq  zmm2, zmm2, zmm2
vpxorq  zmm3, zmm3, zmm3

On AVX2, if I wanted to clear ymm registers, vpxor was fastest, faster than vxorps, since vpxor could run on multiple units.

On AVX512, we don't have vpxor for zmm registers, only vpxorq and vpxord. Is that an efficient way to clear a register? Is the CPU smart enough to not make false dependencies on previous values of the zmm registers when I clear them with vpxorq?

I don't yet have a physical AVX512 CPU to test this. Maybe somebody has tested on Knights Landing? Are there any published latency figures?

Maxim Masiutin
  • The instruction set, like AVX2 and AVX512, doesn't determine performance like you're implying. It depends on the actual microarchitecture implementation. Cannon Lake could easily have a very different AVX512 implementation than Knights Landing. – Ross Ridge Jun 16 '17 at 02:33
  • @RossRidge - yes, you are right. I've updated the question to say that I'm interested in Knights Landing. – Maxim Masiutin Jun 16 '17 at 03:38
  • As I understand the AVX instruction set, `vpxor xmm, xmm, xmm` clears the upper part of the destination register. Reference: Intel® 64 and IA-32 Architectures Software Developer’s Manual *2.3.10.1 Vector Length Transition and Programming Considerations [...] Programmers should bear in mind that instructions encoded with VEX.128 and VEX.256 prefixes will clear any future extensions to the vector registers.[...]* – EOF Jun 16 '17 at 06:02
  • Write a small test program using intrinsics and see what a decent compiler (e.g. ICC) generates for this. – Paul R Jun 16 '17 at 06:21
  • @PaulR - Thank you! Good idea! – Maxim Masiutin Jun 16 '17 at 06:25

3 Answers


The most efficient way is to take advantage of AVX implicit zeroing out to VLMAX (the maximum vector register width, determined by the current value of XCR0):

vpxor  xmm6, xmm6, xmm6
vpxor  xmm7, xmm7, xmm7
vpxor  xmm8, xmm0, xmm0   # still a 2-byte VEX prefix as long as the source regs are in the low 8
vpxor  xmm9, xmm0, xmm0

These are only 4-byte instructions (2-byte VEX prefix), instead of 6 bytes (4-byte EVEX prefix). Notice the use of source registers in the low 8 to allow a 2-byte VEX even when the destination is xmm8-xmm15. (A 3-byte VEX prefix is required when the second source reg is x/ymm8-15). And yes, this is still recognized as a zeroing idiom as long as both source operands are the same register (I tested that it doesn't use an execution unit on Skylake).
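As a concrete sketch of the size difference, here are the byte encodings for the three cases (derived from the VEX/EVEX encoding rules; NASM-style syntax):

```
C5 F1 EF C9          vpxor  xmm1, xmm1, xmm1    ; 4 bytes: 2-byte VEX
C4 C1 71 EF C8       vpxor  xmm1, xmm1, xmm8    ; 5 bytes: high second source forces a 3-byte VEX
62 F1 75 48 EF C9    vpxord zmm1, zmm1, zmm1    ; 6 bytes: 4-byte EVEX
```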

Other than code-size effects, the performance is identical to vpxord/q zmm and vxorps zmm on Skylake-AVX512 and KNL. (And smaller code is almost always better.) But note that KNL has a very weak front-end, where max decode throughput can only barely saturate the vector execution units and is usually the bottleneck according to Agner Fog's microarch guide. (It has no uop cache or loop buffer, and max throughput of 2 instructions per clock. Also, average fetch throughput is limited to 16B per cycle.)

Also, on hypothetical future AMD (or maybe Intel) CPUs that decode AVX512 instructions as two 256b uops (or four 128b uops), this is much more efficient. Current AMD CPUs (including Ryzen) don't detect zeroing idioms until after decoding vpxor ymm0, ymm0, ymm0 to 2 uops, so this is a real thing. Old compiler versions got it wrong (gcc bug 80636, clang bug 32862), but those missed-optimization bugs are fixed in current versions (GCC8, clang6.0, MSVC since forever(?). ICC still sub-optimal.)


Zeroing zmm16-31 does need an EVEX-encoded instruction; vpxord or vpxorq are equally good choices. EVEX vxorps requires AVX512DQ for some reason (unavailable on KNL), but EVEX vpxord/q is baseline AVX512F.

vpxor   xmm14, xmm0, xmm0
vpxor   xmm15, xmm0, xmm0
vpxord  zmm16, zmm16, zmm16     # or XMM if you already use AVX512VL for anything
vpxord  zmm17, zmm17, zmm17

EVEX prefixes are fixed-width, so there's nothing to be gained from using zmm0.

If the target supports AVX512VL (Skylake-AVX512 but not KNL) then you can still use vpxord xmm31, ... for better performance on future CPUs that decode 512b instructions into multiple uops.

If your target has AVX512DQ (Skylake-AVX512 but not KNL), it's probably a good idea to use vxorps when creating an input for an FP math instruction, or vpxord in any other case. No effect on Skylake, but some future CPU might care. Don't worry about this if it's easier to always just use vpxord.


Related: the optimal way to generate all-ones in a zmm register appears to be vpternlogd zmm0,zmm0,zmm0, 0xff. (With an immediate of 0xff, every entry in the 3-input truth table is 1, so the result is all-ones regardless of the inputs.) vpcmpeqd same,same doesn't work, because the AVX512 version compares into a mask register, not a vector.

This special case of vpternlogd/q is not handled as independent of the old value on KNL or on Skylake-AVX512, so try to pick a cold register. It is pretty fast on SKL-AVX512, though: 2 per clock throughput according to my testing. (If you need multiple regs of all-ones, use one vpternlogd and copy the result, especially if your code will run on Skylake and not just KNL.)
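For example, a sketch of the copy-from-one strategy (register choices are arbitrary):

```
vpternlogd zmm1, zmm1, zmm1, 0xff    ; zmm1 = all-ones; reads zmm1, so prefer a cold register
vmovdqa64  zmm2, zmm1                ; copy instead of repeating vpternlogd
vmovdqa64  zmm3, zmm1
```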


I picked 32-bit element size (vpxord instead of vpxorq) because 32-bit element size is widely used, and if one element size is going to be slower, it's usually not 32-bit that's slow. e.g. pcmpeqq xmm0,xmm0 is a lot slower than pcmpeqd xmm0,xmm0 on Silvermont. pcmpeqw is another way of generating a vector of all-ones (pre AVX512), but gcc picks pcmpeqd. I'm pretty sure it will never make a difference for xor-zeroing, especially with no mask-register, but if you're looking for a reason to pick one of vpxord or vpxorq, this is as good a reason as any unless someone finds a real perf difference on any AVX512 hardware.

Interesting that gcc picks vpxord, but vmovdqa64 instead of vmovdqa32.


XOR-zeroing doesn't use an execution port at all on Intel SnB-family CPUs, including Skylake-AVX512. (TODO: incorporate some of this into that answer, and make some other updates to it...)

But on KNL, I'm pretty sure xor-zeroing needs an execution port. The two vector execution units can usually keep up with the front-end, so handling xor-zeroing in the issue/rename stage would make no perf difference in most situations. vmovdqa64 / vmovaps need a port (and more importantly have non-zero latency) according to Agner Fog's testing, so we know it doesn't handle those in the issue/rename stage. (It could be like Sandybridge and eliminate xor-zeroing but not moves. But I doubt it because there'd be little benefit.)

As Cody points out, Agner Fog's tables indicate that KNL runs both vxorps/d and vpxord/q on FP0/1 with the same throughput and latency, assuming they do need a port. I assume that's only for xmm/ymm vxorps/d, unless Intel's documentation is in error and EVEX vxorps zmm can run on KNL.

Also, on Skylake and later, non-zeroing vpxor and vxorps run on the same ports. The run-on-more-ports advantage for vector-integer booleans is only a thing on Intel Nehalem to Broadwell, i.e. CPUs that don't support AVX512. (It even matters for zeroing on Nehalem, where it actually needs an ALU port even though it is recognized as independent of the old value).

The bypass-delay latency on Skylake depends on what port it happens to pick, rather than on what instruction you used. i.e. vaddps reading the result of a vandps has an extra cycle of latency if the vandps was scheduled to p0 or p1 instead of p5. See Intel's optimization manual for a table. Even worse, this extra latency applies forever, even if the result sits in a register for hundreds of cycles before being read. It affects the dep chain from the other input to the output, so it still matters in this case. (TODO: write up the results of my experiments on this and post them somewhere.)
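A sketch of the effect described above, using ymm for concreteness (the extra cycle only occurs if the boolean's uop happened to be scheduled to p0 or p1):

```
vandps ymm1, ymm2, ymm3    ; if this boolean is scheduled to p0 or p1 instead of p5...
vaddps ymm4, ymm4, ymm1    ; ...this add sees one extra cycle of latency on the ymm1 input
```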

Peter Cordes
  • @Cody: thanks for the edit. The VLMAX I was referring to is the `DEST[VLMAX-1:128] ← 0` in the Operation section of [insn set ref manual entries](http://felixcloutier.com/x86/ANDPS.html). The OS can't modify that part of XCR0, can it? If so, that implies that `vpxor xmm0` could leave the upper 256b of zmm0 unmodified with the right combination of settings. And that by re-enabling 512b vectors later, you could see the old contents? Or does changing VLMAX imply a vzeroupper or something, allowing the CPU to actually always zero all the way? – Peter Cordes Jun 30 '17 at 11:53
  • I believe the OS can change it from ring 0, but I don't know why that would happen dynamically. Normally, it would be something like a boot flag that disables AVX support. And I think it would be the OS's responsibility to issue VZEROUPPER if necessary, like maybe for a VM environment that supported dynamically toggling ISA support? I don't know if those exist! The thing I was unclear on is whether `VLMAX` would be set to 128 when running in SSE-compatibility mode ([state C here](https://software.intel.com/en-us/articles/intel-avx-state-transitions-migrating-sse-code-to-avx)). – Cody Gray - on strike Jun 30 '17 at 12:18
  • @CodyGray: Ah I see. Note that the SSE-compatibility mode is not an architecturally-visible thing. It only manifests as a performance effect, so you can be sure that the current microarchitectural SSE/AVX "state" doesn't change an instruction's effect on the architectural state. – Peter Cordes Jun 30 '17 at 12:21
  • Re: lack of VZEROUPPER: if it doesn't happen implicitly, then that might imply that without it, the CPU would need to preserve those contents (e.g. with a false dependency for every instruction). Not that it would be useful or usable with "normal" OSes and VMs, but the CPU would have to do it anyway unless they wrote the rules to allow it not to. (e.g. contents are allowed to be "undefined", or must-be-zero, rather than what they were before reducing VLMAX). – Peter Cordes Jun 30 '17 at 12:26
  • Of course, this question is specifically about Knights Landing, where you aren't supposed to use VZEROUPPER because it is *very* slow. And you aren't even supposed to need it. IIRC, the performance penalty for mixing legacy-SSE and VEX-encoded AVX instructions is minimal. Honestly, I'm still a bit confused about how things work on KNL. It is a very different world, and I don't have one to play with. Anyway, this is a clever solution to decrease the size of the instructions. I wasn't thinking about code *size* when I wrote my answer, so I completely missed this. – Cody Gray - on strike Jun 30 '17 at 12:35
  • @CodyGray: AFAIK `vzeroupper` is only useful on KNL for its effect on register contents (e.g. to avoid information leaks between contexts that don't trust each other). There is *no* perf penalty for mixing [E]VEX and non-VEX, according to what I've read. KNL's strategy for SSE instructions that leave the upper bits unmodified is that they simply have a false dep on the old value. This is sensible because Xeon Phi is such specialized hardware that anything performance-critical should always be compiled specifically for it. – Peter Cordes Jun 30 '17 at 13:02
  • @Cody: But these are just performance considerations. I'm wondering what the rules imply about correctness. If the docs say the upper bits have to be preserved after shrinking VLMAX, then using an AVX instruction, then growing VLMAX, then the hardware has to make that happen somehow (or it's a doc or CPU bug). I'm guessing that something in the rules lets it get away with *not* preserving that architectural state, otherwise it might need a special mode to give instructions false dependencies, and it seems crazy that they'd spend transistors on that instead of writing the rules to avoid it. – Peter Cordes Jun 30 '17 at 13:05

Following Paul R's advice of looking to see what code compilers generate, we see that ICC uses VPXORD to zero-out one ZMM register, then VMOVAPS to copy this zeroed XMM register to any additional registers that need to be zeroed. In other words:

vpxord    zmm3, zmm3, zmm3
vmovaps   zmm2, zmm3
vmovaps   zmm1, zmm3
vmovaps   zmm0, zmm3

GCC does essentially the same thing, but uses VMOVDQA64 for ZMM-ZMM register moves:

vpxord      zmm3, zmm3, zmm3
vmovdqa64   zmm2, zmm3
vmovdqa64   zmm1, zmm3
vmovdqa64   zmm0, zmm3

GCC also tries to schedule other instructions in-between the VPXORD and the VMOVDQA64. ICC doesn't exhibit this preference.

Clang uses VPXORD to zero all of the ZMM registers independently, a la:

vpxord  zmm0, zmm0, zmm0
vpxord  zmm1, zmm1, zmm1
vpxord  zmm2, zmm2, zmm2
vpxord  zmm3, zmm3, zmm3

The above strategies are followed by all versions of the indicated compilers that support generation of AVX-512 instructions, and don't appear to be affected by requests to tune for a particular microarchitecture.


This pretty strongly suggests that VPXORD is the instruction you should be using to clear a 512-bit ZMM register.

Why VPXORD instead of VPXORQ? Well, you only care about the size difference when you're masking, so if you're just zeroing a register, it really doesn't matter. Both are 6-byte instructions, and according to Agner Fog's instruction tables, on Knights Landing:

  • Both execute on the same ports (FP0 or FP1),
  • Both decode to 1 µop,
  • Both have a minimum latency of 2 and a reciprocal throughput of 0.5.
    (Note that this last bullet highlights a major disadvantage of KNL—all vector instructions have a latency of at least 2 clock cycles, even the simple ones that have 1-cycle latencies on other microarchitectures.)

There's no clear winner, but compilers seem to prefer VPXORD, so I'd stick with that one, too.

What about VPXORD/VPXORQ vs. VXORPS/VXORPD? Well, as you mention in the question, packed-integer instructions can generally execute on more ports than their floating-point counterparts, at least on Intel CPUs, making the former preferable. However, that isn't the case on Knights Landing. Whether packed-integer or floating-point, all logical instructions can execute on either FP0 or FP1, and have identical latencies and throughput, so you should theoretically be able to use either. (Note, though, that the EVEX-encoded VXORPS on ZMM registers requires AVX512DQ, which Knights Landing lacks, so for ZMM the integer form is the only encodable option there anyway.) Also, since both forms of instructions execute on the floating-point units, there is no domain-crossing penalty (forwarding delay) for mixing them like you would see on other microarchitectures. My verdict? Stick with the integer form. It isn't a pessimization on KNL, and it's a win when optimizing for other architectures, so be consistent. It's less for you to remember. Optimizing is hard enough as it is.
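Concretely, the consistent-integer choice looks like this (a sketch; the comments reflect the caveats above):

```
vpxord zmm0, zmm0, zmm0    ; integer form: baseline AVX512F, encodable everywhere
vxorps zmm1, zmm1, zmm1    ; FP form: EVEX vxorps requires AVX512DQ, so not encodable on KNL
```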

Incidentally, the same is true when it comes to deciding between VMOVAPS and VMOVDQA64. They are both 6-byte instructions, they both have the same latency and throughput, they both execute on the same ports, and there are no bypass delays that you have to be concerned with. For all practical purposes, these can be seen as equivalent when targeting Knights Landing.

And finally, you asked whether "the CPU [is] smart enough not to make false dependencies on the previous values of the ZMM registers when [you] clear them with VPXORD/VPXORQ". Well, I don't know for sure, but I imagine so. XORing a register with itself to clear it has been an established idiom for a long time, and it is known to be recognized by other Intel CPUs, so I can't imagine why it wouldn't be on KNL. But even if it's not, this is still the best way to clear a register.

The alternative would be something like moving in a 0 value from memory, which is not only a substantially longer instruction to encode but also requires you to pay a memory-access penalty. This isn't going to be a win…unless maybe you were throughput-bound, since VMOVAPS with a memory operand executes on a different unit (a dedicated memory unit, rather than either of the floating-point units). You'd need a pretty compelling benchmark to justify that kind of optimization decision, though. It certainly isn't a "general purpose" strategy.

Or maybe you could do a subtraction of the register with itself? But I doubt this would be any more likely to be recognized as dependency-free than XOR, and everything else about the execution characteristics will be the same, so that's not a compelling reason to break from the standard idiom.

In both of these cases, the practicality factor comes into play. When push comes to shove, you have to write code for other humans to read and maintain. Since it's going to cause everyone forever after who reads your code to stumble, you'd better have a really compelling reason for doing something odd.


Next question: should we repeatedly issue VPXORD instructions, or should we copy one zeroed register into the others?

Well, VPXORD and VMOVAPS have equivalent latencies and throughputs, decode to the same number of µops, and can execute on the same number of ports. From that perspective, it doesn't matter.

What about data dependencies? Naïvely, one might assume that repeated XORing is better, since the move depends on the initial XOR. Perhaps this is why Clang prefers repeated XORing, and why GCC prefers to schedule other instructions in-between the XOR and MOV. If I were writing the code quickly, without doing any research, I'd probably write it the way Clang does. But I can't say for sure whether this is the best approach without benchmarks. And with neither of us having access to a Knights Landing processor, these aren't going to be easy to come by. :-)

Intel's Software Developer Emulator does support AVX-512, but it's unclear whether this is a cycle-exact simulator that would be suitable for benchmarking/optimization decisions. This document simultaneously suggests both that it is ("Intel SDE is useful for performance analysis, compiler development tuning, and application development of libraries.") and that it is not ("Please note that Intel SDE is a software emulator and is mainly used for emulating future instructions. It is not cycle accurate and can be very slow (up-to 100x). It is not a performance-accurate emulator."). What we need is a version of IACA that supports Knights Landing, but alas, that has not been forthcoming.


In summary, it's nice to see that three of the most popular compilers generate high-quality, efficient code even for such a new architecture. They make slightly different decisions in which instructions to prefer, but this makes little to no practical difference.

In many ways, this is because of unique aspects of the Knights Landing microarchitecture: most vector instructions execute on either of two floating-point units with identical latencies and throughputs, so there are no domain-crossing penalties to be concerned with and no particular benefit in preferring packed-integer instructions over floating-point instructions. You can see this in the core diagram (the orange blocks on the left are the two vector units):

Diagram/schematic of Intel's Knights Landing microprocessor core, showing there are only 2 vector units.

Use whichever sequence of instructions you like the best.

Cody Gray - on strike
  • Thank you for the high-quality answer! I will change the instruction to `vpxord` from `vpxorq`. As soon as I am able to run an application on a real Knights Landing, I will let you know. – Maxim Masiutin Jun 16 '17 at 09:34
  • Hmm, I didn't suggest changing `vpxord` to `vpxorq`. I said it doesn't make a difference, and that I would just stick with `vpxord` since that's what compilers emit. You can certainly change it if you want for testing purposes, but don't do it because I advised it! – Cody Gray - on strike Jun 16 '17 at 09:39
  • :-) I will do that because Clang does so, and it was you who told me that Clang does so :-) – Maxim Masiutin Jun 16 '17 at 09:41
  • There's one corner-case situation where it's beneficial to `xor` instead of `mov`: when the zeroed register is immediately fed into another instruction that overwrites it. Using `mov` in that case requires an extra zeroed register to move from, whereas `xor` doesn't. So `mov` may result in register pressure. – Mysticial Jun 16 '17 at 19:19
  • This is extremely rare though, since almost all SIMD instructions (since AVX) are non-destructive. The only exceptions are the FMAs, 2-reg permutes, and blend-masking. For zero inputs, FMAs degenerate and blend-masking reduces to zero-masking. So the only things left are the 2-reg permutes and IFMA52. And even in these cases, you have to run out of 32 registers for it to matter. – Mysticial Jun 16 '17 at 19:20
  • Why doesn't the compiler just use an EVEX-prefixed instruction to clear a 256-bit register? It should automatically clear the highest bits 511-256, shouldn't it? – Maxim Masiutin Jun 25 '17 at 01:22
  • Yes, that's what it does, @Maxim. `VPXORD` has an EVEX prefix. For example, the byte encoding for `VPXORD zmm0, zmm0, zmm0` is `62 F1 7D 48 EF C0`; the first 4 bytes are the EVEX prefix, with [the initial 62h being the dead give-away](http://agner.org/optimize/blog/read.php?i=288). – Cody Gray - on strike Jun 25 '17 at 07:21
  • @CodyGray - I meant an EVEX prefix to operate on a ymm register, not zmm. By using an EVEX prefix on a ymm register, the higher bits will be cleared, so we won't need to explicitly specify zmm in the instruction. – Maxim Masiutin Jun 25 '17 at 07:25
  • EVEX also allows you to operate on 512-bit registers (ZMM). What you're proposing wouldn't be any shorter: `vpxord ymm0, ymm0, ymm0` is still 6 bytes (`62 F1 7D 28 EF C0`), so no advantage. Aside from that, I'm actually not sure that operations on YMM registers clear the upper 256 bits. I know that operations on XMM registers do *not* clear the upper 128 bits, so I imagine the same is true for operations on YMM registers. You'd have to check in the manual to be sure. @Maxim – Cody Gray - on strike Jun 25 '17 at 07:30
  • @CodyGray - no -- operating on XMM registers with a VEX prefix (AVX operations on XMM) _clears_ the highest bits of the YMM registers - the Intel manual says this many times. This is analogous to operating on 32-bit operands of 64-bit registers in 64-bit mode. For example, `mov eax, ebx` also clears the highest bits (63-32) of `rax`. – Maxim Masiutin Jun 25 '17 at 07:36
  • Ah, yes, you're exactly right. The VEX prefix is what changes the behavior, causing the upper bits to be cleared. It isn't exactly like 64-bit mode, though, because that *always* clears the upper bits; in compatibility mode, SSE instructions don't. You have to use the form with the VEX prefix. There was some discussion about why this decision was made [here](https://software.intel.com/en-us/forums/intel-isa-extensions/topic/301853). Anyway, like I said, no advantage here. @Maxim – Cody Gray - on strike Jun 25 '17 at 07:40
  • In 32-bit/64-bit mode with general registers (EAX/RAX), the prefix is free in terms of performance; it only affects instruction size. But when you deal with XMM/YMM registers, the prefix is not free. You have to either always use the prefix, and your code will be called "AVX", or never use the prefix, and the code will be called "SSE". You cannot mix instructions _with_ and _without_ the prefix; otherwise, transition penalties will occur, because the CPU will save the higher bits of the registers somewhere. – Maxim Masiutin Jun 25 '17 at 07:50
  • Prefixes are not free in terms of performance precisely *because* they affect instruction size. Each prefix increases the time it takes to decode the instruction, which makes the code slower. Not to mention that larger code doesn't fit as well in the cache, and tends to evict *other* code from the cache. Definitely a performance penalty involved. You see this when you use 16-bit operands, too: in 32-bit mode, those require an operand-size-override prefix, and such code benchmarks *substantially* slower than if you had used 32-bit operands. Yes, I've tested it. @Maxim – Cody Gray - on strike Jun 25 '17 at 07:52
  • @MaximMasiutin: mixing VEX and EVEX is totally fine, because AVX was correctly designed to avoid a repeat of the SSE/AVX mixing problem (by implicitly zeroing out to VLMAX as you point out). This is why `vpxor xmm15,xmm0,xmm0` is the best way to zero `zmm15` (4-byte instruction instead of 6, [as I explain in my answer](https://stackoverflow.com/questions/44578967/what-is-the-most-efficient-way-to-clear-a-single-or-a-few-zmm-registers-on-knigh/44841054#44841054)). – Peter Cordes Jun 30 '17 at 07:42
  • Also, @Cody: EVEX `vxorps zmm` surprisingly requires AVX512DQ, which KNL doesn't support! So `vpxord/q` is your only good option. – Peter Cordes Jun 30 '17 at 07:43
  • xor-zeroing instructions don't need any execution port on Intel Skylake-AVX512, so that concern in the question is a red herring. (And even if it did, Skylake changed things so FP booleans also run on all ports, with the odd behaviour that bypass delay now depends on which port the uop was scheduled to, rather than whether you wrote `vandps` or `vpand`. Also, this latency isn't just a bypass latency; it sticks with the register forever (or until an xsave/xrstor context switch), but this part also applies to Haswell and isn't documented.) – Peter Cordes Jun 30 '17 at 08:01
  • I'm pretty sure that repeated `xor` is better than `xor`+`mov`, except maybe on CPUs like AMD Bulldozer-family where xor-zeroing needs an execution port but vector-mov is eliminated. The difference is small enough that gcc hasn't bothered to change this for `-mtune=intel` or anything. – Peter Cordes Jun 30 '17 at 08:03
  • Agner Fog says KNL recognizes vpxor, vpxord, vpxorq, vxorps, vxorpd as independent of the old value, but *not* subtract or compare (not even the `vpcmpeqd` all-ones idiom). It's really picky; even 64-bit `xor rax,rax` is not recognized, only 32-bit `xor eax,eax`. And for vectors, only VEX/EVEX, not legacy SSE (but that's because it doesn't have special handling for avoiding false deps with legacy SSE code that leaves the upper parts of vectors unmodified). – Peter Cordes Jun 30 '17 at 08:10
  • Interesting suggestion to load zeros, though. `vmovd xmm0, dword [zeros]` would be an interesting thing to try in a loop that bottlenecked on FPU0/1 throughput and had spare cycles for the load port. Except that KNL's front-end is almost always the bottleneck, not execution units, according to Agner Fog. (decoder max throughput of 2 insns per clock, and no loop buffer. multi-uop instructions are very slow to decode, so only a long micro-coded instruction could bottleneck on ALU ports.) – Peter Cordes Jun 30 '17 at 08:23
  • SDE's emulation of instruction-set extensions doesn't even try to be cycle-accurate. I think their claim that you can use it for "performance analysis" is for other things you can do with the Pin framework it uses, like giving you dynamic instruction counts and other [dynamic binary-instrumentation stuff](https://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool). – Peter Cordes Jun 30 '17 at 08:37
  • @PeterCordes - Peter Cordes also thinks that implicit clearing of the lower registers (ymm) also clears the higher bits (zmm), so just clearing ymm is enough - there is no need to explicitly clear the zmm registers. That's what my comment was about ("Why don't the compiler just do a EVEX-prefixed instruction to clear 256-bit register - it should automatically clear highest bits 511-256, isn't it? – Maxim Masiutin Jun 25 at 1:22"). I just wasn't sure whether we needed the EVEX prefix at all; just VEX would have been enough. – Maxim Masiutin Jun 30 '17 at 11:26

I put together a simple C test program using intrinsics and compiled with ICC 17 - the generated code I get for zeroing 4 zmm registers (at -O3) is:

    vpxord    %zmm3, %zmm3, %zmm3                           #7.21
    vmovaps   %zmm3, %zmm2                                  #8.21
    vmovaps   %zmm3, %zmm1                                  #9.21
    vmovaps   %zmm3, %zmm0                                  #10.21
Paul R
  • Thank you! What do the numbers `#7.21` mean? – Maxim Masiutin Jun 16 '17 at 08:59
  • They're just comments added by the compiler, @Maxim. The 7, 8, 9, and 10 are line numbers from the source code. The 21 appears to be a column number where the intrinsic begins. – Cody Gray - on strike Jun 16 '17 at 09:25
  • Why doesn't the compiler just use an EVEX-prefixed instruction to clear a 256-bit register? It should automatically clear the highest bits 511-256, shouldn't it? – Maxim Masiutin Jun 25 '17 at 01:22
  • @MaximMasiutin: did you write that backwards? Using `vpxor ymm0,ymm0,ymm0` to clear zmm0? IDK why you'd want to use an EVEX instruction if you only cared about the ymm part. The reverse is a good idea, though, [see my answer](https://stackoverflow.com/questions/44578967/what-is-the-most-efficient-way-to-clear-a-single-or-a-few-zmm-registers-on-knigh/44841054#44841054). – Peter Cordes Jun 30 '17 at 07:36