Intel engineers wrote that we should use VZEROUPPER/VZEROALL to avoid the costly transition to non-VEX state on all processors, including future Xeon processors, but not on Xeon Phi: https://software.intel.com/pt-br/node/704023

People have also measured and found out that VZEROUPPER and VZEROALL are expensive on Knights Landing:

36 clock cycles for both instructions in 64-bit mode (30 clock cycles in 32-bit mode).

See the above link.

So my code will be the following, if I have just used ymm0 and ymm1:

if [we are running on a Xeon Phi]
     vpxor       ymm0,ymm0,ymm0
     vpxor       ymm1,ymm1,ymm1
else
     vzeroall
endif
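
A minimal sketch of how that branch could be resolved once at startup instead of at every call site. The names `UpperStateCleanup` and `chooseCleanup` are hypothetical, and the `on_xeon_phi` flag is assumed to come from whatever detection method is eventually chosen:

```cpp
// Which sequence to run after a block of VEX-encoded code.
// Both names are hypothetical and only mirror the pseudocode above.
enum class UpperStateCleanup {
    ZeroUsedRegisters,  // vpxor ymm0,ymm0,ymm0 / vpxor ymm1,ymm1,ymm1
    VZeroAll            // vzeroall
};

// Assumption: on_xeon_phi is computed once at program start by some
// detection routine that is not shown here.
constexpr UpperStateCleanup chooseCleanup(bool on_xeon_phi)
{
    return on_xeon_phi ? UpperStateCleanup::ZeroUsedRegisters
                       : UpperStateCleanup::VZeroAll;
}
```

The result would be stored in a global at startup, so each cleanup site only pays for a predictable branch (or an indirect call) rather than repeating the detection.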

How can I detect Xeon Phi (Knights Landing and later Xeon Phi processors) to implement the above code?

The situation with VZEROUPPER/VZEROALL is now as follows:

  1. These instructions are not needed and are very costly on Xeon Phi Knights Landing: 36 clock cycles for both instructions in 64-bit mode (30 clock cycles in 32-bit mode).
  2. These instructions are very cheap and are needed on Xeon and Core processors (Skylake/Kaby Lake), and will be needed on Xeon for the foreseeable future, to avoid the costly transition to non-VEX state.

The advertising materials claim that Xeon Phi (Knights Landing) is fully compatible with other Xeon processors.

Is there a reliable way to detect Xeon Phi, for the purpose of avoiding VZEROUPPER/VZEROALL?

There is an article "How to detect Knights Landing AVX-512 support (Intel® Xeon Phi™ processor)" by James R., updated February 22, 2016, but it focuses only on the specific new instructions that became available on Knights Landing. So it still says nothing clear about the VEX transitions.

It would be good to know whether Intel plans to implement a CPUID bit that shows whether transitions to non-VEX state are costly. For example:

  • Bit is set to 0 - VEX state transitions are costly, but VZEROUPPER/VZEROALL are cheap and should be used to clear the state;
  • Bit is set to 1 - there is no transition penalty, and VZEROUPPER/VZEROALL are not needed.

The above-mentioned article about detecting Knights Landing suggests checking the AVX-512F+CD+ER+PF bits, as introduced in Knights Landing.

So the code checks all these bits at once, and if all are set, then we are on Knights Landing:

uint32_t avx2_bmi12_mask = (1 << 16) | // AVX-512F
                           (1 << 26) | // AVX-512PF
                           (1 << 27) | // AVX-512ER
                           (1 << 28);  // AVX-512CD
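
A minimal sketch of how such a mask could be applied, assuming the caller has already read EBX from CPUID leaf 7, sub-leaf 0. The constant and function names are hypothetical; only the bit positions come from the mask above:

```cpp
#include <cstdint>

// Bit positions in CPUID.(EAX=7, ECX=0):EBX, as listed in the mask above.
constexpr std::uint32_t kAvx512F  = 1u << 16;
constexpr std::uint32_t kAvx512PF = 1u << 26;
constexpr std::uint32_t kAvx512ER = 1u << 27;
constexpr std::uint32_t kAvx512CD = 1u << 28;

constexpr std::uint32_t kKnlFeatureMask =
    kAvx512F | kAvx512PF | kAvx512ER | kAvx512CD;

// Hypothetical helper: true only when every bit of the mask is set.
// Comparing against the mask (rather than testing for non-zero) is what
// makes this an "all bits" check instead of an "any bit" check.
constexpr bool hasKnlFeatureSet(std::uint32_t leaf7_ebx)
{
    return (leaf7_ebx & kKnlFeatureMask) == kKnlFeatureMask;
}
```

A processor that reports AVX-512F and AVX-512CD but not ER or PF would fail this test, which is the property the check relies on to separate Knights Landing from other AVX-512 parts.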

It would be good to know whether Intel plans to add all of these bits to ordinary (non-Phi) Xeon or Core processors in the near future, so that they too would support the AVX-512F+CD+ER+PF features introduced in Knights Landing.

If Xeon and Core processors come to support AVX-512F+CD+ER+PF, we won’t be able to distinguish Xeon from Xeon Phi this way.

Please advise.

Maxim Masiutin
  • Surely you know the target CPU at compile time, so you can just use a preprocessor macro? – Paul R Jun 10 '17 at 07:50
  • @PaulR, no, we don't know the target CPU at compile time. We write a mass-market application that can be launched practically anywhere, and Knights Landing can be used as the main processor to run any application. – Maxim Masiutin Jun 10 '17 at 13:51
  • Right, but you could compile CPU-specific versions of your optimised functions and use a despatcher at run-time - this is what Intel does in its optimised libraries. Your code would then be more efficient as it wouldn't need to do any run-time checks (apart from the first time the despatcher is called). – Paul R Jun 10 '17 at 13:58
  • @PaulR - anyway, I should somehow detect the Xeon Phi, even once at startup, and I have no idea of how to reliably detect it. – Maxim Masiutin Jun 10 '17 at 14:04
  • Does this help: https://software.intel.com/en-us/articles/how-to-detect-knl-instruction-support ? – Paul R Jun 10 '17 at 15:46
  • @PaulR - thank you! I have updated the question to address the issue that you have raised about the "how to detect" article. – Maxim Masiutin Jun 11 '17 at 03:07
  • Can't you check supported features and then `cpuid.family == B` http://www.sandpile.org/x86/cpuid.htm – ta.speot.is Jun 11 '17 at 03:21
  • @PaulR -- OK, will do! Thank you, Paul! – Maxim Masiutin Jun 11 '17 at 03:32
  • One thing to watch out for is that Skylake Xeon (Purley) will support AVX512 - I don't know whether it has the VEX switching penalty though. – Paul R Jun 11 '17 at 06:59
  • It isn't a problem if you write the code correctly. You only need to execute this costly instruction *once*, and then as long as you never mix VEX and non-VEX instructions, you don't need to use it again. Worst case, you waste ~30 cycles one time on KNL, which (1) isn't a very common CPU, and (2) is a sufficiently fast CPU that these wasted cycles are going to easily be made up. – Cody Gray - on strike Jun 12 '17 at 15:07
  • @CodyGray: true, so long as you never make any calls to system or third party libraries - unfortunately these may well contain non-VEX SSE code. Having said that, SIMD code should be operating at sufficient granularity that a 30 cycle overhead is no big deal - if you're mixing system/library calls and SIMD code so finely that this makes a difference then you're probably not going to be operating at peak efficiency anyway. – Paul R Jun 15 '17 at 08:13
  • @PaulR - thank you, I have already decided to VEX-prefix all instructions that I can, and as for the other libraries, I will make sure that they are granular, so, as you wrote, 30 cycles is no big deal. I have decided to refrain from using VZEROUPPER/VZEROALL altogether. I have made tests and found out that if your code is granular, issuing even a single VZEROUPPER once in the entire program (single-threaded) makes all subsequent non-VEX code 30% slower than if we don't call VZEROUPPER at all after VEX code. What a paradox! I can publish the test results if you wish. Is that a known behavior? – Maxim Masiutin Jun 15 '17 at 08:33
  • @PaulR - VZEROUPPER makes subsequent non-VEX code slower only under Kaby Lake and Skylake. Under previous Intel microprocessor architectures, it doesn't matter. – Maxim Masiutin Jun 15 '17 at 08:35

1 Answer

If you specifically want to check for being on a KNL (rather than the more general "Does the CPU I am running on have feature X?") you can do that by looking at the "Extended Model", "Family" and "Model" fields in %eax after calling cpuid with %eax == 1 and %ecx == 0. C++ code something like that below will do the job.

However, as others are implicitly pointing out, this is a very specific test, and will, for instance, fail on future Knights cores, so you would likely be better off doing as has been suggested and checking for AVX-512 features that are not in Xeon, namely AVX512-ER and AVX512-PF. (Of course, such instructions could appear in future Xeons, so this is not guaranteed in the long term, but, quoting Keynes: "In the long run we are all dead" :-))

#include <cstdint>   // uint32_t

class cpuidState
{
    uint32_t orig_eax;                      /* Values sent in to the cpuid instruction */
    uint32_t orig_ecx;

    uint32_t eax;                           /* Values received back from it. */
    uint32_t ebx;
    uint32_t ecx;
    uint32_t edx;

    void cpuid()
    {
        __asm__ __volatile__("cpuid"
                             : "+a" (eax), "=b" (ebx), "+c" (ecx), "=d" (edx));
    }

    void update (uint32_t eaxVal, uint32_t ecxVal)
    {
        orig_eax = eaxVal;
        orig_ecx = ecxVal;
        eax      = eaxVal;
        ecx      = ecxVal;
        cpuid();
    }

    void ensureCorrectLeaf(uint32_t eaxVal, uint32_t ecxVal)
    {
        if (orig_eax != eaxVal || orig_ecx != ecxVal)
            update (eaxVal, ecxVal);
    }

 public:
    cpuidState() : orig_eax (-1), orig_ecx(-1) { }

    // Include the Extended Model in the test. Without it we see some Xeons as KNL :-(
    bool onKNL()            { ensureCorrectLeaf(1,0); return (eax & 0x0f0ff0) == 0x50670; }    
};
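
The magic constants in `onKNL()` can be unpacked into a more explicit form. The sketch below is only an illustration: the helper names are hypothetical, and it takes the EAX value from CPUID leaf 1 as a plain argument rather than executing `cpuid` itself. It follows the usual Intel convention in which the extended model is folded into the model number for family 6 and family 15 parts, which gives Knights Landing family 0x6, model 0x57:

```cpp
#include <cstdint>

// Hypothetical helpers; eax is the value returned by CPUID leaf 1.
// Layout: stepping [3:0], model [7:4], family [11:8],
//         extended model [19:16], extended family [27:20].
constexpr std::uint32_t displayFamily(std::uint32_t eax)
{
    return ((eax >> 8) & 0xF) == 0xF
        ? ((eax >> 8) & 0xF) + ((eax >> 20) & 0xFF)
        : ((eax >> 8) & 0xF);
}

constexpr std::uint32_t displayModel(std::uint32_t eax)
{
    return (((eax >> 8) & 0xF) == 0x6 || ((eax >> 8) & 0xF) == 0xF)
        ? ((((eax >> 16) & 0xF) << 4) | ((eax >> 4) & 0xF))
        : ((eax >> 4) & 0xF);
}

// Same test as (eax & 0x0f0ff0) == 0x50670, spelled out field by field.
constexpr bool isKnlSignature(std::uint32_t eax)
{
    return displayFamily(eax) == 0x6 && displayModel(eax) == 0x57;
}
```

With the extended model included, a signature such as 0x10676 (family 6, model 0x17) is correctly rejected, whereas looking at the 4-bit model field alone would see a 7 in both cases.
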
Jim Cownie