
In this question, it is confirmed that __builtin_cpu_supports("avx2") doesn't check for OS support. (Or at least, it didn't before GCC fixed the bug.) From the Intel docs, I know that in addition to checking the CPUID bits, we need to check something related to the x86-64 instruction xgetbv. The Intel docs linked above provide this code for the check:

#include <stdint.h>
#if defined(_MSC_VER)
#include <intrin.h>   /* for _xgetbv */
#endif

int check_xcr0_ymm()
{
    uint32_t xcr0;
#if defined(_MSC_VER)
    xcr0 = (uint32_t)_xgetbv(0);  /* min VS2010 SP1 compiler is required */
#else
    __asm__ ("xgetbv" : "=a" (xcr0) : "c" (0) : "%edx" );
#endif
    return ((xcr0 & 6) == 6); /* checking if xmm and ymm state are enabled in XCR0 */
}

Question: Is this check plus the CPUID check sufficient to guarantee AVX2 instructions won't crash my program?

Bonus Question: What is this check actually doing? Why does it exist? (There is some discussion of this here and here, but I think the topic deserves a dedicated answer).


Notes:

  • this question is on a similar topic, but the answers don't cover xgetbv.
  • this question is similar, but asks about Windows specifically. I'm interested in a cross-platform solution.
Elliot Gorokhovsky

1 Answer


Yes, CPUID + checking those XCR0 bits is sufficient, assuming an OS that isn't broken (and follows the expected conventions).

And assuming a virtual machine or emulator's CPUID instruction doesn't lie by telling you AVX2 is available when the instructions would actually fault. But if either of those things happens, it's the OS or VM's fault, not your program's.

(For compat with quite old CPUs, you need to use CPUID to check the OSXSAVE feature bit (CPUID.1:ECX bit 27) before running XGETBV, otherwise XGETBV itself will fault with #UD. That bit covers both the CPU supporting XGETBV and the OS having enabled it via CR4.OSXSAVE. A good AVX detection function will do this; see the sketch below.
See also Which versions of Windows support/require which CPU multimedia extensions? (How to check if SSE or AVX are fully usable?) - my answer there focuses mostly on the latter and isn't Windows-specific.)
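
For reference, here's a minimal sketch of what a complete cross-platform check can look like (my code, not from Intel's docs: it chains all three steps, using GCC/Clang's <cpuid.h> helpers and MSVC's intrinsics; the function name avx2_usable is made up):

#include <stdint.h>
#if defined(_MSC_VER)
#include <intrin.h>    /* __cpuid, __cpuidex, _xgetbv */
#else
#include <cpuid.h>     /* __get_cpuid, __get_cpuid_count (GCC/Clang) */
#endif

/* 1 if AVX2 is safe to use (CPU has it AND the OS saves/restores YMM state). */
static int avx2_usable(void)
{
    unsigned int eax, ebx, ecx, edx;
#if defined(_MSC_VER)
    int regs[4];
    __cpuid(regs, 1);
    ecx = (unsigned int)regs[2];
#else
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
#endif
    /* Leaf 1 ECX: bit 27 = OSXSAVE (so XGETBV won't fault), bit 28 = AVX. */
    if ((ecx & (1u << 27)) == 0 || (ecx & (1u << 28)) == 0)
        return 0;

    uint32_t xcr0;
#if defined(_MSC_VER)
    xcr0 = (uint32_t)_xgetbv(0);
#else
    __asm__ ("xgetbv" : "=a" (xcr0) : "c" (0) : "edx");
#endif
    if ((xcr0 & 6) != 6)   /* bit 1 = XMM state, bit 2 = YMM state */
        return 0;

    /* Leaf 7 subleaf 0 EBX: bit 5 = AVX2. */
#if defined(_MSC_VER)
    __cpuidex(regs, 7, 0);
    ebx = (unsigned int)regs[1];
#else
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
#endif
    return (ebx & (1u << 5)) != 0;
}

(__get_cpuid_count verifies the max supported leaf before reading leaf 7; on the MSVC path, fully defensive code would first check leaf 0's EAX >= 7 the same way.)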


If you just checked CPUID, you'd find that the CPU supported AVX2 even if that CPU was running an old OS that didn't know about AVX, and only saved/restored XMM registers on context-switch, not YMM.

Intel designed things so the failure mode would be an illegal-instruction fault (#UD) in that case, rather than silently corrupting user-space state if multiple threads / processes used YMM or ZMM registers. (Because that would be horrible.)

(Every task is supposed to have its own private register state, including integer and FP/SIMD registers. Context switching without saving/restoring the high halves of the YMM registers would effectively be asynchronously corrupting registers, if you look at program-order execution for a single thread.)

The mechanism is that the OS has to set some bits in XCR0 (extended control-register 0), which user-space can check via xgetbv. If those bits are set, it's effectively a promise that the OS is AVX-aware and will save/restore YMM regs. And that it will set other control-register bits so SSE and AVX instructions actually work without faulting.

I'm not sure if these bits actually affect the CPU behaviour at all, or if they only exist as a communication mechanism for the kernel to advertise AVX support to user-space.

(YMM registers were new with AVX1, and XMM were new with SSE1. The OS doesn't need to know about SSE4.x or AVX2, just how to save the new architectural state. So AVX-512 is the next SIMD extension that needed new OS support.)
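
For reference, a sketch of the equivalent XCR0 check for AVX-512 (not from the original answer; the extra state-component bit positions are from Intel's SDM, and the function name is mine):

#include <stdint.h>
#if defined(_MSC_VER)
#include <intrin.h>   /* _xgetbv */
#endif

/* XCR0 bits: 1 = XMM, 2 = YMM high halves, 5 = opmask (k0-k7),
 * 6 = ZMM0-15 high 256 bits, 7 = ZMM16-31. All must be set for AVX-512. */
#define XCR0_AVX512_MASK 0xE6u   /* bits 1,2,5,6,7 */

int check_xcr0_zmm(void)
{
    uint32_t xcr0;
#if defined(_MSC_VER)
    xcr0 = (uint32_t)_xgetbv(0);
#else
    __asm__ ("xgetbv" : "=a" (xcr0) : "c" (0) : "edx");
#endif
    return (xcr0 & XCR0_AVX512_MASK) == XCR0_AVX512_MASK;
}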

Update: I think those XCR0 bits actually do control whether AVX1/2 and AVX-512 instructions will #UD. MacOS's Darwin kernel apparently only does on-demand AVX-512 support, so the first usage will fault (but then the kernel handles it and re-runs the instruction, transparently to user-space, I hope). See the source:

// darwin-xnu .../i386/fpu.c#L176
 * On-demand AVX512 support
 * ------------------------
 * On machines with AVX512 support, by default, threads are created with
 * AVX512 masked off in XCR0 and an AVX-sized savearea is used. However, AVX512
 * capabilities are advertised in the commpage and via sysctl. If a thread
 * opts to use AVX512 instructions, the first will result in a #UD exception.
 * Faulting AVX512 instructions are recognizable by their unique prefix.
 * This exception results in the thread being promoted to use an AVX512-sized
 * savearea and for the AVX512 bit masks being set in its XCR0. The faulting
 * instruction is re-driven and the thread can proceed to perform AVX512
 * operations.
 *
 * In addition to AVX512 instructions causing promotion, the thread_set_state()
 * primitive with an AVX512 state flavor result in promotion.
 *
 * AVX512 promotion of the first thread in a task causes the default xstate
 * of the task to be promoted so that any subsequently created or subsequently
 * DNA-faulted thread will have AVX512 xstate and it will not need to fault-in
 * a promoted xstate.
 *
 * Two savearea zones are used: the default pool of AVX-sized (832 byte) areas
 * and a second pool of larger AVX512-sized (2688 byte) areas.
 *
 * Note the initial state value is an AVX512 object but that the AVX initial
 * value is a subset of it.
 */

So on MacOS, it seems XGETBV + checking XCR0 might not be a guaranteed way to detect usability of AVX-512 instructions! The comment says "capabilities are advertised in the commpage and via sysctl", so you need an OS-specific check; see the sysctl sketch below.

But that's AVX-512; probably AVX1/2 is always enabled so checking XCR0 for that will work everywhere, including MacOS.
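
A hedged sketch of that OS-specific route on macOS (my code, not from the kernel sources; "hw.optional.avx512f" is the sysctl key Darwin exposes for AVX-512 Foundation - verify the exact key on your target):

#include <stddef.h>
#include <sys/types.h>
#include <sys/sysctl.h>

/* macOS-only: ask the kernel whether AVX-512F is usable, rather than
 * trusting the current thread's (possibly not-yet-promoted) XCR0. */
int darwin_has_avx512f(void)
{
    int val = 0;
    size_t len = sizeof(val);
    if (sysctlbyname("hw.optional.avx512f", &val, &len, NULL, 0) != 0)
        return 0;   /* key absent: kernel/CPU has no AVX-512 */
    return val != 0;
}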


Lazy context switches used to be a thing

Some OSes used to use "lazy" context switches, not actually saving/restoring the x87, XMM, and maybe even YMM registers until the new process actually used them. This was done by using a separate control-register bit that made those types of instructions fault if executed; in that fault handler, the OS would save state from the previous task on this core, and load state from the new task. Then change the control bit and return to user-space to rerun the instruction.

But these days most processes use XMM (and YMM) registers all over the place, in memcpy and other libc functions, and for copying/initializing structs. So a lazy strategy isn't worth it, and is just a lot of extra complexity, especially on an SMP system. That's why modern kernels don't do that anymore.

The control-register bits that a kernel would use to make x87, xmm, or ymm instructions fault are separate from the XCR0 bit we're checking, so even on a system using lazy context switching, your detection won't be fooled by the OS temporarily having the CPU set up so vaddps xmm0, xmm1, xmm2 would fault.

When SSE1 was new, there was no user-space-visible bit for detecting SSE-aware OSes without using an OS-specific API, but Intel learned from that mistake for AVX. (With SSE, the failure mode is still faulting, not corruption, though. The CPU boots up with SSE instructions set to fault: How do I enable SSE for my freestanding bootable code?)
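
Tangentially, here's a rough ring-0 sketch (my reconstruction, not code from that linked answer) of what "boots up with SSE instructions set to fault" means: the control-register bits a kernel or bootloader flips before SSE stops raising #UD. Running this in user-space would itself fault on the privileged mov.

/* Privileged (ring 0) only: enable SSE by clearing CR0.EM, setting CR0.MP,
 * and setting CR4.OSFXSR + CR4.OSXMMEXCPT. */
static void enable_sse(void)
{
    unsigned long cr;
    __asm__ volatile ("mov %%cr0, %0" : "=r" (cr));
    cr &= ~(1UL << 2);                  /* CR0.EM = 0: don't trap FP/SIMD */
    cr |=  (1UL << 1);                  /* CR0.MP = 1: monitor coprocessor */
    __asm__ volatile ("mov %0, %%cr0" : : "r" (cr));

    __asm__ volatile ("mov %%cr4, %0" : "=r" (cr));
    cr |= (1UL << 9) | (1UL << 10);     /* CR4.OSFXSR, CR4.OSXMMEXCPT */
    __asm__ volatile ("mov %0, %%cr4" : : "r" (cr));
}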

Peter Cordes
  • Thank you so much for this detailed answer Peter. Just one question, is xgetbv safe for older chips (especially AMD chips)? Do we need to do any special checks before running an xgetbv instruction? – Elliot Gorokhovsky Jun 06 '22 at 20:35
  • @ElliotGorokhovsky: Good point. As discussed in [Which versions of Windows support/require which CPU multimedia extensions? (How to check if SSE or AVX are fully usable?)](https://stackoverflow.com/q/34069054) - use CPUID to check for XGETBV support. – Peter Cordes Jun 06 '22 at 20:36