1

Assume we are using a GCC (or GCC-compatible) compiler on an x86-64 architecture, and that eax, ebx, ecx, edx and level are variables (unsigned int or unsigned int*) used for the input and output of the instruction (like here).

asm("CPUID":::);
asm volatile("CPUID":::);
asm volatile("CPUID":::"memory");
asm volatile("CPUID":"=a"(eax),"=b"(ebx),"=c"(ecx),"=d"(edx)::"memory");
asm volatile("CPUID":"=a"(eax):"0"(level):"memory");
asm volatile("CPUID"::"a"(level):"memory"); // Not sure of this syntax
asm volatile("CPUID":"=a"(eax),"=b"(ebx),"=c"(ecx),"=d"(edx):"0"(level):"memory");
asm("CPUID":"=a"(eax),"=b"(ebx),"=c"(ecx),"=d"(edx):"0"(level):"memory");
asm volatile("CPUID":"=a"(eax),"=b"(ebx),"=c"(ecx),"=d"(edx):"0"(level));
  • I am not used to inline assembly syntax, and I am wondering what the difference would be between all these calls, in a context where I just want to use CPUID as a serializing instruction (i.e. nothing will be done with the output of the instruction).
  • Can some of these calls lead to errors?
  • Which one(s) of these calls would be the most suitable (given that I want as little overhead as possible, but at the same time the "strongest" serialization possible)?
Peter Cordes
Vincent
  • What do you want to achieve by serializing? Your list of inline `asm` looks like it was picked at random. All those outputs look strange if you don't need the cpuid information. – llllllllll Jan 30 '18 at 14:09

1 Answer

2

First of all, lfence may be strongly serializing enough for your use-case, e.g. for rdtsc. If you care about performance, check and see if you can find evidence that lfence is strong enough (at least for your use-case). Possibly even using both mfence; lfence might be better than cpuid, if you want to e.g. drain the store buffer before an rdtsc.
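
For instance, a minimal sketch of the lfence + rdtsc idiom (hypothetical helper name; assuming you only need to order rdtsc after earlier instructions, not drain the store buffer first):

#include <stdint.h>

// Sketch: lfence keeps rdtsc from executing until earlier instructions
// have completed locally; rdtsc returns the TSC in EDX:EAX.
static inline uint64_t rdtsc_after_lfence(void)
{
    uint32_t lo, hi;
    asm volatile("lfence\n\t"
                 "rdtsc"
                 : "=a"(lo), "=d"(hi)
                 :
                 : "memory");
    return ((uint64_t)hi << 32) | lo;
}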

But neither lfence nor mfence is serializing on the whole pipeline in the official technical-terminology sense, which could matter for cross-modifying code: a truly serializing instruction discards instructions that might have been fetched before some stores from another core became visible.


2. Yes, all the ones that don't tell the compiler that the asm statement writes E[A-D]X are dangerous and will likely cause hard-to-debug weirdness. (i.e. you need to use (dummy) output operands or clobbers).

You need volatile, because you want the asm code to be executed for the side-effect of serialization, not to produce the outputs.

If you don't want to use the CPUID result for anything (e.g. do double duty by serializing and querying something), you should simply list the registers as clobbers, not outputs, so you don't need any C variables to hold the results.

// volatile is already implied because there are no output operands
// but it doesn't hurt to be explicit.

// Serialize and block compile-time reordering of loads/stores across this
asm volatile("CPUID"::: "eax","ebx","ecx","edx", "memory");

// the "eax" clobber covers RAX in x86-64 code, you don't need an #ifdef __i386__

I am wondering what would be the difference between all these calls

First of all, none of these are "calls". They're asm statements, and they're inlined into the function where you use them. CPUID itself is not a "call" either, although I guess you could look at it as calling a microcode function built into the CPU. But by that logic, every instruction is a "call"; e.g. mul rcx takes inputs in RAX and RCX, and returns the result in RDX:RAX.


The first three (and the later one with no outputs, just a level input) destroy RAX through RDX without telling the compiler. The compiler will assume that those registers still hold whatever it was keeping in them. They're obviously unusable.


asm("CPUID":"=a"(eax),"=b"(ebx),"=c"(ecx),"=d"(edx):"0"(level):"memory"); (the one without volatile) will optimize away if you don't use any of the outputs. And if you do use them, it can still be hoisted out of loops. A non-volatile asm statement is treated by the optimizer as a pure function with no side effects. https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#index-asm-volatile

It has a memory clobber, but (I think) that doesn't stop it from optimizing away, it just means that if / when / where it does run, any variables it could possibly read / write are synced to memory, so memory contents match what the C abstract machine would have at that point. This may exclude locals that haven't had their address taken, though.
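
A sketch of that hoisting hazard (hypothetical function, not from the question): because the statement below is not volatile, GCC may treat it as a pure function of level and run CPUID once before the loop instead of once per iteration.

unsigned sum_cpuid_eax(unsigned level, int n)
{
    unsigned sum = 0;
    for (int i = 0; i < n; i++) {
        unsigned eax, ebx, ecx, edx;
        // not volatile: treated as a pure function of its inputs
        asm("CPUID" : "=a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
                    : "0"(level) : "memory");
        sum += eax;   // an output is used, so the statement isn't deleted,
                      // but it can still be hoisted out of the loop
    }
    return sum;
}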

asm("" ::: "memory") is very similar to std::atomic_thread_fence(std::memory_order_seq_cst), but note that that asm statement has no outputs, and thus is implicitly volatile. That's why it isn't optimized away, not because of the "memory" clobber itself. A (volatile) asm statement with a memory clobber is a compiler barrier against reordering loads or stores across it.

The optimizer doesn't care at all what's inside the first string literal, only the constraints / clobbers, so asm volatile("anything" ::: register clobbers, "memory") is also a compile-time-only memory barrier. I assume this is what you want, to serialize some memory operations.
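
A common way to package that compile-time barrier (a sketch; the Linux kernel's barrier() macro is essentially this):

// Emits no instruction, but the compiler may not reorder memory accesses
// across it or keep memory values cached in registers across it.
#define compiler_barrier() asm volatile("" ::: "memory")

void publish(int *data, int *flag)   // hypothetical example
{
    *data = 42;
    compiler_barrier();  // the store to *data can't be moved below this point at compile time
    *flag = 1;           // (on x86 the CPU keeps store-store order anyway)
}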


"0"(level) is a matching constraint for the first operand (the "=a"). You could equally have written "a"(level), because in this case the compiler doesn't have a choice of which register to select; the output constraint can only be satisfied by eax. You could also have used "+a"(eax) as the output operand, but then you'd have to set eax=level before the asm statement. Matching constraints instead of read-write operands are sometimes necessary for x87 stack stuff; I think that came up once in an SO question. But other than weird stuff like that, the advantage is being able to use different C variables for input and output, or not using a variable at all for the input. (e.g. a literal constant, or an lvalue (expression)).

Anyway, telling the compiler to provide an input will probably result in an extra instruction, e.g. level=0 would result in an xor-zeroing of eax. This would be a waste of an instruction if it didn't already need a zeroed register for anything earlier. Normally xor-zeroing an input would break a dependency on the previous value, but the whole point of CPUID here is that it's serializing, so it has to wait for all previous instructions to finish executing anyway. Making sure eax is ready early is pointless; if you don't care about the outputs, don't even tell the compiler your asm statement takes an input. Compilers make it difficult or impossible to use an undefined / uninitialized value with no overhead; sometimes leaving a C variable uninitialized will result in loading garbage from the stack, or zeroing a register, instead of just using a register without writing it first.
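
To illustrate that last point, here is roughly what GCC might emit for the two variants (a sketch; exact output depends on surrounding code and optimization level):

unsigned eax, ebx, ecx, edx;

// With an input, the compiler has to put level (here 0) in EAX first:
asm volatile("CPUID" : "=a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx) : "0"(0) : "memory");
//   -> xor  eax, eax
//      cpuid

// Without an input, nothing has to be set up:
asm volatile("CPUID" ::: "eax", "ebx", "ecx", "edx", "memory");
//   -> cpuid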

Peter Cordes
  • I'm just curious, from the viewpoint of a user-land program, is there anything that only `CPUID` can achieve but not memory order? – llllllllll Jan 30 '18 at 14:41
  • 2
    @liliscent: Yes, `rdtsc` was the classic example. If you want to make sure `rdtsc` doesn't sample the clock before previous instructions have completed (including ALU instructions), you need a barrier on the *instruction stream*, not data loads/stores. So `cpuid; rdtsc` is a common idiom. However, `lfence;rdtsc` works too, at least on Intel, because `lfence` is implemented as a serializing instruction. With no NT loads / stores in flight, `lfence` is (was) architecturally a no-op, but *micro*architecturally it's serializing. `mfence` may not work, even though it's a stronger mem barrier. – Peter Cordes Jan 30 '18 at 15:20
  • @liliscent: Spectre v1 (bounds-check bypass) mitigation is another recent example. So important that it led Intel to document `lfence` as serializing the instruction stream, and thus blocking speculative execution past previous branches. Intel whitepapers and sample code did use `lfence ; rdtsc`, but it was hard to find out *why* that was valid, because [the insn ref manual entry for `lfence`](https://github.com/HJLebbink/asm-dude/wiki/LFENCE) doesn't (currently) mention anything about serializing instructions, only memory operations. – Peter Cordes Jan 30 '18 at 15:24
  • See also https://stackoverflow.com/questions/12631856/difference-between-rdtscp-rdtsc-memory-and-cpuid-rdtsc. `mfence` is serializing on AMD (and `lfence` maybe isn't; it has 4 per clock throughput on Bulldozer/Ryzen...) And newer CPUs support `rdtscp` which has a built-in one-way barrier. There's a link in [this old answer I wrote](https://stackoverflow.com/a/39002894/224132) with some older (pre Spectre) details about LFENCE serializing RDTSC. – Peter Cordes Jan 30 '18 at 15:29
  • Thanks for detailed explanation ! – llllllllll Jan 30 '18 at 15:30
  • @liliscent: I was trying to figure out if you were asking about the `lfence` *instruction*, or really just the memory-ordering effect of instructions like `lfence` and `mfence` :P It's complicated because of memory-ordering instructions also having serializing effects on some CPUs... (And I'm not 100% sure that `lfence` is as strongly serializing as `cpuid`, or about mfence vs. lfence on AMD vs. Intel; that would probably make a good question.) – Peter Cordes Jan 30 '18 at 15:44
  • @llllllllll: Update: a true serializing instruction can matter for stale code-fetch in cross-modifying code. My previous comments were wildly optimistic about `lfence`; it doesn't even wait for the store-buffer to drain. It only serializes instruction execution (drain the ROB before dispatch of any instruction after `lfence`). I updated my answer, linking [Is there a cheaper serializing instruction than cpuid?](https://stackoverflow.com/a/75456027) for more details. – Peter Cordes Feb 20 '23 at 07:35