I'm learning about software prefetch instructions.
I want to add fence instructions to serialize prefetch instructions, but according to the manual, prefetch is not ordered with respect to fences.

The manual does say that prefetch is ordered with respect to serializing instructions such as CPUID. But I have only ever used CPUID to read processor information. How can I use CPUID to serialize prefetch instructions?

A PREFETCHWT1 instruction is considered a hint to this speculative behavior. Because this speculative fetching can occur at any time and is not tied to instruction execution, a PREFETCHWT1 instruction is not ordered with respect to the fence instructions (MFENCE, SFENCE, and LFENCE) or locked memory references. A PREFETCHWT1 instruction is also unordered with respect to CLFLUSH and CLFLUSHOPT instructions, other PREFETCHWT1 instructions, or any other general instruction. It is ordered with respect to serializing instructions such as CPUID, WRMSR, OUT, and MOV CR.

A comment on a previous question, "How to use CPUID as a serializing instruction?", suggests:

`#include <intrin.h>` and use the `__cpuid()` function.

But the documentation I found online says that this intrinsic is for reading CPU information:

https://learn.microsoft.com/en-us/cpp/intrinsics/cpuid-cpuidex?view=msvc-170

Generates the cpuid instruction that is available on x86 and x64. This instruction queries the processor for information about supported features and the CPU type.

How do I use CPUID for serialization?

Gerrie
  • What's the use case? If you're using prefetch for performance, normally you don't want to stop and wait for it; that would entirely defeat the purpose. (Or to stop it from happening any earlier than some point, although that's something you *can* do with `lfence`: `lfence` prevents any later instructions from executing, including prefetches, and they can't have an effect until they reach an execution unit. See comments on [how to do mmap for cacheable PCIe BAR](https://stackoverflow.com/posts/comments/131412607).) – Peter Cordes Nov 16 '22 at 08:45
  • As for *how* CPUID serializes the pipeline, it just does. It's one of the guaranteed effects of the microcode implementation, that it drains the ROB and store buffer, and anything else, before it executes, and before any later instructions execute. It's quite slow, normally not a useful part of anything you're doing for performance reasons. – Peter Cordes Nov 16 '22 at 08:46
  • `cpuid` is a real full serializing instruction. It also serializes the front end, not just the back end. It's unclear to me why Intel would use such an instruction for this purpose. It's pretty useless as a serializer. There's a `serialize` instruction now, but I never bothered to figure out when it was introduced. You almost never need such a full serializer; one use case was a side-channel attack on the uop cache (see [I See Dead μops: Leaking Secrets via Intel/AMD Micro-Op Caches](https://www.cs.virginia.edu/venkat/papers/isca2021a.pdf), it's very informative). – Margaret Bloom Nov 16 '22 at 08:53
  • Thank you! I'm not doing it for performance reasons; I'm doing it for security reasons. But the quote above (from the Intel manual) seems to imply that prefetch is not affected by `lfence`. – Gerrie Nov 16 '22 at 08:53
  • I seem to have found a peer; I'm also doing research on cache side channels. – Gerrie Nov 16 '22 at 08:54
  • @Gerrie `cpuid` can be used both to read information about the CPU and to serialize the instruction stream (from pre-decoding onward, I guess). That's why it can serialize `prefetchwt1` too: it just prevents the prefetch from being (pre?)decoded/issued. If you put a `cpuid` before `prefetchwt1`, it will make sure the latter is "executed" only after the former. If you don't want any instruction later than the prefetch to be executed before it, you probably need another `cpuid` after the prefetch (sandwiching `prefetchwt1` between two `cpuid` instructions). But `cpuid` clobbers most GP registers, so it's annoying to use. – Margaret Bloom Nov 16 '22 at 09:01
  • But you probably don't need a `cpuid` after the prefetch, because the execution of its uops probably just means "prefetch when possible", so it's async anyway. – Margaret Bloom Nov 16 '22 at 09:03
  • @MargaretBloom: `lfence` also prevents any later instructions from being executed, so `lfence` ; `prefetcht0` is useful. Prefetch operations aren't affected by fences, so `prefetcht0` ; `lfence` isn't directly useful, but the reverse is (potentially), because out of those two conflicting things (`lfence` blocks later instructions vs. prefetch not being ordered), `lfence` wins because of how CPUs work: prefetch instructions can't do anything until their uop is sent to an execution unit. – Peter Cordes Nov 16 '22 at 10:07
  • @PeterCordes Hmm, Intel wrote that `prefetchwt1` is ordered wrt `cpuid` but not `lfence`. Assuming a prefetch cannot do anything before its uop reaches an EU, then both `lfence` and `cpuid` would do to order it wrt earlier instructions. Assuming that the prefetch is done async at some point after it's been executed, `lfence` can do nothing to order the actual execution wrt later instructions. But *maybe* `cpuid`, according to the quote above, may also wait for pending prefetches, thereby ordering them wrt later instructions. – Margaret Bloom Nov 16 '22 at 11:56
  • @MargaretBloom: Yes, that's what I said, that `lfence`'s spec of not letting later instructions exec "beats" prefetch being unordered, as long as prefetch is implemented as a back-end uop that has to issue and dispatch to exec after it decodes. This might technically just be an implementation detail, and maybe prefetch's wording is supposed to override even `lfence`'s execution guarantee as well as its memory-order guarantees, but it's how real CPUs have to work. And yeah, good question of what it even means for a prefetch to be ordered by serializing. Flush and clear LFBs, too? – Peter Cordes Nov 16 '22 at 12:02
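
The instruction orderings discussed in the comments above can be sketched in C roughly as follows. This is a minimal sketch, assuming GCC or Clang on x86-64; all helper names (`serialize_with_cpuid`, `prefetch_after_cpuid`, `prefetch_between_cpuid`, `prefetch_after_lfence`) are hypothetical, and `prefetcht0` stands in for `prefetchwt1`, which needs hardware and compiler support that may not be available. CPUID is issued through inline assembly with `volatile` and a `"memory"` clobber, on the assumption that a call whose outputs are discarded could otherwise be optimized away. The sketch does not settle the open question from the comments of whether CPUID also waits for an already-issued prefetch to complete.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>  /* _mm_prefetch, _mm_lfence, _MM_HINT_T0 */

/* Run CPUID only for its serializing side effect; the results are discarded.
 * volatile and the "memory" clobber keep the compiler from deleting the
 * instruction or moving memory accesses across it. */
static inline void serialize_with_cpuid(void)
{
    unsigned int eax = 0, ebx, ecx = 0, edx;   /* leaf 0, subleaf 0 */
    __asm__ __volatile__("cpuid"
                         : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                         :
                         : "memory");
    (void)eax; (void)ebx; (void)ecx; (void)edx;
}

static inline void load_fence(void) { _mm_lfence(); }

static inline void prefetch_t0(const void *p)
{
    _mm_prefetch((const char *)p, _MM_HINT_T0);
}
#else
/* Assumption: on other targets the helpers are empty stubs so the file
 * still compiles; they provide no ordering at all. */
static inline void serialize_with_cpuid(void) {}
static inline void load_fence(void) {}
static inline void prefetch_t0(const void *p) { (void)p; }
#endif

/* cpuid ; prefetcht0
 * The prefetch is not decoded/issued until CPUID has completed. */
static inline void prefetch_after_cpuid(const void *p)
{
    assert(p != NULL);
    serialize_with_cpuid();
    prefetch_t0(p);
}

/* cpuid ; prefetcht0 ; cpuid
 * The second CPUID keeps later instructions behind the prefetch instruction.
 * Whether it also waits for the cache fill itself is not established here. */
static inline void prefetch_between_cpuid(const void *p)
{
    assert(p != NULL);
    serialize_with_cpuid();
    prefetch_t0(p);
    serialize_with_cpuid();
}

/* lfence ; prefetcht0
 * Relies on prefetch being a back-end uop that cannot execute before LFENCE
 * completes, which the comments describe as an implementation detail. */
static inline void prefetch_after_lfence(const void *p)
{
    assert(p != NULL);
    load_fence();
    prefetch_t0(p);
}
```

A call such as `prefetch_after_cpuid(buf)` on any valid buffer is enough to try it out. Since the compiler may still schedule surrounding code, inspecting the generated assembly is the only way to confirm the instruction order actually emitted.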
