Memory Protection Keys Memory Reordering

Question

Intel's SDM section on memory protection keys (MPK) doesn't describe the wrpkru instruction as serializing, or as implicitly enforcing memory ordering.

First, it would be surprising if it didn't enforce some sort of ordering, since the programmer presumably doesn't want memory accesses around a wrpkru to execute out of order.

Second, does that mean wrpkru needs to be surrounded by lfence?

Linux and glibc don't use any sort of fence after the write. But shouldn't that be included in the SDM?

You might want to look at the source code to `pkey_mprotect()`, to see if they did anything to protect it. — Barmar, Jul 24 '18 at 20:49
@Barmar `pkey_mprotect` doesn't use `wrpkru`; it modifies the PTE to set the pkey number and then invalidates/shoots down the TLB entry if cached. — Mohammad Hedayati, Jul 24 '18 at 20:54
I misread [this page](https://www.kernel.org/doc/Documentation/x86/protection-keys.txt). It says `pkey_set()` is the wrapper for `wrpkru`. — Barmar, Jul 24 '18 at 20:56
@Barmar I edited the question. `pkey_set()` doesn't have a fence. — Mohammad Hedayati, Jul 24 '18 at 21:01
As near as anyone can tell, LFENCE doesn't do anything: https://stackoverflow.com/questions/20316124/does-it-make-any-sense-to-use-the-lfence-instruction-on-x86-x86-64-processors — Ross Ridge, Jul 24 '18 at 21:45
@RossRidge: I assumed the OP was suggesting LFENCE to block out-of-order execution in the CPU pipeline, which is a guaranteed (on Intel only) effect of `lfence` ([Are loads and stores the only instructions that gets reordered?](https://stackoverflow.com/q/50494658)), not for its (non-existent) memory-ordering effect. (i.e. it is only relevant for MOVNTDQA loads from WC memory reordering with other loads, which nobody cares about.) — Peter Cordes, Jul 24 '18 at 22:09
@PeterCordes Well, the question does mention memory ordering a few times. The problem here, as your answer explains, is that memory ordering is a red herring. — Ross Ridge, Jul 24 '18 at 22:41

score 2 · Accepted Answer · answered Jul 24 '18 at 21:36

I'd assume that the CPU preserves the illusion of running a single thread in program order, as always. That's the cardinal rule of out-of-order execution. Accesses before wrpkru are done with the old PKRU, accesses after are done with the new PKRU.

Just like how modifying the MXCSR affects later FP instructions but not earlier instructions, or modifying a segment register affects later but not earlier loads/stores.

It's up to the implementation whether it wants to rename the PKRU, the MXCSR, or segment registers. If it doesn't rename the PKRU, then it has to complete all pending loads/stores before changing the PKRU and allowing later loads/stores to execute. (i.e. the microcode for wrpkru could include the uops for lfence if that's how it's implemented.)

All memory accesses have a dependency on the last wrpkru instruction, and the last write to the relevant segment register, and the last write to cr3 (the top-level page table), and the last change of privilege level (syscall / iret / whatever). Also on the last store to that location, and you never need a fence to see your own most recent stores. It's up to the CPU architects to build hardware that runs fast while preserving the illusion of program order.

e.g. Intel CPUs since at least Core2 have renamed the x87 FP control word, so old binaries that implement (int)fp_var by changing the x87 rounding mode to truncate and then back to nearest don't serialize the FPU. Some CPUs do rename segment registers according to Agner Fog's testing, but my testing shows that Skylake doesn't; see "Is a mov to a segmentation register slower than a mov to a general purpose register?".

I'm not familiar with MPK, but why would it be a problem for memory accesses to happen out of order as long as they all use the correct PKRU value, and they don't violate any of x86's normal memory-ordering rules?

(Only StoreLoad reordering is allowed to be visible to other threads. Internally a CPU can execute loads earlier than they're "supposed to", but it verifies that the cache line wasn't invalidated before the point where the load was architecturally allowed to happen. This is what the Memory Order Buffer does.)

In C/C++, of course you need some kind of barrier against compile-time reordering of accesses around the wrapper function. Normally a non-inline function call is sufficient, like for pthread_mutex_lock(); see "How does a mutex lock and unlock functions prevents CPU reordering?".
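As a rough illustration of the compile-time barrier, here is a minimal sketch of an inlinable wrapper, assuming GCC/Clang inline-asm syntax. The names `pkru_with_rights`, `sketch_rdpkru`, `sketch_wrpkru` and `sketch_pkey_set` are hypothetical, not glibc's. The `"memory"` clobber is what stops the compiler from moving loads/stores across the `wrpkru`; it emits no fence instruction. The instructions are written as raw bytes so that older assemblers accept them, and they fault on CPUs or kernels without protection-key support.

```c
#include <assert.h>
#include <stdint.h>

/* Rights bits per key: bit 0 = access-disable, bit 1 = write-disable
 * (same values as Linux's PKEY_DISABLE_ACCESS / PKEY_DISABLE_WRITE). */
#define SKETCH_PKEY_DISABLE_ACCESS 0x1u
#define SKETCH_PKEY_DISABLE_WRITE  0x2u

/* Pure helper (hypothetical name): return `pkru` with the two bits
 * belonging to `key` replaced by `rights`. PKRU holds 16 keys x 2 bits. */
static inline uint32_t pkru_with_rights(uint32_t pkru, int key, uint32_t rights)
{
    assert(key >= 0 && key < 16);
    assert((rights & ~0x3u) == 0);
    uint32_t shift = 2u * (uint32_t)key;
    uint32_t mask = 0x3u << shift;
    return (pkru & ~mask) | (rights << shift);
}

#if defined(__x86_64__) || defined(__i386__)
/* RDPKRU (0F 01 EE): requires ECX = 0, returns PKRU in EAX, zeroes EDX. */
static inline uint32_t sketch_rdpkru(void)
{
    uint32_t eax, edx;
    __asm__ volatile(".byte 0x0f, 0x01, 0xee"
                     : "=a"(eax), "=d"(edx)
                     : "c"(0));
    (void)edx;
    return eax;
}

/* WRPKRU (0F 01 EF): requires ECX = 0 and EDX = 0, writes EAX to PKRU.
 * The "memory" clobber is a compiler barrier only: memory accesses in
 * the C source stay on their side of this statement after optimization. */
static inline void sketch_wrpkru(uint32_t pkru)
{
    __asm__ volatile(".byte 0x0f, 0x01, 0xef"
                     :
                     : "a"(pkru), "c"(0), "d"(0)
                     : "memory");
}

/* Read-modify-write of the current thread's PKRU for one key. */
static inline void sketch_pkey_set(int key, uint32_t rights)
{
    sketch_wrpkru(pkru_with_rights(sketch_rdpkru(), key, rights));
}
#endif
```

With such a wrapper, a call like `sketch_pkey_set(1, SKETCH_PKEY_DISABLE_WRITE)` keeps stores written before it in the source from being sunk below the PKRU update by the compiler, even when the wrapper is inlined.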

The earlier part of this answer is about ordering in assembly.

Intel implemented `WRPKRU` in a way that prevents data loads and stores from being reordered around it (like `MFENCE`). However, due to the Spectre 1.1 vulnerability (speculative buffer overflow), it turns out it is necessary for `WRPKRU` to perform `LFENCE`, not just `MFENCE`. But AFAIK, Intel has not released a patch for it yet. So the OP is right that it needs to have the same serializing properties as `LFENCE`. But unfortunately it does not work that way on current processors. — Hadi Brais, Jul 24 '18 at 23:25
