When a process is context switched, its state must be saved somewhere so that it can resume from the same point later. The state of a running process includes the data held in the caches and in the store buffer. On x86, the cache-coherency protocol ensures that the part of the program's state held in the caches becomes visible to all the other cores. However, this is not true for store buffers, which by design are not coherent with the rest of the memory system. A write to memory first lands in the store buffer, before it drains out to the coherent caches. As a result, while a program is running, the part of its state that resides in the store buffer may not be immediately visible to all cores.

So when this process gets context switched and assigned to a different core, it is possible that the new core has not seen the writes the process performed on the old core, as they might still be sitting in the old core's store buffer. There is therefore a possibility of losing state during a context switch. That's why I think the store buffer must be flushed during context switches to make the pending writes visible to the whole system.
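As an aside, the reason I believe pending stores can genuinely be invisible to another core is the classic store-buffer litmus test. Below is my own minimal sketch of it in C (the thread names and the use of relaxed C11 atomics are my choices; the relaxed accesses stand in for plain stores and loads; compile with -pthread):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/* Two flags, both initially 0; relaxed atomics stand in for plain
   stores and loads (no extra ordering is requested). */
atomic_int flag0 = 0, flag1 = 0;
int r0, r1;

void *thread0(void *arg)
{
    (void)arg;
    atomic_store_explicit(&flag0, 1, memory_order_relaxed); /* store... */
    r0 = atomic_load_explicit(&flag1, memory_order_relaxed); /* ...then load */
    return NULL;
}

void *thread1(void *arg)
{
    (void)arg;
    atomic_store_explicit(&flag1, 1, memory_order_relaxed);
    r1 = atomic_load_explicit(&flag0, memory_order_relaxed);
    return NULL;
}

int main(void)
{
    pthread_t t0, t1;
    pthread_create(&t0, NULL, thread0, NULL);
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    /* r0 == 0 && r1 == 0 is possible on x86: each store may still be in
       its core's store buffer when the other core's load runs. */
    printf("r0=%d r1=%d\n", r0, r1);
    return 0;
}

Both r0 and r1 ending up 0 is an allowed outcome on x86 precisely because each store can still be sitting in its core's store buffer when the other core's load executes. That said, the scenario I am actually worried about is a single process migrating between cores.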
For example, suppose that the process is running the following pseudocode on Core-0:
bool x = false; // assume x is NOT register allocated; it lives somewhere in memory
// ... some random work ...
x = true;
/* ---------- context switched here; process now assigned to Core-1 ---------- */
if (x)
{
    // do something
}
else
{
    // do something else
}
In this case the write done to the variable x (x = true) will be put in the store buffer of Core-0, as the program was running on Core-0 initially. After the context switch, the program is assigned to Core-1, which may not be aware of the write x = true. Since Core-1 has no idea that the value of x has changed, the program may end up on the else path even though, per the logic of the source code, it should have taken the if path.

Clearly, in this example, if the part of the program's state holding the value of x is not saved during the context switch, the program may execute the wrong branch.
However, I have never seen this happen in real life, so I hypothesize that the OS must somehow be ensuring that the program state in the store buffer becomes visible to the whole system. One way to do this would be to add an mfence instruction to the context-switch code, which would drain the store buffer during the switch. However, while examining the Linux source code, I was not able to find an mfence or any other instruction that flushes the store buffer. So my question is: does the store buffer get flushed during context switches or not? And if it is not flushed, how is the OS able to ensure correctness across context switches?
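For concreteness, this is the kind of thing I expected to find in the switch path. Everything below is hypothetical: do_context_switch, save_cpu_state and restore_cpu_state are made-up names (not actual Linux functions), and the sketch only shows where I imagined the barrier would sit:

struct task;                               /* hypothetical task descriptor       */
void save_cpu_state(struct task *prev);    /* hypothetical: save registers etc.  */
void restore_cpu_state(struct task *next); /* hypothetical: restore next's state */

void do_context_switch(struct task *prev, struct task *next)
{
    save_cpu_state(prev);
    /* Drain this core's store buffer: after mfence, every store made by
       `prev` is globally visible, so whichever core runs `prev` next
       cannot miss them. This is what I expected but could not find. */
    __asm__ __volatile__("mfence" ::: "memory");
    restore_cpu_state(next);
}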