22

Modern CPU architectures employ performance optimizations that can result in out-of-order execution. In a single-threaded application memory reordering may also occur, but it is invisible to the programmer: memory appears to be accessed in program order. On SMP systems, memory barriers come to the rescue; they are used to enforce some form of memory ordering.

What I'm not sure about is multi-threading on a uniprocessor. Consider the following example: when thread 1 runs, the store to f could take place before the store to x. Say a context switch happens after f is written and right before x is written. Now thread 2 starts to run, exits the loop, and prints 0, which is undesirable of course.

// Both x and f are initialized to 0.
// Thread 1
x = 42;
f = 1;

// Thread 2
while (f == 0)
  ;
print x;

Is the scenario described above possible? Or is there a guarantee that physical memory is committed during a thread context switch?

According to this wiki,

When a program runs on a single-CPU machine, the hardware performs the necessary bookkeeping to ensure that the program executes as if all memory operations were performed in the order specified by the programmer (program order), so memory barriers are not necessary.

Although it doesn't explicitly mention uniprocessor multi-threaded applications, that case seems to be included.

I'm not sure whether that statement is correct or complete. Note that this may depend heavily on the hardware (weak vs. strong memory model), so you may want to mention the hardware you know in your answers. Thanks.

PS. Device I/O etc. is not my concern here. And it's a single-core uniprocessor.

Edit: Thanks to Nitsan for the reminder. We assume no compiler reordering here (just hardware reordering), and the loop in thread 2 is not optimized away. Again, the devil is in the details.
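To pin down that assumption, here is a sketch of the setup (using volatile purely to keep the compiler out of the picture; it says nothing about hardware ordering, which is exactly the part in question):

```cpp
#include <cstdio>

// Sketch only: volatile forbids the compiler from reordering these two
// accesses relative to each other or from deleting the loop. It does NOT
// constrain hardware reordering.
volatile int x = 0;
volatile int f = 0;

void thread1() {
    x = 42;
    f = 1;
}

void thread2() {
    while (f == 0)
        ;                    // not optimized away: f is volatile
    std::printf("%d\n", x);
}
```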

Eric Z
  • 14,327
  • 7
  • 45
  • 69
  • If you prohibit (C++) compiler's reordering, do you want know "how the compiled code(=machine code) run on the specific machine architecture"? You should specify target platform(machine arch.) and delete c++ tag. – yohjp Jan 08 '13 at 13:43
  • Man, compiler reordering and hardware reordering are two different things. When compiler is settled, hardware could still reorder the instructions/memory accesses at its own will based on a few principles. BTW, different languages may have different memory models, which is the reason I only wanna focus on C++. What do you think? – Eric Z Jan 08 '13 at 14:28
  • So isn't hardware instruction reordering a different issue than memory operations (i.e. the need for cache coherence, which isn't part of your scenario)? Even with a strong memory model, you still need memory barriers to prevent instruction reordering. – Chris O Jan 08 '13 at 14:43
  • 1. Yes, what I care about is memory access reordering, not instruction reordering. 2. No, memory barriers cannot prevent instruction reordering. Instruction can still be reordered as long as their memory commitment is ordered. But mostly you won't care about it as long as the memory access order is guaranteed as what you need. – Eric Z Jan 08 '13 at 23:24
  • @EricZ I also agree with you what compiler/hardware reordering is. As you know, new C++11 language has well-defined "Memory Model" for _multithreaded program_ on the _abstract machine_. Nevertheless C++11 compiler & target processor CAN legally _reorder_ any instructions with memory access or _optimize out_ it at compile-time and runtime, as long as that compiler/hardware reordering conform C++11 Memory Model rules, that is C++ "as-if rule". IMO, I'm afraid that ignoring compiler reordering/optimization may make this question meaningless. – yohjp Jan 09 '13 at 11:29
  • @yohjp, if you insist, you can always make `f` a volatile so that the loop condition is not optimized away;) Now, the problem(the question which I wanna focus on) is still there, right? – Eric Z Jan 09 '13 at 12:09
  • 4
    If I understand you correctly you're referring only to memory access reordering performed within the CPU? If so, I don't believe a context switch is possible between instructions which are already inside the CPU pipeline? In that case I believe the pipeline would be flushed and either all instructions would be committed or rolled back, making it impossible to see out-of-order memory access. – skoy Jan 09 '13 at 14:25
  • @skoy, yeah, that makes sense. Just cannot find any reference to support it. – Eric Z Jan 09 '13 at 23:20
  • @EricZ Practically you're right, but wrong theoretically. It should be `atomic` (or `atomic_int`) instead of `volatile int` for inter-thread communication under the C++11 (and C11) Memory Model. Please refer to [Atomic and volatile in C++11 memory model](http://stackoverflow.com/questions/8819095/), [Is volatile int in C as good as std::atomic of C++0x?](http://stackoverflow.com/questions/6627596/), [Why std::atomic overloads each method with the volatile-qualifier?](http://stackoverflow.com/questions/4870869/) etc. – yohjp Jan 10 '13 at 00:56

8 Answers

20

As a C++ question the answer must be that the program contains a data race, so the behavior is undefined. In reality that means that it could print something other than 42.

That is independent of the underlying hardware. As has been pointed out, the loop can be optimized away and the compiler can reorder the assignments in thread 1, so that result can occur even on uniprocessor machines.

[I'll assume that with "uniprocessor" machine, you mean processors with a single core and hardware thread.]

You now say, that you want to assume compiler reordering or loop elimination does not happen. With this, we have left the realm of C++ and are really asking about corresponding machine instructions. If you want to eliminate compiler reordering, we can probably also rule out any form of SIMD instructions and consider only instructions operating on a single memory location at a time.

So essentially thread 1 has two store instructions in the order store-to-x, store-to-f, while thread 2 has test-f-and-loop-if-not-zero (this may be multiple instructions, but involves a load-from-f) and then a load-from-x.

On any hardware architecture I am aware of or can reasonably imagine, thread 2 will print 42.

One reason is that, if the instructions processed by a single processor were not sequentially consistent among themselves, you could hardly assert anything about the effects of a program.

The only event that could interfere here is an interrupt (as is used to trigger a preemptive context switch). A hypothetical machine that stores the entire state of its current execution pipeline upon an interrupt and restores it upon return from the interrupt could produce a different result, but such a machine is impractical and afaik does not exist. These operations would create quite a bit of additional complexity and/or require additional redundant buffers or registers, all for no good reason - except to break your program. Real processors either flush or roll back the current pipeline upon interrupt, which is enough to guarantee sequential consistency for all instructions on a single hardware thread.

And there is no memory model issue to worry about. The weaker memory models originate from the per-processor buffers and caches that sit between the hardware processors and the main memory or nth-level cache they actually share. A single processor has no similarly partitioned resources and no good reason to have them for multiple (purely software) threads. Again there is no reason to complicate the architecture and waste resources to make the processor and/or memory subsystem aware of something like separate thread contexts, if there aren't separate processing resources (processors/hardware threads) to keep these resources busy.

JoergB
  • 4,383
  • 21
  • 19
  • all he asked is whether thread 2 would print out 0. the answer is well defined, NO. I am confused, first you say "In reality that means that it could print something other than 42.", then "On any hardware architecture I am aware of or can reasonably imagine, thread 2 will print 42". so what exactly is the reality, what the imagination? –  Jan 10 '13 at 19:27
  • @arrows: note how the one claim is based on a more specific set of assumptions, whereas the other is purely based on the high-level C++ language view, where this is UB and can cause all of you to get pregnant with hamsters at once. – PlasmaHH Jan 11 '13 at 15:15
  • By the rules of C++, it could print anything or do that hamster thing. – JoergB Jan 11 '13 at 18:56
  • [Retry, as edit timeout passed] By the rules of C++, it could print anything or do that hamster thing. With a real C++ compiler, it could print 0 (if that is the initial value of x) even on a uniprocessor (with interrupt-driven, preemptive multi-threading) due to compiler reordering. With the assumptions we are told to make (no compiler reordering, uniprocessor) the "... will print 42" statement holds. – JoergB Jan 11 '13 at 19:03
5

A strong memory ordering executes memory access instructions in the exact order defined by the program; it is often referred to as "program ordering".

A weaker memory ordering may be employed to allow the processor to reorder memory accesses for better performance; it is often referred to as "processor ordering".

AFAIK, the scenario described above is NOT possible on the Intel IA-32 architecture, whose processor ordering outlaws such cases. The relevant rule is (Intel IA-32 Software Developer's Manual, Vol. 3A, §8.2 Memory Ordering):

Writes are not reordered with other writes, with the exception of streaming stores, CLFLUSH, and string operations.

To illustrate the rule, it gives an example similar to this:

memory location x, y, initialized to 0;

thread 1:

mov [x], 1
mov [y], 1

thread 2:

mov r1, [y]
mov r2, [x]

r1 == 1 and r2 == 0 is not allowed

In your example, thread 1 cannot store f before storing x.

@Eric, in response to your comments:

The fast string store instruction stosd may store data out of order within its own operation. In a multiprocessor environment, when one processor stores a string str, another processor may observe str[1] being written before str[0], while the logical order presumes str[0] is written before str[1].

But these instructions are not reordered with any other stores, and they must have precise exception handling. When an exception occurs in the middle of stosd, the implementation may choose to delay it so that all out-of-order sub-stores (not necessarily the whole stosd instruction) commit before the context switch.

Edited to address the claims that this is a C++ question:

Even if this is considered in the context of C++, as I understand it, a standard-conforming compiler should NOT reorder the assignments of x and f in thread 1.

§1.9/14: Every value computation and side effect associated with a full-expression is sequenced before every value computation and side effect associated with the next full-expression to be evaluated.

  • +1 that is true. Even better if you could give some example of weak memory model which does swap the STOREs while committing all memory operations before a thread context switch, just like @skoy mentioned above. – Eric Z Jan 09 '13 at 23:27
  • Possibly Itanium does this, it is definitely a weak memory model. – Chris O Jan 10 '13 at 15:21
2

This isn't really a C or C++ question, since you've explicitly assumed no load/store re-ordering, which compilers for both languages are perfectly allowed to do.

Allowing that assumption for the sake of argument, note that the loop may never exit anyway, unless you either:

  • give the compiler some reason to believe f may change (eg, by passing its address to some non-inlineable function which could modify it)
  • mark it volatile, or
  • make it an explicitly atomic type and request acquire semantics

On the hardware side, your worry about physical memory being "committed" during a context switch isn't an issue. Both software threads share the same memory hardware and cache, so there's no risk of inconsistency there whatever consistency/coherence protocol pertains between cores.

Say both stores were issued, and the memory hardware decides to re-order them. What does this really mean? Perhaps f's address is already in cache, so it can be written immediately, but x's store is deferred until that cache line is fetched. Well, a read from x is dependent on the same address, so either:

  • the load can't happen until the fetch happens, in which case a sane implementation must issue the queued store before the queued load
  • or the load can peek into the queue and fetch x's value without waiting for the write

Consider anyway that the kernel pre-emption required to switch threads will itself issue whatever load/store barriers are required for consistency of the kernel scheduler state, and it should be obvious that hardware re-ordering can't be a problem in this situation.


The real issue (which you're trying to avoid) is your assumption that there is no compiler re-ordering: this is simply wrong.

Useless
  • 64,155
  • 6
  • 88
  • 132
  • You're not really reading my question, are you? Please note the comments I made in this thread before you ever post any answer. – Eric Z Jan 15 '13 at 01:58
  • I thought I addressed both your (invalid) compiler assumption and your hardware-level reordering question separately. Did you mean something else? – Useless Jan 15 '13 at 13:15
  • Just as I said in the comments of my post, I'm not saying that compiler optimization does not exist, and ignoring them subjectively. Instead, I'm trying to get to the point by simplifying the problem to focus on hardware reordering only. If you want, you can use various ways(e.g., `volatile`) to ensure compiler optimization doesn't take steps out of your expectation(e.g., optimize away the `for` loop). Your point regarding the hardware reordering makes sense, +1. – Eric Z Jan 15 '13 at 23:40
2

You would only need a compiler fence. From the Linux kernel docs on Memory Barriers (link):

SMP memory barriers are reduced to compiler barriers on uniprocessor compiled systems because it is assumed that a CPU will appear to be self-consistent, and will order overlapping accesses correctly with respect to itself.

To expand on that, the reason why synchronization is not required at the hardware level is that:

  1. All threads on a uniprocessor system share the same memory and thus there are no cache-coherency issues (such as propagation latency) that can occur on SMP systems, and

  2. Any out-of-order load/store instructions in the CPU's execution pipeline would either be committed or rolled back in full if the pipeline is flushed due to a preemptive context switch.
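As a sketch of what "reduced to a compiler barrier" can look like in C++11 (names x and f taken from the question): std::atomic_signal_fence orders the compiler-generated code without emitting a CPU fence instruction, which by the quoted reasoning is all a uniprocessor needs.

```cpp
#include <atomic>

int x = 0;
volatile int f = 0;  // volatile so the store and the loop are not elided

void publish() {
    x = 42;
    // Compiler barrier only: forbids the compiler from moving the store to x
    // below this point, but emits no hardware fence instruction.
    std::atomic_signal_fence(std::memory_order_release);
    f = 1;
}

int wait_and_read() {
    while (f == 0)
        ;  // on a uniprocessor the scheduler will eventually run publish()
    std::atomic_signal_fence(std::memory_order_acquire);
    return x;
}
```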

etherice
  • 1,761
  • 15
  • 25
1

This code may well never finish (in thread 2), as the compiler can decide to hoist the whole expression out of the loop (this is similar to using an isRunning flag which is not volatile). That said, you need to worry about two types of reordering here: compiler and CPU; both are free to move the stores around. See here for an example: http://preshing.com/20120515/memory-reordering-caught-in-the-act. At this point the code you describe above is at the mercy of the compiler, compiler flags, and the particular architecture. The wiki quoted is misleading, as it may suggest internal reordering is not at the mercy of the CPU/compiler, which is not the case.

Nitsan Wakart
  • 2,841
  • 22
  • 27
  • Let's assume compiler reordering is not happening to get focused on hardware reordering. Just edited my post. – Eric Z Jan 06 '13 at 13:35
1

As far as x86 is concerned, out-of-order stores are made consistent from the viewpoint of the executing code with regard to program flow. In this case, "program flow" is just the stream of instructions that a processor executes, not something constrained to a "program running in a thread". All the instructions necessary for context switching, etc. are considered part of this stream, so the consistency is maintained across threads.

MJZ
  • 1,074
  • 6
  • 12
  • I do not follow this reasoning because your statements regarding consistency across threads are clearly false on a multi-processor machine. – usr Jan 06 '13 at 18:03
0

A context switch has to store the complete machine state so that it can be restored before the suspended thread resumes execution. The machine state includes the processor registers but not the processor pipeline.

If you assume no compiler reordering, this means that all hardware instructions that are "in flight" have to be completed before a context switch (i.e. an interrupt) takes place; otherwise they would be lost, as they are not stored by the context-switch mechanism. This is independent of hardware reordering.

In your example, even if the processor swaps the two hardware instructions "x=42" and "f=1", the instruction pointer is already past the second one, and therefore both instructions must be completed before the context switch begins. If it were not so, they would be lost, since the contents of the pipeline and of the cache are not part of the "context".

In other words, if the interrupt that causes the ctx switch happens when the IP register points at the instruction following "f=1", then all instructions before that point must have completed all their effects.

knulp
  • 170
  • 9
0

From my point of view, the processor fetches instructions one by one. In your case, if "f = 1" was speculatively executed before "x = 42", that means both instructions are already in the processor's pipeline. The only possible way to schedule the current thread out is an interrupt. But the processor (at least on x86) will flush the pipeline's instructions before serving the interrupt. So there is no need to worry about reordering on a uniprocessor.

zhebin jin
  • 85
  • 1
  • 6