8

I have read the article Synchronization and Multiprocessor Issues and I have a question about InterlockedCompareExchange and InterlockedExchange. It is actually about the last example in the article. They have two variables, iValue and fValueHasBeenComputed, and in CacheComputedValue() they modify each of them using InterlockedExchange:

InterlockedExchange ((LONG*)&iValue, (LONG)ComputeValue());  // don't understand
InterlockedExchange ((LONG*)&fValueHasBeenComputed, TRUE); // understand

I understand that I can use InterlockedExchange for modifying iValue, but is it enough to just do

iValue = ComputeValue();

So is it actually necessary to use InterlockedExchange to set iValue? Or will other threads see iValue correctly even with a plain iValue = ComputeValue();? I mean, will the other threads see iValue correctly because there is an InterlockedExchange after it?
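For context, this is roughly what the article's example looks like (a sketch reconstructed from memory, so the declarations and the reader side may differ from the article in detail; I use LONG directly to avoid the casts):

#include <windows.h>

LONG iValue = 0;                     // the cached result
LONG fValueHasBeenComputed = FALSE;  // the "ready" flag

extern LONG ComputeValue();          // some expensive computation

void CacheComputedValue()
{
    // the article's fixed version uses InterlockedExchange for BOTH writes
    InterlockedExchange(&iValue, ComputeValue());
    InterlockedExchange(&fValueHasBeenComputed, TRUE);
}

BOOL FetchComputedValue(LONG *piResult)
{
    if (fValueHasBeenComputed)       // only read iValue after the flag is set
    {
        *piResult = iValue;
        return TRUE;
    }
    return FALSE;
}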

There is also the paper A Principle-Based Sequential Memory Model for Microsoft Native Code Platforms. Its example 3.1.1 contains more or less the same code. One of the recommendations is "Make y interlocked". Notice: not both y and x.

Update
Just to clarify the question: the issue is that I see a contradiction. The example from "Synchronization and Multiprocessor Issues" uses two InterlockedExchange calls. In contrast, in example 3.1.1 "Basic Reordering" (which I think is quite similar to the first example) Herb Sutter gives this recommendation:

"Make y interlocked: If y is interlocked, then there is no race on y because it is atomically updatable, and there is no race on x because a -> b -> d."

In this draft Herb does not use two interlocked variables (if I am right, he means using InterlockedExchange only for y).
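As I read it, the pattern in example 3.1.1 boils down to something like the following (my own paraphrase using Win32 names, not the paper's actual code; the value 42 and the function names are just for illustration):

#include <windows.h>
#include <stdio.h>

LONG x = 0;   // the payload (a plain variable in the paper's example)
LONG y = 0;   // the flag (the only variable the recommendation makes interlocked)

void Writer()
{
    x = 42;                          // (a) plain store to the payload
    InterlockedExchange(&y, 1);      // (b) interlocked store to the flag only
}

void Reader()
{
    // (c) interlocked read of the flag (exchanging 0 for 0 just reads the value)
    if (InterlockedCompareExchange(&y, 0, 0) == 1)
    {
        printf("%ld\n", x);          // (d) the store to x is visible here: a -> b -> d
    }
}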

  • I'd say you're right, only accesses to `fValueHasBeenComputed` need to be interlocked. Why did they make it like this, I don't know. – jpalecek Oct 07 '11 at 09:58
  • 1
    Yeah, it's necessary. InterlockedExchange promises a full memory barrier so you can be sure that both variables are visible. But it does *not* guarantee in which *order* the pending buffer writes become visible. You are still screwed if the fValueHasBeenComputed update is flushed first. – Hans Passant Oct 07 '11 at 10:06
  • 3
@HansPassant: Wrong. Quoting the linked document: A full memory barrier ensures that **memory read and write operations that appear before the memory barrier instruction are committed to memory before any memory read and write operations that appear after** the memory barrier instruction. In fact, the barriers are there to *ensure ordering*, not visibility. – jpalecek Oct 07 '11 at 10:41
  • @jpalecek you are not protected in case when other instructions occur in another thread between these two – Oleg Oct 07 '11 at 11:00
  • 1
    @Oleg: OMG, protected against what? What do "instructions in another thread" have to do with this? – jpalecek Oct 07 '11 at 11:01
  • @jpalecek Ever heard about race conditions? – Oleg Oct 07 '11 at 11:16
  • 1
@Oleg: writes to well-aligned memory locations are atomic on x86. Even without interlocked exchange, `iValue` would be updated atomically (unless your code does something horrible to break alignment). So instructions in another thread have nothing to do with it. Moreover, even when both variables are updated with interlocked operations, there's still an interval in between them where instructions in another thread will see the new value of `iValue` and the old value of `fValueHasBeenComputed`. So the race condition is still there with interlocked. – jalf Oct 07 '11 at 11:30
  • 3
    @Oleg: Yes. The code (in either modification) does **nothing** against races between two calls of `CacheComputedValue()`. However, races between `Cache...` and `Fetch...` are equally impossible in both. Arguing for `InterlockedExchange` (especially if you don't use its return value) on grounds of race condition is lame. – jpalecek Oct 07 '11 at 11:36
  • @jalf Ever heard about multicore systems? – Oleg Oct 07 '11 at 13:07
  • @jpalecek I believe jalf explained this once again. Did you get it now? Order of operations is guaranteed, you are not guaranteed that operations are executed immediately one after another. – Oleg Oct 07 '11 at 13:11
  • @Oleg: yes, I have. Rather than asking if we have heard of a seemingly endless stream of unrelated words, perhaps you could explain what it is about them that you consider relevant. – jalf Oct 07 '11 at 13:15
  • @Oleg: you are never guaranteed that two operations are executed immediately after another (depends on what you mean by *immediately*, obviously). `Interlocked...` doesn't solve this either. – jpalecek Oct 07 '11 at 13:17
  • 1
    No matter how many CPUs or threads you have, writing a `LONG` on x86 (using Win32's definition of `LONG`) will (assuming your data is well aligned) be atomic. And no matter how many interlocked operations you use, there will be a gap between `iValue` and `fValueHasBeenComputed` getting updated, so there will be a potential race condition there if other code assumes that the two variables will be updated as a single atomic operation. – jalf Oct 07 '11 at 13:27
@jpalecek. Ok, just imagine that another piece of code checks fValueHasBeenComputed and gets false even if iValue was updated. This is obviously not solved by InterlockedExchange, but in my initial comment I only wanted to say that Hans Passant was correct about this. – Oleg Oct 07 '11 at 13:51

4 Answers

1

They did that to prevent partial reads/writes if the address of iValue is not aligned to an address that guarantees atomic access. This problem would arise when two or more physical threads try to write the value concurrently, or when one reads while another tries to write at the same time.

As a secondary point, it should be noted that stores are not always globally visible; they only become visible when serialized, either by a fence or by a bus lock.
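To illustrate the fence alternative, here is a minimal sketch (my own illustration, not code from the article; ComputeValue is assumed to exist elsewhere) that keeps the plain store to iValue but serializes it with an explicit barrier before publishing the flag:

#include <windows.h>

LONG iValue = 0;                      // a global LONG is naturally 4-byte aligned, so the plain store is atomic on x86
LONG fValueHasBeenComputed = FALSE;

extern LONG ComputeValue();

void CacheComputedValueWithFence()
{
    iValue = ComputeValue();          // plain store
    MemoryBarrier();                  // hardware full fence: the store above becomes visible first
    // depending on compiler settings you may also need a compiler barrier
    // such as _ReadWriteBarrier() to stop the compiler itself from reordering
    fValueHasBeenComputed = TRUE;     // publish the flag with a plain store
}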

Necrolis
  • How, in this particular example, can `iValue` be read partially modified even if `iValue` is not aligned to an address that guarantees atomic access? I mean that `FetchComputedValue()` reads `iValue` only after `fValueHasBeenComputed` has been set to `TRUE`. And at that moment `iValue` must be fully modified even if its modification was not atomic. –  Oct 14 '11 at 08:33
  • @skwllsp: although that may be partially true for x86, what happens if the thread is interrupted between the check and the read? Other threads are still free to modify the value, so when the interrupted thread receives control again, it can still read a partial value. The same applies to instructions: just because read2 directly follows read1 doesn't mean there can't be a write from a separate HW thread in between. Such are the perils of multi-core/threaded programming; you can **never** assume things. This is why early lockless algorithms fell prey to the ABA problem. – Necrolis Oct 14 '11 at 09:14
  • In the Microsoft example there are no other threads. So no other thread can change the two variables. –  Oct 14 '11 at 10:03
  • I have found a discussion about the same topic here: http://social.msdn.microsoft.com/Forums/pl-PL/parallelcppnative/thread/ef2c50f1-c5e8-4dfa-9e58-f2972a5e9f08. The summary: "On Windows platforms, InterlockedXxx functions have full fence semantics". So it means that only one `InterlockedExchange` is necessary (the second one). It guarantees that all modifications before the `InterlockedExchange` are correctly seen by other threads. –  Oct 14 '11 at 10:08
  • @skwllsp: the MS example is meant to apply to systems with two or *more* threads of execution. Yes, the barrier from an interlocked stops reordering and serializes all writes (as I mentioned in my answer, they use interlocked, aka bus locks, to globalize writes), but it still does not prevent an interruption between the two writes, and you need both interlocked to make the writes atomic & visible without MFENCE, as they plainly state in the article, under 'Fixing a Race Condition'. That other article also plainly states that not every system fences interlocks implicitly. – Necrolis Oct 14 '11 at 10:36
  • I think I don't understand your point about `Interlocked` without MFENCE. Why are you mentioning "without MFENCE"? In my situation I use Windows on x86/x64, and Microsoft here http://msdn.microsoft.com/en-us/library/windows/desktop/ms683590%28v=vs.85%29.aspx says "this function generates a full memory barrier (or fence) to ensure that memory operations are completed in order." –  Oct 14 '11 at 19:49
0

You simply get an atomic operation with InterlockedExchange. Why do you need it? Because InterlockedExchange does two things:

  1. Replaces the value of a variable
  2. Returns the old value

If you do the same thing in two operations (that is, first check the value, then replace it), you can get screwed if other instructions (on another thread) occur between these two; see the sketch below.

And you also prevent data races on this value. Here you get a good explanation of why a read/write on a LONG is not atomic.
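A minimal sketch of that check-then-replace race, and of how the returned old value fixes it (DoInit and fInitialized are illustrative names, not from the question):

#include <windows.h>

extern void DoInit();
LONG fInitialized = FALSE;

// Racy: the check and the set are two separate operations, so two threads
// can both see FALSE and both run DoInit().
void InitOnceRacy()
{
    if (fInitialized == FALSE)
    {
        DoInit();
        fInitialized = TRUE;
    }
}

// Atomic: InterlockedCompareExchange checks and sets in one operation and
// returns the old value, so exactly one thread runs DoInit().
void InitOnceAtomic()
{
    if (InterlockedCompareExchange(&fInitialized, TRUE, FALSE) == FALSE)
    {
        DoInit();
    }
}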

Oleg
  • All the `InterlockedExchange` examples in the OP's code do is #1. The returned value is discarded – jalf Oct 07 '11 at 11:38
  • @jalf Did you take a look at the link? – Oleg Oct 07 '11 at 13:08
  • you've got a strange way of discussing. Yes, I have taken a look at the link and yes, I have heard of race conditions and yes, I have heard of multicore machines. SO WHAT? Neither of those are relevant. On x86 machines, memory writes of aligned variables (at least if they're no wider than 32 bits) are atomic. And `LONG` in the Win32 API is 32 bits wide. Therefore, writing an object of type `LONG` to a well-aligned address is going to be atomic. `InterlockedExchange` also gives us access to the old value, *but the OP doesn't use that*. – jalf Oct 07 '11 at 13:20
  • @jalf Hmm, maybe I was not polite. Sorry, just a hard day. Memory writes are atomic in the way that you do not get a half-changed variable in any case. On the other hand you are not guaranteed that your updated variable is immediately synchronized back from CPU cache, so if you have a multicore system you can reach a situation when your value was updated by one core but value stored in another core's cache (and in memory) is different. So you have to use a memory barrier (fence) to verify the value was flushed before next access and this is basically what atomic operation implementation does. – Oleg Oct 07 '11 at 14:27
  • 1
    Oleg is right. Relying on proper alignment, the size of a defined type, and the behavior of the underlying hardware is absolutely terrible code that will someday be somebody's debugging headache. +1 – Carey Gregory Oct 07 '11 at 17:39
  • @Carey: It is impossible to write code without depending on those things. It's pretty fundamental of multithreaded code that you are going to rely on the properties of the underlying hardware. Unless you propose to put a memory barrier between *every* line of code, that is. – jalf Oct 07 '11 at 19:46
  • 1
@Oleg: most common CPU architectures (x86 included) have cache coherence between cores, so you *will* see the value updated on all cores or not at all. And looking at the code, the first value is not intended to be used until the second one has *also* been updated. A single memory barrier will flush both writes, and there is no need for the first one to be flushed before the second has been updated. – jalf Oct 07 '11 at 19:46
  • 2
    @jalf: I thoroughly disagree. Unless you're writing device drivers you shouldn't be assuming much of anything about hardware, and you certainly shouldn't be relying on a whole set of assumptions that may or may not be true with another platform, compiler, data item size, or OS. In a multithreaded app, operations on shared objects that need to be atomic require mutex protection from either the OS or compiler. Relying on quirks of the hardware is very poor practice. – Carey Gregory Oct 07 '11 at 21:33
  • 1
    @jalf, I think the issue is what happens if your code is recompiled in ten years on a different architecture. If you've used InterlockedExchange, it will still work. If not, all bets are off. – Harry Johnston Oct 07 '11 at 23:38
  • @Harry, what happens if your code is recompiled in ten years on a different architecture is not an issue. The issue is that I see a contradiction. The example from "Synchronization and Multiprocessor Issues" uses two `InterlockedExchange` calls. On the contrary, in example 3.1.1 "Basic Reordering" (which I think is quite similar to the first example) Herb Sutter gives this recommendation: "Make y interlocked: If y is interlocked, then there is no race on y because it is atomically updatable, and there is no race on x because a -> b -> d.". In this draft Herb does not use two `interlocked` variables! –  Oct 08 '11 at 07:37
0

There are two plausible resolutions to the contradiction you've observed.

One is that the second document is simply wrong in that particular respect. It is, after all, a draft. I note that the example you refer to specifically states that the programmer cannot rely on the writes to be atomic, which means that both writes must indeed be interlocked.

The other is that the additional interlock might not actually be required in that particular example, because it is a very special case: only a single bit of the variable is being changed. However, the specification being developed doesn't appear to mention this as a premise, so I doubt that this is intentional.

Harry Johnston
0

I think this discussion has the answer to the question: Implicit Memory Barriers.

Question: does calling InterlockedExchange (an implicit full fence) on T1 and T2 guarantee that T2 will "see" the writes done by T1 before the fence (the A, B and C variables), even though those variables are not placed on the same cache line as Foo and Bar?

Answer: Yes -- the full fence generated by the InterlockedExchange will guarantee that the writes to A, B, and C are not reordered past the fence implicit in the InterlockedExchange call. This is the point of memory barrier semantics. They do not need to be on the same cache line.
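To make the quoted pattern concrete, here is a sketch of what T1 and T2 might look like (my own illustration, with a generic Flag standing in for the Foo/Bar variables of that thread):

#include <windows.h>
#include <stdio.h>

LONG A, B, C;      // ordinary data, not necessarily on the same cache line as the flag
LONG Flag = 0;

void T1_Publish()
{
    A = 1;
    B = 2;
    C = 3;
    InterlockedExchange(&Flag, 1);   // full fence: the stores to A, B, C cannot move past it
}

void T2_Consume()
{
    // interlocked read of the flag (exchanging 1 for 1 only when it is already 1)
    if (InterlockedCompareExchange(&Flag, 1, 1) == 1)
    {
        // Flag was observed as set, so the writes to A, B and C are visible here
        printf("%ld %ld %ld\n", A, B, C);
    }
}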

Memory Barriers: a Hardware View for Software Hackers and Lockless Programming Considerations for Xbox 360 and Microsoft Windows are also interesting.