4

I'm wondering what guarantees compilers make to ensure that memory writes performed in one thread have visible effects in other threads.

I know countless cases in which this is problematic, and I'm sure that if you're interested in answering you know it too, but please focus on the cases I'll be presenting.

More precisely, I am concerned about the circumstances that can lead to threads missing memory updates done by other threads. I don't care (at this point) if the updates are non-atomic or badly synchronized: as long as the concerned threads notice the changes, I'll be happy.

I hope that compilers make a distinction between two kinds of variable accesses:

  • Accesses to variables that necessarily have an address;
  • Accesses to variables that don't necessarily have an address.

For instance, if you take this snippet:

void sleepingbeauty()
{
    int i = 1;
    while (i) sleep(1);
}

Since i is a local, I assume that my compiler can optimize it away, and just let the sleeping beauty fall to eternal slumber.

void onedaymyprincewillcome(int* i);

void sleepingbeauty()
{
    int i = 1;
    onedaymyprincewillcome(&i);
    while (i) sleep(1);
}

Since i is a local, but its address is taken and passed to another function, I assume that my compiler will now know that it's an "addressable" variable, and generate memory reads to it to ensure that maybe some day the prince will come.

int i = 1;
void sleepingbeauty()
{
    while (i) sleep(1);
}

Since i is a global, I assume that my compiler knows the variable has an address and will generate reads to it instead of caching the value.

void sleepingbeauty(int* ptr)
{
    *ptr = 1;
    while (*ptr) sleep(1);
}

I hope that the dereference operator is explicit enough to have my compiler generate a memory read on each loop iteration.

I'm fairly sure that this is the memory access model used by every C and C++ compiler in production out there, but I don't think there are any guarantees. In fact, the C++03 standard is even blind to the existence of threads, so this question wouldn't even make sense with the standard in mind. I'm not sure about C, though.

Is there some documentation out there that specifies if I'm right or wrong? I know these are muddy waters since they may not be on standards grounds, but it seems like an important issue to me.

Besides the compiler generating reads, I'm also worried that the CPU cache could technically retain an outdated value, and that even though my compiler did its best to bring the reads and writes about, the values never synchronise between threads. Can this happen?

zneak
  • You have to pick C or C++, they are different in this respect. – Puppy Jun 26 '11 at 19:36
  • @DeadMG, if I didn't pick one, it's because I didn't know of any difference. It would be useful if you explained them. – zneak Jun 26 '11 at 22:24

7 Answers

6

Accesses to variables that don't necessarily have an address.

All variables must have addresses (from the language's perspective -- compilers are allowed to avoid giving things addresses if they can, but that's not visible from inside the language). It's a side effect of the requirement that everything be "pointerable" that everything has an address -- even an empty class typically has a size of at least a char so that a pointer can be created to it.

Since i is a local, but its address is taken and passed to another function, I assume that my compiler will now know that it's an "addressable" variable, and generate memory reads to it to ensure that maybe some day the prince will come.

That depends on the content of onedaymyprincewillcome. The compiler may inline that function if it wishes and still make no memory reads.

Since i is a global, I assume that my compiler knows the variable has an address and will generate reads to it.

Yes, but it really doesn't matter if there are reads to it. These reads might simply be going to cache on your current local CPU core, not actually going all the way back to main memory. You would need something like a memory barrier for this, and no C++ compiler is going to do that for you.

I hope that the dereference operator is explicit enough to have my compiler generate a memory read on each loop iteration.

Nope -- not required. The function may be inlined, which would allow the compiler to completely remove these things if it so desires.

The only language feature in the standard that lets you control things like this w.r.t. threading is volatile, which simply requires that the compiler generate reads. That does not mean the value will be consistent though because of the CPU cache issue -- you need memory barriers for that.

If you need true multithreading correctness, you're going to be using some platform specific library to generate memory barriers and things like that, or you're going to need a C++0x compiler which supports std::atomic, which does make these kinds of requirements on variables explicit.
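For illustration, here is a minimal sketch of what that looks like with C++0x's std::atomic, assuming a compiler that already ships <atomic> and <thread> (the thread setup is only there to make the example self-contained):

#include <atomic>
#include <chrono>
#include <thread>

std::atomic<int> i(1);  // loads and stores on an atomic cannot be optimized away or torn

void sleepingbeauty()
{
    while (i.load())  // every iteration performs a real, visible read
        std::this_thread::sleep_for(std::chrono::seconds(1));
}

void onedaymyprincewillcome()
{
    i.store(0);  // this store is guaranteed to become visible to the sleeping thread
}

int main()
{
    std::thread beauty(sleepingbeauty);
    std::thread prince(onedaymyprincewillcome);
    prince.join();
    beauty.join();
}

No volatile and no hand-written barriers are needed here: the atomic operations carry the visibility requirements themselves.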

Billy ONeal
  • All variables must have addresses? No! First, being allowed to use address-of doesn't mean you did, and variables which don't have the address taken may very well exist only in a CPU register. Secondly, not everything is "pointerable" (specifically, bitfields aren't). – Ben Voigt Jun 22 '11 at 03:12
  • @Ben: 1. It's possible at the machine level for things to exist only in a register, yes. But as far as the language is concerned, that never happens. 2. Yes, but a bitfield exists as part of a struct to which a pointer can be made. – Billy ONeal Jun 22 '11 at 03:17
  • @Billy: There's the as-if rule. The compiler only has to assign an address to a variable if the address is taken. If not, the language allows it to not have an address, so long as the as-if rule is satisfied. – Ben Voigt Jun 22 '11 at 03:19
  • @Ben: Exactly. But the as-if rule is saying "You can do things outside of the spec without telling people so long as you follow what's in the spec". If you avoid putting things in memory, that's fine, but that's not going to be visible to the language itself. (That's kind of the point of the as-if rule -- that you can't tell from inside the language) – Billy ONeal Jun 22 '11 at 03:22
  • @Billy: So what you're really saying is not "all variables have addresses" but "if a variable doesn't have an address, you'll never know the difference"? – Ben Voigt Jun 22 '11 at 03:25
  • @Ben: Yes, that's right. (Another example of an optimization that takes advantage of this is the Empty Base Class optimization) – Billy ONeal Jun 22 '11 at 03:26
  • @Ben: I've added a note to the answer to that effect. – Billy ONeal Jun 22 '11 at 03:27
  • If the function is inlinable and doesn't notify any external agent of the pointer, then we're back to the same situation as case 1 where the sleeping beauty never wakes up. For the purpose of my question, I'd like that we assume it either can't be inlined, or passes the pointer to a global place, or starts another thread and passes it as an argument. – zneak Jun 22 '11 at 03:33
  • Also, the article I've linked to for almost every other answer says that `volatile` should really be replaced by atomic or synchronization library calls. How does the compiler know that, if I use synchronization functions (which _should_ be enough, right?), it absolutely positively has to generate the reads that make the synchronization functions useful? – zneak Jun 22 '11 at 03:37
  • @zneak: Because the standard atomics and synchronization primitives use assembly instructions for the various CPUs which one might wish to target. Just as `printf` writes to the console, which is otherwise not exposed to the language. – Billy ONeal Jun 22 '11 at 04:10
  • As far as current C/C++ compilers are concerned, any external function is written in assembly. There's probably no special case for synchronization functions that tell the compiler it needs to generate reads and writes. (Ben Voigt provided an interesting answer to that in a comment to another answer.) – zneak Jun 22 '11 at 04:19
  • @Zneak: That's not true (assuming by "external" you mean `extern`). Most any modern compiler nowadays has link-time code generation, which is going to make even functions which don't exist in a given translation unit be optimized as if they were part of that translation unit. As far as special cases for synchronization functions though, you are correct. No special cases need be made by the compiler. – Billy ONeal Jun 22 '11 at 04:22
  • This is not exactly true either. LTOs don't operate "as if [the optimized functions] were part of that translation unit"; full optimizations are done at a stage at which the compiler has much, much more information than it has when it starts doing LTO. My expectations about LTO (which may not be accurate) are that they'd be able to inline functions, but I'm fairly sure that they don't extend to being able to detect a memory barrier and put back reads in place. – zneak Jun 22 '11 at 04:43
  • @zneak: Well, there's no need for the compiler to put back reads in place around a memory barrier. It only affects `volatile` variables, and the compiler already can't eliminate reads for those. For the atomics, all the requirements are met inside the library. (Even so, at least on MSVC++, LTCG actually just runs all the optimization stages at link time -- you get everything. The object files in such scenarios are just dumps of the state of the front end, rather than object code. Not sure how GCC does it though.) – Billy ONeal Jun 22 '11 at 04:46
  • You're right, I got it all wrong for this part and had done a fundamental mistake in my understanding of link-time optimizations. I don't know win32 enough to create a thread to test, but it seems that VS's C++ compiler optimizes away the reads to global variables when they're never modified. Still, from my basic tests, it didn't seem to inline calls that are from a referenced library, so I'd be surprised if it really was able to dig into them altogether to find out about memory barriers. – zneak Jun 22 '11 at 05:21
  • @Zneak: The thing about LTCG is that it's ineffective unless you specify it both at compile time and at link time. Your libraries probably didn't specify it when they were built. – Billy ONeal Jun 22 '11 at 05:40
  • Well then; I suppose then that most of the time the compiler can figure if the reads are necessary, but sometimes it can't. Have you ever tried it though? From what you mention, it looks like MSVC++ "makes volatile" all variables for which it can't ensure the 'localness', simply because previous code has a memory barrier (not to mention that memory barriers aren't much related to the compiler-side of the issue). What's the extent of this? – zneak Jun 22 '11 at 13:53
  • @zneak: It has nothing to do with whether or not there's a memory barrier there. If you call another function, that function can do arbitrary things. You don't know what the global state is afterward. The compiler would have to generate such reads, unless it can inline the callee and prove that the state was unmodified (in which case of course it can eliminate the reads). – Billy ONeal Jun 22 '11 at 14:45
  • Then I completely missed the point of your comment of yesterday. I asked how does the compiler knows it has to generate reads in functions that use synchronization functions, and you answered "Because the standard atomics and synchronization primitives use assembly instructions for the various CPUs which one might wish to target." My interpretation was that "Compilers can know that there's a memory barrier, and will act accordingly." Can you clarify what you meant? – zneak Jun 22 '11 at 15:31
  • @zneak: E.g. `std::atomic::operator++` will use `InterlockedIncrement` on Windows platforms. The compiler does not need to know anything about what happens inside the library -- by virtue of the thing being atomic the library contains the memory barriers and stuff inside. There's no special treatment of code outside of the library. If it's not in the atomic, then there are not requirements on the compiler to generate reads if it can prove them redundant. – Billy ONeal Jun 22 '11 at 15:34
  • There was a misunderstanding then; I'm speaking about reads and writes done without lockless atomic guarantees (and possibly without memory barriers). Suppose [this snippet](http://pastebin.com/GYFMH08f). What analysis will cause a C++03-compliant compiler (which means blind to the existence of threads) to not read `i` once and never read it again, considering `i` is not `volatile`? Ben Voigt says it's because the compiler does not know that `pthread_mutex_lock` cannot cause `i` to change, even assuming a process model where threads don't exist. – zneak Jun 22 '11 at 16:06
  • 1. `pthread_mutex_lock` uses CPU specific instructions to create the memory barrier, which removes the CPU caching issue. 2. `pthread_mutex_lock` cannot be analyzed by the compiler (it's part of the operating system) and therefore the compiler probably cannot remove the memory read of `i`. – Billy ONeal Jun 22 '11 at 16:14
2

You assume wrong.

void onedaymyprincewillcome(int* i);

void sleepingbeauty()
{
    int i = 1;
    onedaymyprincewillcome(&i);
    while (i) sleep(1);
}

In this code, your compiler will load i from memory each time through the loop. Why? NOT because it thinks another thread could alter its value, but because it thinks that sleep could modify its value. It has nothing to do with whether or not i has an address or must have an address, and everything to do with the operations that this thread performs which could modify the variable.

In particular, it is not guaranteed that assigning to an int is even atomic, although this happens to be true on all platforms we use these days.

Too many things go wrong if you don't use the proper synchronization primitives for your threaded programs. For example,

char *str = 0;
asynch_get_string(&str);
while (!str)
    sleep(1);
puts(str);

This could (and even will, on some platforms) sometimes print out utter garbage and crash the program. It looks safe, but because you are not using the proper synchronization primitives, the change to str could be seen by your thread before the change to the memory location it refers to, even though the other thread initializes the string before setting the pointer.

So just don't, don't, don't do this kind of stuff. And no, volatile is not a fix.

Summary: The basic problem is that the compiler only changes what order the instructions go in, and where the load and store operations go. This is not enough to guarantee thread safety in general, because the processor is free to change the order of loads and stores, and the order of loads and stores is not preserved between processors. In order to ensure things happen in the right order, you need memory barriers. You can either write the assembly yourself or you can use a mutex / semaphore / critical section / etc, which does the right thing for you.
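As a hedged sketch of the mutex route, assuming POSIX threads (the variable and function names are made up for the example; link with -lpthread):

#include <pthread.h>
#include <unistd.h>

static int awake = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void* prince(void* arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    awake = 1;                      /* the write happens inside the critical section */
    pthread_mutex_unlock(&lock);
    return 0;
}

static void sleepingbeauty(void)
{
    for (;;) {
        pthread_mutex_lock(&lock);  /* opaque call + memory barrier: 'awake' is re-read */
        int done = awake;
        pthread_mutex_unlock(&lock);
        if (done) return;
        sleep(1);
    }
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, 0, prince, 0);
    sleepingbeauty();
    pthread_join(t, 0);
    return 0;
}

The lock/unlock pair solves both halves of the problem at once: the compiler cannot cache awake across calls it cannot see into, and the mutex implementation contains the memory barriers that make the store visible to the other core.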

Dietrich Epp
  • I'm not worried about synchronization. I just want to be sure that threads are notified about the changes. – zneak Jun 22 '11 at 03:02
  • `sleep` cannot modify the local `i` in the first example. The compiler's not going to generate reads for that. Maybe for the global example you'd have a point...., – Billy ONeal Jun 22 '11 at 03:03
  • Why can't `sleep` modify `i`? It could get the address through a global from `onedaymyprincewillcome`. (Assuming the compiler doesn't have specific knowledge of `sleep` specifically). – Dietrich Epp Jun 22 '11 at 03:05
  • @zneak: Exactly. "You are not worried about synchronization" is essentially the problem which you have, and I'd like to fix *that* problem. – Dietrich Epp Jun 22 '11 at 03:07
  • @Dietrich: Ah ... I missed that loophole. (Read that wrong) However, if onedaymyprincewillcome gets inlined, all bets are off -- nothing requires the compiler to generate reads there, was my main point. (And even if it did generate reads, that still wouldn't solve the problem due to CPU caches) – Billy ONeal Jun 22 '11 at 03:07
  • @Billy: Yes, that's a good point. I wasn't trying to say that "i must be read by the compiler", but that "if the compiler does read i, it's probably for this reason". So we have to assume that the compiler knows nothing about `sleep` OR `onedaymyprincewillcome`. – Dietrich Epp Jun 22 '11 at 03:08
  • I'm not worried about synchronization because that's not the issue and I stripped down my question to avoid the noise you're generating. I _know_ that you must synchronize threaded code for accesses. I'm asking this because I want to know why I'm safe, assuming I do use synchronization, without using the `volatile` keyword. – zneak Jun 22 '11 at 03:11
  • @zneak: The answer is: you're not safe. A future compiler could know that `sleep` doesn't modify any variables, and turn `i` into a register access. – Dietrich Epp Jun 22 '11 at 03:19
  • Thank you for this straightforward answer. Which case does it address, though? In the first snippet, where `i` is a local with no outside reference, I don't expect the loop to ever exit. Was the comment directed at another one, or to all of them? – zneak Jun 22 '11 at 03:23
  • @zneak: I was addressing this particular loop. One of the problems here is that compilers assume no asynchronous access, but don't reorder escaped loads and stores across arbitrary functions such as `pthread_mutex_lock`. The other problem is that the processor can reorder the loads and stores across function calls, but not across memory barriers (which synchronization primitives use, as necessary). – Dietrich Epp Jun 22 '11 at 13:43
2

While the C++98 and C++03 standards do not dictate a standard memory model that must be used by compilers, C++0x does, and you can read about it here: http://www.hpl.hp.com/personal/Hans_Boehm/misc_slides/c++mm.pdf

In the end, for C++98 and C++03, it's really up to the compiler and the hardware platform. Typically there will not be any memory barrier or fence operation issued by the compiler for normally written code unless you use a compiler intrinsic or something from your OS's standard library for synchronization. Most mutex/semaphore implementations also include a built-in memory barrier operation that prevents the CPU from performing speculative reads and writes across the locking and unlocking operations on the mutex, and that prevents the compiler from re-ordering operations across those same calls.

Finally, as Billy points out in the comments, on Intel x86 and x86_64 platforms, any single-byte read or write is atomic, as is a read or write of a register-sized value to any 4-byte aligned memory location on x86, and to any 4- or 8-byte aligned memory location on x86_64. On other platforms, that may not be the case and you would have to consult the platform's documentation.
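As a non-portable illustration of the barrier point, here is a hedged sketch that assumes GCC (or a GCC-compatible compiler) on x86/x86_64 and uses the __sync_synchronize() full-barrier intrinsic; by the C++0x rules this is still a data race, it only shows the platform behaviour described above:

#include <pthread.h>
#include <unistd.h>

static int data = 0;   /* naturally aligned int: a single aligned store/load is atomic on x86 */
static int ready = 0;

static void* producer(void* arg)
{
    (void)arg;
    data = 42;
    __sync_synchronize();   /* full barrier: 'data' is published before 'ready' */
    ready = 1;
    return 0;
}

static void* consumer(void* arg)
{
    (void)arg;
    while (!ready)          /* the opaque sleep() call keeps the re-read inside the loop */
        sleep(1);
    __sync_synchronize();   /* barrier before reading 'data' */
    /* on x86, with these barriers in place, 'data' should be 42 here */
    return 0;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&c, 0, consumer, 0);
    pthread_create(&p, 0, producer, 0);
    pthread_join(p, 0);
    pthread_join(c, 0);
    return 0;
}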

Jason
  • Short version: Reads and writes to variables smaller than a pointer are atomic. Everything else is undefined behavior. If you want something defined you have to use `std::atomic`. – Billy ONeal Jun 22 '11 at 03:08
  • Look on page 5 of the paper you cite, which explains why you aren't guaranteed to be able to do what the poster is asking in C++0x, even with its memory model. Specifically, the poster wants to create an operation where one thread reads `i` and one thread writes to it, possibly at the same time, which is a "data race" by the definition in the slides. The next slide explains that only if you *don't* have a data race (by this definition) does the implementation make promises about interleaving. It may work on your architecture, or it may not. Everyone's spoiled by x86 concurrency. – Dietrich Epp Jun 22 '11 at 03:17
  • Right, the C++0x memory model was not created to avoid data-races if you intentionally decide to create them. I guess the point I was trying to make was that currently there is no standard for memory visibility between threads, and with C++0x there will be, but not that a defined memory model will prevent data-races should you choose to forego the use of the memory barrier constructs in the specification. – Jason Jun 22 '11 at 03:21
1

The only control you have over optimisation is volatile.

Compilers make NO guarantee about concurrent threads accessing the same location at the same time. You will need some type of locking mechanism.

Richard Schneider
  • Are you sure about that? [Arch Robinson, Threading Building Blocks architect at Intel, says `volatile` is almost useless for threading.](http://software.intel.com/en-us/blogs/2007/11/30/volatile-almost-useless-for-multi-threaded-programming/) – zneak Jun 22 '11 at 02:52
  • Besides, it's not even a concurrency issue. I don't care that my `int` can possibly be inconsistent; I just want to be sure that other threads will _eventually_ see a change. As mentioned in the first paragraph of my question, I'm not concerned about atomicity or synchronisation. I'll make this clearer. – zneak Jun 22 '11 at 02:54
  • @zneak: If you want to write portable C++ code, then no, per the spec `volatile` is basically useless for threading. However, it does depend on the compiler; for example, [Visual C++ provides acquire/release semantics for volatile-qualified things](http://msdn.microsoft.com/en-us/library/12a04hfd.aspx) (it goes above and beyond what is required by the spec). – James McNellis Jun 22 '11 at 02:59
  • @zneak, that is not relevant to your question. You even mentioned it yourself: *"There are no reordering or atomicity issues to address"*. As Richard has pointed out, the only way you can guarantee a variable's address will exist is with the volatile keyword; just keep in mind this will obviously prevent particular optimisations by the compiler. – hiddensunset4 Jun 22 '11 at 03:11
  • @zneak: I said volatile can control optimisation NOT help with threading. Volatile will make sure that the compiler does not cache the variable in a register. – Richard Schneider Jun 22 '11 at 03:14
  • @zneak: Just because it isn't very useful doesn't in any way diminish what @Richard said. He didn't say `volatile` is good for threading, just that you don't have anything else. Pre-C++0x, there isn't anything except volatile which has any effect whatsoever. – Ben Voigt Jun 22 '11 at 03:15
  • @Daniel, what do you mean about "guaranteeing that variables address will exist"? – zneak Jun 22 '11 at 03:16
  • @Ben Voigt, @Daniel, @Richard; my comments are mostly aimed at the last paragraph, which suggests using locking mechanisms. The article I linked to says that locking mechanisms alone should be enough for my threaded accesses; my question, mostly, is "why is this the case"? – zneak Jun 22 '11 at 03:20
  • @zneak: Because those locking functions have per-platform implementation which includes the necessary memory fences. – Ben Voigt Jun 22 '11 at 03:22
  • @Ben Voigt; your comment addresses my second concern about values being hidden in caches and memory not being flushed. However, the particular implementation of a synchronization function will not affect how or why the compiler doesn't completely optimize variables away into registers, even though the local function doesn't modify them, and still generates reads for them once in a while. This is the issue `volatile` "fixes", and apparently this fix is not necessary. – zneak Jun 22 '11 at 03:30
  • @zneak: It depends on whether the compiler's dataflow analysis can prove that the locking function doesn't modify the variable. For example, a static file-scoped variable whose address is never taken, and is only assigned within a function whose address is never taken. (But that rules out all threading implementations I've seen, which need the address of the thread procedure.) – Ben Voigt Jun 22 '11 at 03:34
  • @Ben, does that mean synchronization functions have to modify variables to work? For instance, is it bad to use a `pthread_mutex` to control access to a shared structure since the mutex has zero chance of changing the structure? – zneak Jun 22 '11 at 03:40
  • @zneak: Synchronization functions don't actually have to modify the data, they have to be able to (according to dataflow analysis). The compiler doesn't know whether the `pthread_mutex` functions might call a function pointer previously passed to `pthread_create`, since it can't see into the library code. (If the compiler does have special treatment for pthread functions, it will surely be aware they are intended for synchronization.) – Ben Voigt Jun 22 '11 at 03:43
  • @Ben, you're being pretty helpful in solving this. Thank you! It would probably be a good idea that you start writing an answer. – zneak Jun 22 '11 at 03:46
0

I can only speak for C. Since synchronization is a CPU-implemented functionality, a C programmer would need to call an OS library function containing an access to the lock (the CriticalSection functions in the Windows NT engine), or implement something simpler (such as a spinlock) and access the functionality himself.

volatile is a good property to use at the module level. Sometimes a non-static (public) variable will work too.

  • local (stack) variables will not be accessible from other threads and should not be.
  • variables at the module level are good candidates for access by multiple threads but will require synchronization functions to work predictably.

Locks are unavoidable but they can be used more or less wisely resulting in a negligible or considerable performance penalty.

I answered a similar question here concerning unsynchronized threads but I think you'll be better off browsing on similar topics to get high-quality answers.

Olof Forshell
  • Why do you say that "`volatile` is a good property to use at the module level"? Which issue does `volatile` address? – zneak Jun 22 '11 at 21:17
  • At the module level means that it is accessible at least by code in the module (if declared static) or by all code (if it is not). Volatile lets the compiler know that the value in the variable may change at any time. That is to say that if a code sequence reads the values from two or more different places in a function the compiler should not (temporarily) store the value from the first read to replace the following reads, it must read the value at every reference to it. The alternative is that the compiler might (depending on a lot of factors) optimize away all but the first read. – Olof Forshell Jun 23 '11 at 06:24
  • There is a wide variety of factors that will cause your compiler to not optimize away reads (see other answers); [Arch Robinson at Intel Threading Building Blocks says it isn't necessary anyways](http://software.intel.com/en-us/blogs/2007/11/30/volatile-almost-useless-for-multi-threaded-programming/). – zneak Jun 23 '11 at 15:50
  • Still, if the value in a variable may change at any time, why risk the compiler removing reads by declaring the variable something else than volatile? – Olof Forshell Jun 23 '11 at 18:27
  • Making sure the reads happen is of course a priority, however there are other ways to make this happen. This question was all about them. The problem with `volatile` is that it doesn't just ensure that the reads we need are there, it also ensures that _no_ read is removed at all; the same goes for writes. – zneak Jun 23 '11 at 22:45
0

I'm writing this answer because most of the help came from comments to questions, and not always from the authors of the answers. I already upvoted the answers that helped me most, and I'm making this a community wiki to not abuse the knowledge of others. (If you want to upvote this answer, consider also upvoting Billy's and Dietrich's answers too: they were the most helpful authors to me.)

There are two problems to address when values written from a thread need to be visible from another thread:

  • Caching (a value written from a CPU could never make it to another CPU);
  • Optimizations (a compiler could optimize away the reads to a variable if it feels it can't be changed).

The first one is rather easy. On modern Intel processors, there is a concept of cache coherence, which means changes to a cache propagate to other CPU caches.

Turns out the optimization part isn't too hard either. As long as the compiler cannot prove that a function call leaves the content of a variable unchanged, even under a single-threaded model, it won't optimize the reads away. In my examples, the compiler doesn't know that sleep cannot change i, and this is why reads are issued at every operation. It doesn't need to be sleep though; any function for which the compiler doesn't have the implementation details would do. I suppose that a particularly well-suited function to use would be one that emits a memory barrier.
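To illustrate the pattern, here's a small sketch; the two external functions are hypothetical stand-ins for code the compiler cannot see into (a lock function, a sleep, anything living in another library):

void publish_address(int* p);   /* hypothetical: hands &i to another thread */
void checkpoint(void);          /* hypothetical: opaque call the compiler cannot inline */

void sleepingbeauty(void)
{
    int i = 1;
    publish_address(&i);        /* the address escapes, so 'i' may change behind our back */
    while (i)                   /* the compiler must re-read 'i' after each call */
        checkpoint();
}

Even a C++03 compiler that knows nothing about threads keeps these reads, simply because it cannot prove that checkpoint() leaves i alone.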

In the future, it's possible that compilers will have better knowledge of currently impenetrable functions. However, by the time that happens, I expect that there will be standard ways to ensure that changes are propagated correctly. (This is coming with C++11 and the std::atomic<T> class. I don't know about C1x.)

zneak
0

I'm not sure you understand the basics of the topic you claim to be discussing. Two threads, each starting at exactly the same time and looping one million times each performing an inc on the same variable will NOT result in a final value of two million (two * one million increments). The value will end up somewhere in-between one and two million.

The first increment will cause the value to be read from RAM into the L1 (via first the L3 then the L2) cache of the accessing thread/core. The increment is performed and the new value written initially to L1 for propagation to lower caches. When it reaches L3 (the highest cache common to both cores) the memory location will be invalidated in the other core's caches. This may seem safe, but in the meantime the other core has simultaneously performed an increment based on the same initial value in the variable. The invalidation from the write by the first core will be superseded by the write from the second core invalidating the data in the caches of the first core.

Sounds like a mess? It is! The cores are so fast that what happens in the caches falls way behind: the cores are where the action is. This is why you need explicit locks: to make sure that the new value winds up low enough in the memory hierarchy such that other cores will read the new value and nothing else. Or put another way: slow things down so the caches can catch up with the cores.

A compiler does not "feel." A compiler is rule-based and, if constructed correctly, will optimize to the extent that the rules allow and the compiler writers are able to construct the optimizer. If a variable is volatile and the code is multi-threaded the rules won't allow the compiler to skip a read. Simple as that even though on the face of it it may appear devilishly tricky.

I'll have to repeat myself and say that locks cannot be implemented in a compiler because they are specific to the OS. The generated code will call all functions without knowing if they are empty, contain lock code or will trigger a nuclear explosion. In the same way the code will not be aware of a lock being in progress since the core will insert wait states until the lock request has resulted in the lock being in place. The lock is something that exists in the core and in the mind of the programmer. The code shouldn't (and doesn't!) care.

Olof Forshell
  • I never mentioned two threads writing at the same time at the same moment. The question is solely about one thread writing and one thread reading. I know that you need synchronization to write from two threads. – zneak Jun 25 '11 at 18:49
  • Okay so have core number two read: it will still take time for the value written by core one to reach a level where it will cause an invalidation in core two's caches. If you don't have a lock core two will assume it has the correct value until its data is invalidated. How much of a problem this latency is determines if you need a lock. – Olof Forshell Jun 25 '11 at 19:17
  • I was pretty sure it was not _instant_ either, though. What causes locks to invalidate cached data anyways? – zneak Jun 25 '11 at 19:28
  • Locks don't invalidate cached data, writes/updates do. If two cores are using the same data location and one of them writes to it, the other core's cached copy of the contents of the location will be flagged as invalid as soon as the write threads its way down through the caches. An update of RAM from a bus-mastering device (disk controller, NIC etc) will also invalidate the RAM locations involved, provided of course that the cores are using them. – Olof Forshell Jun 26 '11 at 18:21