This question is the textbook example of what makes concurrent programming difficult. A really thorough explanation could fill an entire book, and there are plenty of books and articles of varying quality on the subject.
But we can summarize a little. A global variable is in a memory space visible to all the threads. (The alternative is thread-local storage, which only one thread can see.) So you would expect that if you have a global variable G, and thread A writes value x to it, then thread B will see x when it reads that variable later on. And in general, that is true -- eventually. The interesting parts are what happens before "eventually".
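To make that distinction concrete, here is a minimal sketch in C (assuming C11 for _Thread_local and POSIX threads; the names shared_counter, private_counter, and worker are just illustrative):

#include <pthread.h>
#include <stdio.h>

int shared_counter = 0;                 /* global: one object visible to every thread */
_Thread_local int private_counter = 0;  /* thread-local: each thread gets its own copy */

static void *worker(void *arg)
{
    (void)arg;
    shared_counter++;    /* unsynchronized write to shared memory -- a data race */
    private_counter++;   /* touches only this thread's own copy */
    printf("private_counter = %d\n", private_counter);  /* always prints 1 */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("shared_counter = %d\n", shared_counter);  /* usually 2, but nothing guarantees it */
    return 0;
}

The increment of shared_counter is exactly the kind of access the rest of this answer is about.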
The biggest sources of trickiness are memory consistency and memory coherence.
Coherence describes what happens when thread A writes to G and thread B tries to read it at nearly the same moment. Imagine that threads A and B are on different processors (let's also call the processors A and B for simplicity). When A writes to the variable, there is a lot of circuitry between it and the memory that thread B sees. First, A will probably write to its own data cache, which will hold that value for a while before writing it back to main memory. Flushing the cache to main memory also takes time: there are a number of signals that have to go back and forth across wires, capacitors, and transistors, and a complicated conversation between the cache and the main memory unit. Meanwhile, B has its own cache. When changes occur in main memory, B may not see them right away -- not until it refills that cache line from main memory. And so on. All in all, it may be many microseconds before thread A's change is visible to B.
Consistency describes what happens when A writes to variable G and then to variable H. If A reads those variables back, it will see the writes happen in that order. But thread B may see them in a different order, depending on which one gets flushed from cache back to main RAM first. And what happens if both A and B write to G at the same time (by the wall clock) and then try to read it back? Which value will each of them see?
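To see why the ordering question matters, here is a sketch of the classic publish-then-flag pattern with no synchronization at all (data and ready stand in for G and H; this is deliberately broken code):

#include <stdio.h>

int data = 0;   /* "G" */
int ready = 0;  /* "H" */

/* runs on thread A */
void producer(void)
{
    data = 42;   /* first write */
    ready = 1;   /* second write */
}

/* runs on thread B */
void consumer(void)
{
    while (ready == 0) { /* spin */ }
    /* Nothing here guarantees the write to data has become visible yet,
       so B can observe ready == 1 and still read data == 0. */
    printf("%d\n", data);
}

On hardware (or with a compiler) that reorders the two writes, thread B can leave the loop and still print 0.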
Coherence and consistency are enforced on many processors with memory barrier operations. For example, the PowerPC has a sync opcode, which says "guarantee that any writes that have been made by any thread to main memory will be visible to any read performed after this sync operation." (Essentially it does this by rechecking every cache line against main RAM.) The Intel architecture does this automatically to some extent, if you warn it ahead of time that "this operation touches synchronized memory."
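From C you rarely emit sync or a fence instruction by hand; you ask for a barrier through an intrinsic or the language's atomics. Here is a minimal sketch of fixing the producer/consumer example above, assuming C11 <stdatomic.h> (on PowerPC the compiler lowers these fences to sync/lwsync-style instructions):

#include <stdatomic.h>

int data = 0;              /* ordinary shared variable */
atomic_int ready = 0;      /* the flag that publishes it */

void producer(void)
{
    data = 42;
    /* release fence: all writes above must become visible before the flag flips */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

void consumer(void)
{
    while (atomic_load_explicit(&ready, memory_order_relaxed) == 0) { /* spin */ }
    /* acquire fence: pairs with the release fence in producer */
    atomic_thread_fence(memory_order_acquire);
    /* data is now guaranteed to read as 42 */
}

(With GCC/Clang, __sync_synchronize() is the blunt full-barrier equivalent if you can't use C11 atomics.)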
Then you have the issue of compiler reordering. This is where the code
int foo(int *e, int *f, int *g, int *h)
{
    *e = *g;
    *f = *h;
    // <-- another thread could theoretically write to g and h here
    return *g + *h;
}
can be internally converted by the compiler into something more like
int bar(int *e, int *f, int *g, int *h)
{
    int b = *h;
    int a = *g;
    *f = b;
    int result = a + b;
    *e = a;
    return result;
}
which could give you a completely different result if another thread performed a write at the point marked above! Also notice how the writes occur in a different order in bar. This is the problem that volatile is supposed to solve -- it prevents the compiler from caching the value of *g in a local, and instead forces it to reload that value from memory every time it sees *g.
As you can see, this is inadequate for enforcing memory coherence and consistency across many processors. It was really invented for cases where you had one processor that was trying to read from memory-mapped hardware -- like a serial port, where you want to look at a location in memory every n microseconds to see what value is currently on the wire. (That is really how I/O worked back when they invented C.)
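For the memory-mapped hardware case it was invented for, though, volatile does exactly what you want. A sketch, with made-up register addresses and bit layout:

#define UART_STATUS ((volatile unsigned int *)0x4000C000u)  /* hypothetical status register */
#define UART_DATA   ((volatile unsigned int *)0x4000C004u)  /* hypothetical data register   */
#define DATA_READY  0x1u

unsigned int read_byte(void)
{
    /* volatile forces a fresh load of the register on every iteration,
       so the compiler cannot hoist it out of the loop */
    while ((*UART_STATUS & DATA_READY) == 0) {
        /* spin until the hardware raises the ready bit */
    }
    return *UART_DATA;
}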
What to do about this? Well, like I said, there are whole books on the subject. But the short answer is that you probably want to use the facilities your operating system / runtime platform provide for synchronized memory.
For example, Windows provides the Interlocked memory access functions to give you a clear way of communicating memory between threads A and B. GCC exposes similar atomic builtins, Intel's Threading Building Blocks (TBB) give you a nice interface for x86/x64 platforms, and the C++11 thread support library (std::atomic and friends) provides some facilities as well.
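As a tiny illustration of what those facilities look like, here is a shared counter incremented with an atomic read-modify-write in C11; the platform-specific equivalents are noted in the comment (to the best of my knowledge -- check your own platform's docs):

#include <stdatomic.h>

atomic_int hits = 0;

void record_hit(void)
{
    /* atomic read-modify-write: every thread's increment is observed exactly once */
    atomic_fetch_add(&hits, 1);

    /* Roughly equivalent on specific platforms:
         Windows:    InterlockedIncrement(&hits_win);          // LONG volatile *
         GCC/Clang:  __atomic_fetch_add(&hits, 1, __ATOMIC_SEQ_CST);
         C++11:      std::atomic<int> hits; hits.fetch_add(1);                  */
}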