
As threads execute on a multi-processor/multi-core machine, they cause CPU caches to load data from RAM. If two threads are supposed to 'see' the same data, that is not guaranteed, because thread1 may update the value in the cache of the CPU it is currently executing on, and this change will not be immediately visible to thread2.

To solve this problem, programming languages like Java provide constructs like volatile.

It is clear to me what the problem with multiple threads executing on different CPUs is.

I am pretty sure that a given thread is not bound to a particular CPU for its lifetime and can get scheduled to run on a different CPU. But it is not clear to me why that does not cause problems similar to the ones seen with different threads on different CPUs. After all, this thread may have made an update in one CPU's cache which has not yet been written to RAM. If this thread now gets scheduled on another CPU, wouldn't it have access to stale data?

The only possibility I can think of, as of now, is that context switching involves writing all the data visible to the thread back to RAM, and that when a thread gets scheduled on a CPU, its cache gets refreshed from RAM (to prevent the thread from seeing stale values). However, this looks problematic from a performance point of view, as time-slicing means threads are getting scheduled all the time.

Can some expert please advise what the real story is?

Basil Bourque
Amit Mittal

4 Answers

3

Caches on modern CPUs are always coherent. So if a store is performed by one CPU, then a subsequent load on a different CPU will see that store. In other words: the cache is the source of truth; memory is just an overflow bucket and could be completely out of sync with reality. Since the caches are coherent, it does not matter on which CPU a thread runs.

Even on a single-CPU system, the lack of volatile can cause problems due to compiler optimizations. A compiler could, for example, hoist the read of a variable out of a loop, and then a write made by one thread will never be seen by another thread, even if both are running on the same CPU.
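To make that hoisting hazard concrete, here is a minimal sketch (the class and field names are my own, not from the answer): with a plain field the JIT may read `running` once and spin on the cached value forever; declaring it `volatile` forbids that optimization.

```java
public class HoistingExample {
    // With a plain (non-volatile) field, the JIT may hoist the read of
    // 'running' out of the loop, so the worker could spin forever even
    // after the main thread sets it to false. 'volatile' forbids that.
    private static volatile boolean running = true;

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            while (running) {
                // busy-wait until the write by the main thread becomes visible
            }
            System.out.println("worker observed running == false");
        });
        worker.start();

        Thread.sleep(100);
        running = false;   // guaranteed visible to the worker because the field is volatile
        worker.join();
    }
}
```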

I would suggest not thinking in terms of hardware. If you use Java, make sure you understand the Java Memory Model (JMM). It is an abstract model, defined independently of any particular hardware, precisely so that you do not have to reason in terms of hardware.

pveentjer
  • A coherent cache preserves 2 invariants: the single-writer/multiple-reader invariant and the data-value invariant. These two combined make sure that there is a logical point in time at which a store becomes visible to all other cores (global visibility). – pveentjer Jun 15 '21 at 02:55
  • Flushing to disk is not part of the question. And modern CPUs don't flush the cache to RAM. – pveentjer Jun 15 '21 at 02:58
  • The only thing that needs to be taken care of is that, in case of a context switch, there is a [StoreLoad] that ensures the store buffer is drained to the cache, so that a process is guaranteed to see its own writes if it gets scheduled on a different CPU. – pveentjer Jun 15 '21 at 02:59
  • I would suggest reading up on the topic. You can download the following book for free and it is one of the best resources I know of on this topic: https://www.morganclaypool.com/doi/abs/10.2200/S00962ED2V01Y201910CAC049 – pveentjer Jun 15 '21 at 03:01
  • I meant flushing to RAM, not disk. Quote from the paper: ".. all of the variants make one processor’s write visible to the other processors by propagating the write to all caches ... But protocols differ in when and how the syncing happens.". That's the issue. The whole reason why `volatile` and memory fences are necessary is to ensure the timing between data race conditions. If CPU1 updates location X and then CPU2 reads from location X, CPU2 can see it as the old value or the new value based on the timing of the coherence systems. This is especially true of XC model architectures. – Gray Jun 15 '21 at 16:59
  • A coherence protocol needs to follow certain rules and one of them is that once a store is globally visible, no thread can see an old value. Memory fences have nothing to do with timing; their primary function is to guarantee a specific ordering of loads and stores, executed by a single CPU, on the global memory order. Also there is no data race condition; it is either a data race or a race condition. – pveentjer Jun 15 '21 at 17:10
  • Did you even read the paper you quoted? It specifically gives a data race example and mentions that a memory fence is required to ensure timely updates. Also a search on "data race" gets 312k results and "data race condition" gets 260k results. – Gray Jun 15 '21 at 17:44
  • Yes, I have studied the paper and am still studying it, since there is so much information. I would suggest you read the first 2 or 3 chapters (cache coherence, sequential consistency and TSO) before making any further claims. – pveentjer Jun 15 '21 at 18:36
  • @pveentjer thanks for the answer and helpful comments. Can you please mention [StoreLoad] in the answer as well? By the way, what exactly is a store buffer? Is it related to CPU registers? – Amit Mittal Jun 18 '21 at 21:29
  • @Gray thanks for the conversation. It prompted me to do a lot more reading and more intense googling. In none of the texts I read, though, is cache coherency described as asynchronous. In fact, to me it would be kind of fruitless to have an elaborate coherency protocol if it could be broken so easily (i.e. a CPU could read a stale value before the protocol ensures the correct value is in its cache). Why would CPU vendors incur that cost then? The protocol kicks in as soon as a CPU suffers a cache miss, tries to write to a shared value, or tries to read a value changed by some other CPU. – Amit Mittal Jun 18 '21 at 21:40
  • A store buffer is a buffer to store stores :) The problem with a store is that it can only be made when the cache line is in the MODIFIED or EXCLUSIVE state. If the cache line isn't in the right state, it needs to be invalidated and the CPU needs to wait for the ACKs of the other cores before the write can be written to the cache. This can take a lot of time and could lead to a serious stall of the CPU. The store buffer helps to hide this latency by queuing stores and letting the CPU continue executing instructions. – pveentjer Jun 19 '21 at 01:29
  • So the store buffer sits between the CPU and the cache system. For the CPU the store buffer is invisible; if it does a load, the load will look in the store buffer as well, to prevent the CPU from missing its own store. The problem is that the existence of the store buffer can lead to an older store (in the store buffer) being reordered with a newer load to a different address. As a consequence, the CPU isn't sequentially consistent. The [StoreLoad] barrier prevents this situation. – pveentjer Jun 19 '21 at 01:32
  • And the cache protocol is impossible to break. Caches are always coherent. I'm not sure if Gray understands the topic. – pveentjer Jun 19 '21 at 01:33
  • X86 is an ISA that makes use of store buffers (most modern CPUs do, like ARM). The memory model X86 offers is called TSO (Total Store Order). It isn't sequentially consistent, because an older store can be reordered with a newer load; but it does provide a total order over all stores, because stores from a single CPU are not reordered ([StoreStore] is automatic) and the cache is coherent. – pveentjer Jun 19 '21 at 01:36
  • Thanks @pveentjer. The topic is a lot clearer now. – Amit Mittal Jun 19 '21 at 14:56
2

On a single thread, there is a happens-before relationship between the actions that take place, regardless of how the scheduling is done. This is enforced by the JVM implementation as part of the Java Memory Model contract promised in the Java Language Specification:

Two actions can be ordered by a happens-before relationship. If one action happens-before another, then the first is visible to and ordered before the second.

If we have two actions x and y, we write hb(x, y) to indicate that x happens-before y.

  • If x and y are actions of the same thread and x comes before y in program order, then hb(x, y).

How exactly this is achieved by the operating system is implementation dependent.
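As a minimal illustration of that rule (the class and field names are my own): within one thread, the write happens-before the later read by program order alone, so the read sees the written value no matter how often the scheduler migrates the thread between CPUs.

```java
public class SingleThreadVisibility {
    private int value;   // deliberately not volatile

    void run() {
        value = 42;          // action x: comes first in program order
        // ... the OS may migrate this thread to another CPU at any point here ...
        int copy = value;    // action y: hb(x, y), so 42 is guaranteed to be read
        System.out.println(copy);
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t = new Thread(() -> new SingleThreadVisibility().run());
        t.start();
        t.join();
    }
}
```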

Basil Bourque
M A
  • Thanks for the answer. Why didn't Google throw up that page when I was searching like mad :). Does this mean it is ensured by the JVM, and that unmanaged languages like C++ suffer from this problem? – Amit Mittal Jun 10 '21 at 18:49
  • @AmitMittal No I believe this is managed by the hardware, so should not be a problem for other languages. See https://stackoverflow.com/questions/40256057/data-visibility-on-multi-core-processor-by-single-thread. – M A Jun 10 '21 at 19:02
  • Thanks a lot. That's the exact question I had. Either Google failed me or I am yet to learn how to use Google correctly :) – Amit Mittal Jun 10 '21 at 19:05
  • Sorry, I had to 'undo' the acceptance of the answer. The answer on the linked question talks about registers getting saved as part of the context switch but does not really answer why the cache does not cause a problem. – Amit Mittal Jun 10 '21 at 20:06
  • @AmitMittal Found a better post: https://software.rajivprab.com/2018/04/29/myths-programmers-believe-about-cpu-caches/. The other question is still good IMO because it also mentions cache coherence. – M A Jun 11 '21 at 06:17
0

For those who might not go through all the comments on the different answers, here is a simplified summary of what I have modelled in my head (please feel free to comment if any point is not correct; I will edit this post):

  1. http://tutorials.jenkov.com/java-concurrency/volatile.html is not accurate and gives rise to questions like this one. CPU caches are always coherent. If CPU 1 has written to memory address X in its cache and CPU 2 later reads the same memory address from its own cache, then CPU 2 will read what was written by CPU 1. No special instruction is required to enforce this.
  2. However, modern CPUs also have store buffers. They are used to accumulate writes in a buffer in order to improve performance (so that these writes can be committed to the cache in their own time, freeing the CPU from waiting for the cache coherence protocol to finish).
  3. Whatever is in the store buffer of a CPU is not yet visible to other CPUs.
  4. In addition, in order to improve performance, CPUs and compilers are free to re-order instructions as long as this does not change the outcome of the computation (from a single thread's point of view).
  5. Also, some compiler optimizations may keep a variable entirely in CPU registers for a routine, thereby 'hiding' it from shared memory and hence making writes to that variable invisible to other threads.
  6. Points 3, 4 and 5 above are the reason why Java exposes keywords like volatile. When you use volatile, the JVM itself does not re-order instructions if that would break the 'happens-before' guarantee. The JVM also asks the CPU not to re-order, by emitting memory barrier/fence instructions, and it refrains from any optimization that would break the 'happens-before' guarantee. Overall, once a write to a volatile field has happened, any subsequent read of that field by another thread is guaranteed to see the correct value, not only for that field but also for all fields that were visible to the first thread at the time of the volatile write (see the sketch after this list).
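Here is a minimal sketch of point 6 (the field names `payload` and `ready` are my own, chosen for illustration): the plain write to `payload` becomes visible to the reader because it is ordered before the volatile write to `ready`.

```java
public class SafePublication {
    private static int payload;               // plain data, published below
    private static volatile boolean ready;    // volatile guard

    public static void main(String[] args) throws InterruptedException {
        Thread reader = new Thread(() -> {
            while (!ready) {
                // spin until the writer's volatile store becomes visible
            }
            // The volatile write to 'ready' happens-before this volatile read,
            // so the earlier plain write to 'payload' is also guaranteed visible.
            System.out.println("payload = " + payload);   // prints 42
        });
        reader.start();

        payload = 42;    // plain write, ordered before the volatile write below
        ready = true;    // volatile write: publishes 'payload' along with it
        reader.join();
    }
}
```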

How does the above relate to this question about a single thread using different CPUs during its lifetime?

  1. If the thread, while executing on a CPU, has already written to that CPU's cache, there is nothing more to consider here. Even if the thread later uses another CPU, it will be able to see its own writes thanks to cache coherency.
  2. If the thread's writes are still waiting in the store buffer when it gets moved off the CPU, context switching ensures that those writes get committed from the store buffer to the cache. After that it is the same as point 1.
  3. Any state that exists only in CPU registers gets backed up and restored as part of context switching anyway.

Due to the above points, a single thread does not face any problem when it executes on different CPUs during its lifetime.

Amit Mittal
0

it is not clear to me why that does not cause problems similar to the ones seen with different threads on different CPUs? After all, this thread may have made an update in one CPU's cache which has not yet been written to RAM. If this thread now gets scheduled on another CPU, wouldn't it have access to stale data?

Yes, it may have access to stale data, but it is more likely to have data in its cache that is unhelpful – not relevant to the memory that it needs. First off, the permissions from the OS (if written correctly) won't let one program see the data of another – yes, there are many stories about hardware vulnerabilities in the news these days, so I am talking about how it should work. The cache will be cleared if another process gets swapped onto a CPU.

Whether or not the cache memory is stale is a function of the timing of the architecture's cache coherence system and of whether memory fences are crossed.

context switching of threads involves writing all the data visible to the thread back to RAM, and that when a thread gets scheduled on a CPU, its cache gets refreshed from RAM (to prevent the thread from seeing stale values).

That's pretty close to what happens, although the cache memory is not refreshed when the thread gets scheduled. When a thread is context-switched out of the CPU, all dirty pages of memory are flushed to RAM. When a thread is swapped onto a CPU, the cache is either flushed (if the incoming thread belongs to another process) or contains memory that may not be helpful to the incoming thread. This causes a much higher miss ratio on the initial memory accesses, meaning that the thread spends longer accessing memory until the cache lines it needs are loaded into the cache.

However, this looks problematic from a performance point of view, as time-slicing means threads are getting scheduled all the time.

Yes, there is a performance hit. This highlights why it is so important to properly size your thread pools. If you have too many threads running CPU-intensive tasks, you can cause a loss in performance because of the context switches. If the threads are waiting for IO, then increasing the number of threads is a must, but if you are just calculating something, using fewer threads can result in higher throughput, because each thread stays on its processor longer and the ratio of cache hits to misses goes up.
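A hedged sketch of that sizing advice using the standard `java.util.concurrent` API (the pool names and the blocking factor of 4 are illustrative assumptions, not figures from the answer):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolSizing {
    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();

        // CPU-bound work: roughly one thread per core keeps context switches,
        // and the cache misses they cause, to a minimum.
        ExecutorService cpuPool = Executors.newFixedThreadPool(cores);

        // IO-bound work: threads spend most of their time blocked, so a larger
        // pool is needed to keep the cores busy (factor chosen for illustration).
        ExecutorService ioPool = Executors.newFixedThreadPool(cores * 4);

        cpuPool.shutdown();
        ioPool.shutdown();
    }
}
```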

Gray
  • Modern caches and MMUs make use of an address space identifier (ASID) to prevent different processes from accessing each other's cache entries/TLB entries. So no flushing of the cache is needed. – pveentjer Jun 15 '21 at 03:21
  • "Whether or not the cache memory is stale or not is a function of whether or not the memory has been appropriately marked as volatile or if you have memory barriers in place." This is incorrect. Caches are always coherent. The only purpose of a fences is to enforce the order of loads/stores on the global memory order. Because (CPU) caches are coherent, by definition, they can't be stale. – pveentjer Jun 15 '21 at 05:17
  • It is just incorrect to use the phrase "always coherent" @pveentjer. There are coherence protocols in modern architectures but there are timing issues with these protocols. That's the whole reason why memory fences exist. – Gray Jun 15 '21 at 17:09
  • Modern CPUs have coherent caches. I don't know a single mainstream CPU of the last 3 decades that has incoherent caches. Even the DEC Alpha, the king of relaxed memory order, has a coherent cache. The only microprocessor I know of that has an incoherent cache is a GPU. You are wrong about fences; the primary purpose of a fence is to guarantee that loads and stores, issued by a single CPU, happen in a particular order on the global memory order. – pveentjer Jun 15 '21 at 17:15
  • I'm not sure you are understanding my point. They have coherent caches but it's not a _synchronous_ coherency. See: https://www.tutorialspoint.com/parallel_computer_architecture/parallel_computer_architecture_cache_coherence_synchronization.htm – Gray Jun 15 '21 at 17:47