
The following results were produced on an Intel (Haswell) x86_64 processor. Assume we have an integer array, int32_t A[32] = {0}, and consider the following function, where func(&A[i]) is called from two different threads.

void func(int32_t *elem) {
  for (int i = 0; i < 10000; i++)
    *elem = *elem + 1;
}

Assume Thread1 calls func(&A[0]). Consider the following three scenarios for calling func on Thread2:

  1. Thread2 calls func(&A[16]): there are no issues, as A[0] and A[16] reside on separate cache lines, and the final results are correct: A[0]==10000, A[16]==10000.

  2. Thread2 calls func(&A[1]): there is a performance hit due to false sharing, since A[0] and A[1] reside on the same cache line. However, the final results are correct: A[0]==10000, A[1]==10000.

  3. Thread2 calls func(&A[0]): we get wrong results due to a race condition, i.e. A[0] < 20000.

Question: since the cache line being accessed by the two threads is the same in scenarios 2 and 3, why does a cache coherency protocol like MESI not protect the results in scenario 3, while it ensures that the results are correct in scenario 2?

Thanks

AliYekta
  • You need to precise the architecture as this is clearly architecture dependent. I guess your target is x86-64 processors isn't it? I may be also better to precise if you specifically target Intel/AMD/etc. processors and even the generation (things change since decades). The compiler may also be important. – Jérôme Richard Apr 24 '21 at 22:42
  • Thanks Jerome, Indeed I'm looking into x86-64 and the tests I did was with Intel processor (Xeon E5-2695 v4), I did not know that generations could affect this as well, thanks for pointing that. I've added this info to the question. Thank you. – AliYekta Apr 24 '21 at 22:54
  • Does this answer your question? [Cache Coherence and race condition](https://stackoverflow.com/questions/41199798/cache-coherence-and-race-condition) – Tsyvarev Apr 26 '21 at 08:05

1 Answer


The cache coherency protocol doesn't protect scenario 3 because *elem = *elem + 1 is not atomic (at least, it's not atomic in C/C++, as described in this question).

From the Wikipedia article, cache coherence guarantees:

Reads/Writes to a single memory location must be seen by all processors in the same order.

But this is about individual operations. *elem = *elem + 1 consists of two memory operations: a read, then a write.
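The non-atomicity is easier to see if the increment is written out as its separate steps. This is a hand-expanded sketch (tmp stands in for a hypothetical compiler-generated register), behaviorally equivalent to the loop in the question:

```cpp
#include <cstdint>

void func_expanded(int32_t *elem) {
  for (int i = 0; i < 10000; i++) {
    int32_t tmp = *elem;  // 1) read: load the current value into a register
    tmp = tmp + 1;        // 2) compute: in-register only, invisible to caches
    *elem = tmp;          // 3) write: store back a value that may be stale
  }
}
```

Another thread can read or write *elem between steps 1 and 3; cache coherence orders each individual load and store, but says nothing about keeping the pair together.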

Consider two threads doing *elem = *elem + 1 on different CPUs, with the initial value of elem1 being 0. Because each CPU performs two operations, a read and then a write, the operations might be sequenced as below. Let's mark each memory operation with a real-world sequence number, so an earlier number occurs strictly before a later one:

CPU1                   CPU2
---------------------------------------------
1) Read elem1 (0)
   Calculate 0+1
                       2) Read elem1 (0)
                          Calculate 0+1
3) Write elem1 = 1
                       4) Write elem1 = 1

Each cache sees the Read/Write sequence in the same order, 1-2-3-4 (Read-Read-Write-Write), which is compliant with cache coherence, but there is still a race condition and the result is still wrong (elem1 == 1 at the end).

If we write down each step using the MESI protocol, it also shows that both caches see the Reads/Writes in the same order 1-2-3-4, but the result is still wrong:

[Image: step-by-step MESI state trace for the interleaving above]

Renat
  • Thank you Renat, this is really great and puts into perspective the solution to my misconception of cache coherency vs register values being out of date. Thank you – AliYekta Apr 26 '21 at 08:24