
The following results were produced on an Intel (Haswell) x86_64 processor. Assume we have an integer array, int32_t A[32] = {0}, and consider the following function, where func(&A[i]) is called from two different threads.

void func(int32_t *elem) {
  for (int i = 0; i < 10000; i++)
    *elem = *elem + 1;
}

Assume Thread1 calls func(&A[0]). Consider the following three scenarios for calling func on Thread2:

  1. Thread2 calls func(&A[16]): there are no issues, as A[0] and A[16] reside on separate cache lines, and the final results are correct: A[0]==10000, A[16]==10000.

  2. Thread2 calls func(&A[1]): there is a performance hit due to false sharing, since A[0] and A[1] reside on the same cache line. However, the final results are correct: A[0]==10000, A[1]==10000.

  3. Thread2 calls func(&A[0]): we get wrong results due to a race condition, i.e. A[0] < 20000.

Question: since the cache line being accessed by the two threads is the same in scenarios 2 and 3, why does a cache coherency protocol like MESI not protect the results in scenario 3, while it ensures that the results are correct in scenario 2?

Thanks

AliYekta
  • You need to precise the architecture as this is clearly architecture dependent. I guess your target is x86-64 processors isn't it? I may be also better to precise if you specifically target Intel/AMD/etc. processors and even the generation (things change since decades). The compiler may also be important. – Jérôme Richard Apr 24 '21 at 22:42
  • Thanks Jerome, Indeed I'm looking into x86-64 and the tests I did was with Intel processor (Xeon E5-2695 v4), I did not know that generations could affect this as well, thanks for pointing that. I've added this info to the question. Thank you. – AliYekta Apr 24 '21 at 22:54
  • Does this answer your question? [Cache Coherence and race condition](https://stackoverflow.com/questions/41199798/cache-coherence-and-race-condition) – Tsyvarev Apr 26 '21 at 08:05

1 Answer


The cache coherency protocol doesn't protect scenario 3 because *elem = *elem + 1 is not atomic (at least, it's not atomic in C/C++, as described in this question).

From the Wikipedia article, cache coherence guarantees:

Reads/Writes to a single memory location must be seen by all processors in the same order.

But this is about individual operations. *elem = *elem + 1 consists of two memory operations: a read, then a write.
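The non-atomicity is easier to see if the increment is written out as its separate steps. This is a hand-expanded sketch (tmp stands in for a hypothetical compiler-generated register), behaviorally equivalent to the loop in the question:

```cpp
#include <cstdint>

void func_expanded(int32_t *elem) {
  for (int i = 0; i < 10000; i++) {
    int32_t tmp = *elem;  // 1) read: load the current value into a register
    tmp = tmp + 1;        // 2) compute: in-register only, invisible to caches
    *elem = tmp;          // 3) write: store back a value that may be stale
  }
}
```

Another thread can read or write *elem between steps 1 and 3; cache coherence orders each individual load and store, but says nothing about keeping the pair together.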

Consider two threads doing *elem = *elem + 1 on different CPUs, with the initial value of elem1 being 0. Because each CPU performs two operations, a read and then a write, the operations might be sequenced as below. Let's mark each memory operation with a real-world sequence number, so an earlier number occurs strictly before a later one:

CPU1                   CPU2
---------------------------------------------
1) Read elem1 (0)
   Calculate 0+1
                       2) Read elem1 (0)
                          Calculate 0+1
3) Write elem1 = 1
                       4) Write elem1 = 1

Each cache sees the Read/Write sequence in the same order, 1-2-3-4 (Read-Read-Write-Write), which is compliant with cache coherence, but there is still a race condition and the result is still wrong (elem1 == 1 at the end).

If we write down each step using the MESI protocol, it also shows that both caches see the Reads/Writes in the same order 1-2-3-4, but the result is still wrong:

[Image: step-by-step MESI state trace for the interleaving above]

Renat
  • Thank you Renat, this is really great and puts into perspective the solution to my misconception of cache coherency vs register values being out of date. Thank you – AliYekta Apr 26 '21 at 08:24