I am trying to understand the nature of the atomic add operation, so I am running the following code on a Broadwell machine.
#include &lt;atomic&gt;
#include &lt;cstdint&gt;
#include &lt;omp.h&gt;

using std::atomic;

// The definitions of SharedData_t, trueSharingData, and ParseArg are elided
// from this post; a cache-line-sized struct is assumed here.
struct SharedData_t {
    atomic&lt;uint64_t&gt; data[8]; // 8 x 8 bytes = one 64-byte cache line
};
SharedData_t trueSharingData;
void ParseArg(int argc, char **argv, int &amp;nThreads, float &amp;shareFrac, uint64_t &amp;nIter);

int main(int argc, char **argv) {
    int nThreads = -1;
    float shareFrac = -1;
    uint64_t nIter = -1;
    ParseArg(argc, argv, nThreads, shareFrac, nIter);

    atomic&lt;uint64_t&gt; justToAvoidCompilerOptimization{0};

#pragma omp parallel num_threads(nThreads)
    {
        // tsData is shared by all threads; privateData is per-thread
        // (intentionally leaked so each thread keeps its own cache line).
        atomic&lt;uint64_t&gt; *tsData = &amp;trueSharingData.data[0];
        atomic&lt;uint64_t&gt; *privateData = &amp;(new SharedData_t())-&gt;data[0];
        for (uint64_t i = 0; i &lt; nIter; i++) {
            // Use RDTSC as a proxy random number generator.
            uint32_t lo, hi;
            asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
            int rNum = (lo % 54121) % 100; // mod by a prime
            // If the random number is &lt; shareFrac * 100, perform a shared
            // memory operation (shareFrac is in [0, 1], rNum is in [0, 100)).
            if (rNum &lt; shareFrac * 100) {
                *tsData += rNum;
            } else {
                *privateData += rNum;
            }
        }
        justToAvoidCompilerOptimization += *tsData;
        justToAvoidCompilerOptimization += *privateData;
    }
    return justToAvoidCompilerOptimization.load() ^
           justToAvoidCompilerOptimization.load();
}
In this code, each thread performs the atomic add operation nIter times, with nIter being the loop trip count. In each iteration, the atomic add is performed on either a shared memory location or a thread-local variable. The fraction of iterations that perform the atomic add on the shared location is determined by the parameter shareFrac. For example, if shareFrac is 0.3 and nIter is 1000, the atomic add is expected to hit the shared location approximately 300 times.
So, I performed a little experiment where I ran this code a number of times with increasing shareFrac values. For each run, I counted the occurrences of the L2_RQSTS.RFO_MISS event using perf and compared the counts reported by perf with the expected counts. The expected count is simply nThreads * nIter * shareFrac.
The results are as follows:

[Figure: RFO miss counts vs. shareFrac, nThreads = 2, nIter = 100 million]

[Figure: RFO miss counts vs. shareFrac, nThreads = 8, nIter = 100 million]
As can be seen in the figures, the RFO miss counts exceed the expected counts in most of the runs. How is this possible? A possible explanation is that an atomic add brings the line in with an RFO, hoping to read and then update it, but the line can be stolen between the read and the write, in which case it must be brought back. To the best of my knowledge, however, for atomic operations on x86 the cache line is locked, so it must not be stolen once it has been brought in with exclusive permission. Or is my understanding incorrect?
To eliminate the possibility of cache-line transfers caused by prefetching, I also disabled the hardware prefetchers on all cores of the machine before collecting these results.