
Set-up: Intel Ivy Bridge Core i7, compiling in 64-bit mode with MSVC 2012, on Windows 7 64-bit.

I am trying to understand whether atomic increments cause cache misses.

I set up a test where an atomic variable and another variable were placed either in the same cache line or in different cache lines, and compared the cache misses in each case. Code and results below.

Results

Different cache lines:

  • The atomic increment had no L1 cache misses.
  • Both increments of d.a suffered 40-50% L1 cache misses.

Same cache line:

  • Incrementing d.a had no cache misses
  • Incrementing the atomic encountered 100% L1 cache misses.

Could someone please explain this?! I was expecting that when the atomic was in the same cache line as d.a, d.a would suffer 100% cache misses, and that when they were in different cache lines d.a would not be affected.

#include <atomic>
#include <iostream>
#include <iomanip>
#include <vector>

//Structure to ensure 'a' and 'atom' are not in the same cache line (the doubles push 'atom' well past the first 64 bytes)
__declspec(align(64)) struct S{
  volatile double a,b,d,c,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t;
  volatile std::atomic<short> atom;
};

//Structure to ensure 'a' and 'atom' share the same cache line
/*__declspec(align(64)) struct S{
   volatile std::atomic<short> atom;
   volatile short a;
};*/


int main(){

    volatile S d;

    for(long long i=0; i<1000000000; i++){
        d.a++;
        d.atom++;
        d.a++;
    }
}
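
For clarity, the same two layouts can be written with explicit padding instead of a block of doubles. This is only a sketch, assuming a 64-byte cache line; S_separate and S_shared are illustrative names, not part of the original test:

//Explicitly padded variant of the "not in same cache line" structure:
//'a' occupies the first 64-byte line, 'atom' starts on the next one.
__declspec(align(64)) struct S_separate{
  volatile double a;
  char pad[64 - sizeof(double)];
  volatile std::atomic<short> atom;
};

//Variant of the "same cache line" structure: both members sit
//within the first 64 bytes of the aligned object.
__declspec(align(64)) struct S_shared{
  volatile std::atomic<short> atom;
  volatile short a;
};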

UPDATE: here is some of the generated assembly, interleaved with the source of `_Fetch_add_seq_cst_2`:

    /* _Atomic_fetch_add_2, _Atomic_fetch_sub_2 */
inline _Uint2_t _Fetch_add_seq_cst_2(volatile _Uint2_t *_Tgt, _Uint2_t _Value)
    {   /* add _Value to *_Tgt atomically with
           sequentially consistent memory order */
 mov         word ptr [_Tgt],dx
 mov         qword ptr [rsp+8],rcx
 push        rdi
    return (_INTRIN_SEQ_CST(_InterlockedExchangeAdd16)((volatile short *)_Tgt, _Value));
 movzx       eax,word ptr [_Value]
 mov         rcx,qword ptr [_Tgt]
 lock xadd   word ptr [rcx],ax
    }
 pop         rdi
intrigued_66
  • If you compare the atomics / Interlocked APIs and their behaviour on Linux, FreeBSD, Windows and possibly other OSes, you will see that posting the code without mentioning where you run it makes it hard to give a good answer to your question. Also, I wonder how you measure the cache misses. Maybe it would help to reproduce the code using low-level, OS-specific APIs. There are choices in how an atomic increment can be done; on Windows, for example, NUMA-specific variations exist, mostly InterlockedIncrement/Acquire/Release etc. `#include <atomic>` does not show all that. – BitTickler Jun 08 '14 at 04:02
  • Have added more details- using Win 7 and MSVC 11 (2012). Using Intel VTune to measure cache misses. – intrigued_66 Jun 08 '14 at 04:39
  • Stepping inside until you find the "beef" yields that the atomic operator++ uses `_InterlockedExchangeAdd16`. Reading [MSDN: InterlockedExchangeAdd()](http://msdn.microsoft.com/en-us/library/ms683597(v=vs.85).aspx) and hoping that the 16-bit version works accordingly, you chose a "full memory barrier" version of the Windows Interlocked API. This comment is intended to save others some time finding out what they are actually looking at. – BitTickler Jun 08 '14 at 05:19
  • Just to rule out the possibility that your assumption about the L1 cache line size does not match the content of your structure, you could add a runtime check to your code (a sketch follows these comments), using [How to find out cache line size](http://stackoverflow.com/questions/150294/how-to-programmatically-get-the-cpu-cache-page-size-in-c). – BitTickler Jun 08 '14 at 05:37
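
A minimal sketch of the runtime check suggested in the last comment, using GetLogicalProcessorInformation on Windows (error handling kept to a bare minimum; this is one possible way to do it, not code from the thread):

#include <windows.h>
#include <iostream>
#include <vector>

//Query the L1 data-cache line size so the 64-byte assumption can be verified at runtime.
DWORD l1LineSize(){
    DWORD len = 0;
    GetLogicalProcessorInformation(NULL, &len);   //first call only reports the required buffer size
    if(len == 0)
        return 0;
    std::vector<SYSTEM_LOGICAL_PROCESSOR_INFORMATION> info(len / sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION));
    if(!GetLogicalProcessorInformation(&info[0], &len))
        return 0;
    for(size_t i = 0; i < info.size(); i++)
        if(info[i].Relationship == RelationCache && info[i].Cache.Level == 1 && info[i].Cache.Type == CacheData)
            return info[i].Cache.LineSize;        //CACHE_DESCRIPTOR::LineSize, in bytes
    return 0;
}

int main(){
    std::cout << "L1 data cache line size: " << l1LineSize() << " bytes\n";
}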

2 Answers


Looking at this sequence:

for(long long i=0; i<1000000000; i++){
    d.a++;
    d.atom++;
    d.a++;
}

we could rewrite it as (roughly):

for(long long i=0; i<1000000000; i+=4){
    d.a++;
    d.atom++;
    d.a++;
    d.a++;
    d.atom++;
    d.a++;
    d.a++;
    d.atom++;
    d.a++;
    d.a++;
    d.atom++;
    d.a++;
}

I could go on and expand the loop further, but it becomes obvious that you have two d.a++ in a row after the d.atom++.

In other words, you should expect around a 50% cache-miss rate on d.a++, based on the d.a++ at the end of the loop fetching the data for the d.a++ in the next iteration of the loop. Any discrepancy from this would be measurement error (and cache misses, I believe, are measured statistically, not as exact counts on exact lines).

In the case where d.a and d.atom are on different cache lines, obviously d.atom++ gets all of the cache misses accounted to that particular address, which explains the 100% figure in that case.

I'm not 100% sure that the definition of locked operations ("locked operation" = atomic in x86 land) REQUIRES a cache-flush, but it certainly requires "exclusive access", which means that all other CPUs (cores) will need to be told "you must now flush any copy of this data from your cache". It would appear from your testing that, at least on this model of processor, this equates to "flush all caches for this line", including the one currently holding the data.
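
That last point can be made visible with a small cross-core experiment. This is not the original single-threaded test, only a sketch assuming a 64-byte line: one thread hammers the atomic while another increments a plain counter that either shares the atomic's cache line or sits on its own line.

#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

__declspec(align(64)) struct Shared{
    std::atomic<short> atom;      //offset 0
    volatile double near_a;       //offset 8, same 64-byte line as 'atom'
    char pad[64];
    volatile double far_a;        //offset 80, a different line
};

//Time a fixed number of plain increments of *p while a second thread keeps incrementing the atomic.
double timeIncrements(volatile double* p, std::atomic<short>& atom){
    std::atomic<bool> stop(false);
    std::thread writer([&]{ while(!stop.load()) atom.fetch_add(1); });
    std::chrono::steady_clock::time_point t0 = std::chrono::steady_clock::now();
    for(long long i = 0; i < 20000000; i++) (*p)++;
    std::chrono::steady_clock::time_point t1 = std::chrono::steady_clock::now();
    stop.store(true);
    writer.join();
    return std::chrono::duration<double>(t1 - t0).count();
}

int main(){
    Shared s;
    s.atom.store(0);
    s.near_a = 0;
    s.far_a = 0;
    std::cout << "counter on same line as atomic: " << timeIncrements(&s.near_a, s.atom) << " s\n";
    std::cout << "counter on a different line:    " << timeIncrements(&s.far_a,  s.atom) << " s\n";
}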

Mats Petersson
  • Thanks for your answer. Two questions: Why does the double-consecutive increment of d.a mean a cache miss? Surely if d.a has just been incremented then the value is still in the register and no cache load is required? – intrigued_66 Jun 08 '14 at 14:30
  • Is there any way I could adjust my test to provide a stronger case for the atomic increment flushing cache lines? – intrigued_66 Jun 08 '14 at 14:30
  • Since `d.a` is `volatile`, the compiler is not allowed to load it into a register and make more than one update before it's written back to memory, so each `d.a++` will result in a separate memory access to that cache line. Why do you need a stronger case? I'd say your results are pretty conclusive. – Mats Petersson Jun 08 '14 at 18:56
  • But this is all being done on a single core so I don't see why the cache would be flushed? :s – intrigued_66 Jun 08 '14 at 22:24
  • Could you elaborate on this bit: "In the case where d.a and d.atom are on different cache lines, obviously d.atom++ gets all of the cache-misses accounted on that particular address, explaining the 100% figure in that case." I don't understand why d.atom would get so many cache misses? Unless we are definitely stating the CPU flushes the cache when it makes an atomic update? – intrigued_66 Jun 08 '14 at 22:26
  • Well, that's what your experiment shows. There is no reason why you would get ANY cache misses (well, one miss perhaps for the first iteration of your loop) when you have data that is 16 bytes long and cache-aligned. Note that this just shows that the implementation you are testing does cache-flushes. It is not required by any specification that it does, just that "any update is seen immediately by any other processor". – Mats Petersson Jun 09 '14 at 07:30
  • I still suggest the code be changed to use InterlockedIncrement instead of InterlockedExchangeAdd. The topic is about InterlockedIncrement flushing caches, after all. And who knows whether the results will then be the same? – BitTickler Jun 09 '14 at 10:45
  • Using std::atomic on a short adds too many unknowns to the equation compared to using a naked InterlockedIncrement() on a long (a sketch of such a reduced test follows these comments). – BitTickler Jun 09 '14 at 10:51
  • @user2225104 surely it would be better to test using inline LOCK XADD assembly? That is the real "meat" doing the atomic increment. The Windows API functions are just window dressing..... surely? – intrigued_66 Jun 09 '14 at 16:07
  • It is good engineering practice to reduce problems to their minimum. Using std::atomic does not really achieve that. Using the overloaded operator++() and hoping it does an increment rather than an atomic addition works against that minimalism. And how long will you have to think about the question: would ++d.a have changed anything compared to d.a++? If it is longer than 0 nanoseconds, you will see my point clearly ;) – BitTickler Jun 11 '14 at 11:01
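
A minimal sketch of the reduced test suggested in the comments above: the same loop, but with a naked InterlockedIncrement() on a LONG instead of std::atomic<short>::operator++ (assuming a 64-byte cache line; drop the padding member to put both variables back on one line):

#include <windows.h>

//Reduced test: plain LONG plus InterlockedIncrement, no std::atomic involved.
__declspec(align(64)) struct R{
    volatile double a;
    char pad[64 - sizeof(double)];    //pushes 'counter' onto the next cache line
    volatile LONG counter;
};

int main(){
    R r = {};
    for(long long i = 0; i < 1000000000; i++){
        r.a++;
        InterlockedIncrement(&r.counter);    //full-barrier atomic increment (lock-prefixed on x86)
        r.a++;
    }
    return 0;
}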

See the SO discussion What is a memory fence?. This is how modern Intel CPUs implement lock instructions. A memory barrier enforces that data is written fully to memory and then reread on the next access. In the case of [ref to A] [barrier] [ref to A], the CPU cannot do any clever prefetch for the second reference. All barrier instructions operate in the same way, though some allow you to explicitly limit them to a left or right fence mode.

Executing a barrier instruction incurs a performance hit. The character of this hit varies from one CPU architecture to another, and the internals of the algorithms that implement it in hardware are not something Intel publishes in detail.
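
As a concrete illustration of the [ref to A] [barrier] [ref to A] pattern described above (only a sketch; the original test uses a lock-prefixed instruction rather than an explicit fence instruction):

#include <intrin.h>

volatile double a = 0.0;

void refBarrierRef(){
    a++;            //[ref to A]: load, increment, store
    _mm_mfence();   //[barrier]: MFENCE, a full memory fence on x86
    a++;            //[ref to A]: this access is ordered after the fence
}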

TerryE
  • Maybe it would be more adequate to say "Modern CPUs implement lock instructions under consideration of memory barriers" instead of "lock instructions are implemented as memory barriers". This is why the WIN32 API offers alternatives such as InterlockedIncrement()/InterlockedIncrementAcquire()/... As for the OP's code, using the compiler intrinsic _InterlockedExchangeAdd16, we can only theorize about the fencing this intrinsic applies. – BitTickler Jun 11 '14 at 11:21
  • I was thinking in terms of the underlying Intel machine code rather than how compilers such as GCC or VSC + WinAPIs expose them at a source level. And in fact the old lock prefixes are now deprecated in favour of the *fence instructions. See [this thread](http://stackoverflow.com/questions/4232660/which-is-a-better-write-barrier-on-x86-lockaddl-or-xchgl) amongst others for more discussion. – TerryE Jun 11 '14 at 15:17
  • It is indeed a confusing topic. There are actually two stakeholders in the fencing business: the CPU (using the lock prefix, LFENCE, MFENCE, SFENCE) and the compiler, which needs to honour those reordering constraints too. If you look, for example, at InterlockedIncrement(), the MSDN documentation states that it creates a full fence. But if you look at the generated assembly code, there is no MFENCE to be seen, just a lock xadd instruction (on x86...). – BitTickler Jun 13 '14 at 07:00