Compare-And-Swap not working on many Cores

Question

When I first discovered the "CAS" instruction, I understood well how it could work for threads running on a single CPU, but I was surprised that it could also work across many CPUs.

Yesterday, I had my first opportunity to try it in one of my projects. I implemented it and it worked fine; all my unit tests were green. Perfect.

But today, I ran my unit tests on another machine and they are now failing. Less perfect.

The main difference between the two machines is that the first one (the one on which the unit tests are green) is a quite old laptop with only one core! The second one is a more recent and more powerful i7...

Now, on my i7, if I force my unit-tests to run on one single core, they become successful. I do this by running

taskset -c <cpu-id> my-unit-test

Legitimately, my original question comes back: does CAS work on many cores? OK, according to what I have read, I would be surprised if it didn't...

So what? I hope it comes from a bug in my code. To give you more information, I have a class with a critical section. I added an attribute

bool m_isBeingModified;

It is initialized to false. Moreover, at the beginning of my critical section, I run the function

inline void waitForClassBeingModified()
{
  while (!__sync_bool_compare_and_swap(&m_isBeingModified, false, true))
  {} /// I consider that I can use such a busy loop, as my critical section is very light/short
}

Finally, at the end of my critical section, I reset my boolean variable

m_isBeingModified = false;

I tried declaring my attribute as volatile, but it did not change anything: my unit tests are still failing.

One last piece of information:

gcc --version
gcc (Ubuntu 6.2.0-5ubuntu12) 6.2.0 20161005
Copyright (C) 2016 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Thank you for your help

Are you rolling your own mutex as a learning experience or are you writing production code? If the latter, then you might want to consider using the synchronization objects that are provided in libraries (like pthread) or by the operating system. – Michael Burr, Mar 30 '17 at 06:56
My code is not for production. It is more of a test in order to understand the usage of CAS via __sync_bool_compare_and_swap. – Philippe MESMEUR, Mar 30 '17 at 09:10

score 2 · Answer 1 · answered Mar 30 '17 at 13:17

Use __sync_bool_compare_and_swap to unset the variable as well, instead of just m_isBeingModified = false;. Also, don't implement your own mutex...

Both the compiler and the CPU can reorder code in unintended ways. The __sync primitives act as full memory barriers, which prevents this reordering from happening. Thus, with a plain m_isBeingModified = false;, it could very well be the case that the compiler first sets the variable to false and only then generates the code for whatever you intended to be inside the critical region.
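
To make the suggestion concrete, here is a minimal sketch of acquiring and releasing the flag with the same builtin. The class name `Counter`, the member `m_value`, and the helper `hammer` are hypothetical and only serve the illustration; only `m_isBeingModified` and `__sync_bool_compare_and_swap` come from the question. It assumes GCC (or a compatible compiler) and is not meant as a replacement for a real mutex.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical class: only m_isBeingModified comes from the question.
class Counter
{
public:
  void increment()
  {
    lock();
    ++m_value;  // the critical section
    unlock();
  }

  long value() const { return m_value; }

private:
  void lock()
  {
    // Spin until this thread flips the flag from false to true.
    while (!__sync_bool_compare_and_swap(&m_isBeingModified, false, true))
    {}
  }

  void unlock()
  {
    // Release with the builtin too: it is a full barrier, so the writes of
    // the critical section cannot be moved after the flag is cleared.
    bool wasLocked = __sync_bool_compare_and_swap(&m_isBeingModified, true, false);
    assert(wasLocked);
    (void)wasLocked;  // avoids an unused-variable warning when NDEBUG is set
  }

  bool m_isBeingModified = false;
  long m_value = 0;
};

// Hypothetical stress helper: several threads increment one shared counter.
long hammer(int threadCount, int iterations)
{
  Counter counter;
  std::vector<std::thread> threads;
  for (int t = 0; t < threadCount; ++t)
    threads.emplace_back([&counter, iterations] {
      for (int i = 0; i < iterations; ++i)
        counter.increment();
    });
  for (std::thread& thread : threads)
    thread.join();
  return counter.value();
}
```

If no increment is lost, `hammer(4, 20000)` should return 80000. GCC also provides `__sync_lock_release(&m_isBeingModified)`, which writes 0 with release semantics and may be cheaper than a second CAS.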

Uli Schlachter

Thank you. I am very surprised by the reordering as it also occurs if I declare my attribute as being `volatile`. However, it seems that you are right: if I also use `__sync_bool_compare_and_swap` to unset my variable, my problem disappears. – Philippe MESMEUR Mar 31 '17 at 05:48
What do you mean by "Also, don't implement your own mutex"? Do you mean that I should use the "standard" pthread mutexes or do you have another thing in mind? – Philippe MESMEUR Mar 31 '17 at 05:49
Yes, I mean that you should just use the "standard" pthread mutexes. – Uli Schlachter Mar 31 '17 at 12:19
`volatile` only has meaning to the compiler. The CPU does not see it at all, so it is still allowed to reorder memory accesses around this store. Also, `volatile` doesn't have much meaning in the C standard. I don't really know of a situation where it helps. – Uli Schlachter Mar 31 '17 at 12:21
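
As an illustration of the point about `volatile`, here is a minimal sketch, assuming C++11 is available, in which `std::atomic<bool>` carries the ordering guarantees that `volatile` does not. This is a swapped-in technique, not what the question uses; the names `GuardedTotal`, `m_total` and `addFromThreads` are hypothetical.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical class: the flag is a std::atomic<bool> instead of a volatile bool.
class GuardedTotal
{
public:
  void add(long delta)
  {
    bool expected = false;
    // Acquire on success: the critical section cannot be moved before this.
    while (!m_isBeingModified.compare_exchange_weak(
               expected, true,
               std::memory_order_acquire, std::memory_order_relaxed))
    {
      expected = false;  // a failed exchange overwrote it with the current value
    }

    m_total += delta;  // the critical section

    // Release: the write above is visible before the flag reads false again.
    m_isBeingModified.store(false, std::memory_order_release);
  }

  long total() const { return m_total; }

private:
  std::atomic<bool> m_isBeingModified{false};
  long m_total = 0;
};

// Hypothetical stress helper: several threads add 1 to one shared total.
long addFromThreads(int threadCount, int iterations)
{
  GuardedTotal guarded;
  std::vector<std::thread> workers;
  for (int t = 0; t < threadCount; ++t)
    workers.emplace_back([&guarded, iterations] {
      for (int i = 0; i < iterations; ++i)
        guarded.add(1);
    });
  for (std::thread& worker : workers)
    worker.join();
  return guarded.total();
}
```

The acquire/release pair constrains both the compiler and the CPU, which is exactly what a `volatile` qualifier on the flag cannot do.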

score 0 · Answer 2 · edited May 23 '17 at 10:30

Thanks to Uli's valuable help, I think I now have all the elements to answer my question.

First of all, I may not have been clear so far, but the function I want to protect against concurrent access is very light. It takes around 80 CPU cycles to complete (measured with the TSC). That's why I prefer to implement my own 'light' mutex based on CAS rather than use pthread_mutex.

I found this interesting page that explains how to prevent the compiler from reordering memory accesses across a given point, thanks to the following instruction:

__asm__ __volatile__("":::"memory");

Using it, I really boost the performance of my concurrency protection and, of course, all my tests are still successful.
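
For readers who want to see where the barrier goes, here is a minimal sketch of the "single CAS + barrier" variant. The names `LightLock`, `releaseBarrier`, `m_value` and `runLightLock` are hypothetical. One caveat worth stating: the empty `asm` with a `"memory"` clobber only constrains the compiler and emits no CPU instruction. The sketch assumes that this is sufficient on x86 because of its strong store ordering, and falls back to `__sync_synchronize()` on other architectures.

```cpp
#include <thread>
#include <vector>

// Hypothetical helper: keeps the critical section before the flag reset.
inline void releaseBarrier()
{
#if defined(__x86_64__) || defined(__i386__)
  // Compiler-only barrier; assumed sufficient on x86, where the hardware
  // does not reorder a store with earlier loads and stores.
  __asm__ __volatile__("" ::: "memory");
#else
  // Assumption for weaker memory models: use a full hardware barrier.
  __sync_synchronize();
#endif
}

// Hypothetical class: only m_isBeingModified comes from the question.
class LightLock
{
public:
  void increment()
  {
    // Acquire with a single CAS (a full barrier).
    while (!__sync_bool_compare_and_swap(&m_isBeingModified, false, true))
    {}

    ++m_value;  // the critical section

    releaseBarrier();           // the increment may not sink below this point
    m_isBeingModified = false;  // plain store releases the flag
  }

  long value() const { return m_value; }

private:
  bool m_isBeingModified = false;
  long m_value = 0;
};

// Hypothetical stress helper: several threads increment one shared counter.
long runLightLock(int threadCount, int iterations)
{
  LightLock guarded;
  std::vector<std::thread> workers;
  for (int t = 0; t < threadCount; ++t)
    workers.emplace_back([&guarded, iterations] {
      for (int i = 0; i < iterations; ++i)
        guarded.increment();
    });
  for (std::thread& worker : workers)
    worker.join();
  return guarded.value();
}
```

The plain store after the barrier is what saves the cost of the second CAS; the price is that correctness then depends on the target architecture's memory model.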

To summarize, the following list reports the performance impact of the different solutions I tried:

Original code (without protection): around 80 TSC
Double CAS (set & unset variable): around 105 TSC
Mutex-based solution: around 120 TSC
Single CAS + disable reordering: around 85 TSC
