I think you need to take a step back and first define all your use cases. The merits of __sync vs C11 atomics aside, it's better to define your needs first (i.e. __sync/atomics are solutions, not needs).
The Linux kernel is one of the heaviest, most sophisticated users of locking, atomics, etc. and C11 atomics aren't powerful enough for it. See https://lwn.net/Articles/586838/
For example, you might be far better off wrapping things in pthread_mutex_lock / pthread_mutex_unlock pairs. Making the members of a struct C11 atomic does not give you atomic access to the struct as a whole, only to each member individually. So, if you needed the following three stores to happen as one atomic unit:
glob.x = 5;
glob.y = 7;
glob.z = 9;
You would be better off wrapping this in the pthread_mutex_* pairing. For comparison, inside the Linux kernel, this would be spin locks or RCU. In fact, you might use RCU as well. Note that doing:
CAS(glob.x,5)
CAS(glob.y,7)
CAS(glob.z,9)
is not the same as the mutex pairing if you want an all-or-nothing update: between any two of the CAS operations, another thread can observe glob.x already updated while glob.y and glob.z are not.
I'd wrap your implementation in some thin layer. For example, the best approach might be __sync on one platform [say BSD] and C11 atomics on another. By abstracting this into a .h file with macros/inlines, you can write "common code" without scattering #ifdefs everywhere.
I wrote a ring queue struct/object. Its updater could use CAS [I wrote my own inline asm for this], pthread_mutex_*, kernel spin locks, etc. The actual choice was controlled by one or two #ifdefs inside my_ring_queue.h
Another advantage of abstraction: you can change your mind further down the road. Suppose you made an early pick of __sync or atomics and coded it up in 200 places across 30 files. Then comes the "big oops" where you realize this was the wrong choice, and lots of editing ensues. So, never put a naked [say] __sync_val_compare_and_swap in any of your .c files. Put it in my_atomics.h once, as something like #define MY_CAS_VAL(...) __sync_val_compare_and_swap(__VA_ARGS__), and use MY_CAS_VAL everywhere else.
You might also be able to reduce the number of places that need interthread locking by using thread local storage for certain things like subpool allocs/frees.
You may also want to use a mixture of CAS and lock pairings. Some specific uses fare better with low-level CAS, and others are more efficient with mutex pairs. Again, it helps if you can define your needs first.
Also, consider the final disaster scenario: The compiler doesn't support atomics and __sync is not available [or does not work] for the arch you're compiling to. What then?
In that case, note that all __sync operations can be implemented using pthread_mutex pairings. That's your disaster fallback.