I think you need to take a step back and first define all your use cases. The merits of __sync vs C11 atomics aside, it's better to define your needs first (i.e. __sync/atomics are solutions, not needs).
The Linux kernel is one of the heaviest, most sophisticated users of locking, atomics, etc. and C11 atomics aren't powerful enough for it. See https://lwn.net/Articles/586838/
For example, you might be far better off wrapping things in pthread_mutex_lock / pthread_mutex_unlock pairs. Making the members of a struct C11 atomic does not give you atomic access to the struct as a whole, only to each member individually. So, if you needed the following three stores to happen as one atomic unit:
glob.x = 5;
glob.y = 7;
glob.z = 9;
You would be better off wrapping this in the pthread_mutex_* pairing. For comparison, inside the Linux kernel, this would be spin locks or RCU. In fact, you might use RCU as well. Note that doing:
CAS(glob.x,5)
CAS(glob.y,7)
CAS(glob.z,9)
is not the same as the mutex pairing if you want an all-or-nothing update: between any two of the CAS operations, another thread can observe glob.x already updated while glob.y and glob.z are not.
I'd wrap your implementation in some thin layer. For example, the best approach might be __sync on one platform [say BSD] and C11 atomics on another. By abstracting this into a .h file with macros/inlines, you can write "common code" without scattering #ifdefs everywhere.
I wrote a ring queue struct/object. Its updater could use CAS [I wrote my own inline asm for this], pthread_mutex_*, kernel spin locks, etc. The actual choice was controlled by one or two #ifdefs inside my_ring_queue.h
Another advantage of abstraction: you can change your mind further down the road. Suppose you made an early pick of __sync or atomics and coded it up in 200 places across 30 files. Then comes the "big oops" where you realize this was the wrong choice, and lots of editing ensues. So, never put a naked [say] __sync_val_compare_and_swap in any of your .c files. Put it in my_atomics.h once, as something like #define MY_CAS_VAL(...) __sync_val_compare_and_swap(__VA_ARGS__), and use MY_CAS_VAL everywhere else.
You might also be able to reduce the number of places that need interthread locking by using thread local storage for certain things like subpool allocs/frees.
You may also want to use a mixture of CAS and lock pairings. Some specific uses fare better with low-level CAS, and others are more efficient with mutex pairs. Again, it helps if you can define your needs first.
Also, consider the final disaster scenario: The compiler doesn't support atomics and __sync is not available [or does not work] for the arch you're compiling to. What then?
In that case, note that all __sync operations can be implemented using pthread_mutex pairings. That's your disaster fallback.