4

I have a multithreaded application where I have one producer thread (main) and multiple consumers.

Now from main I want to have some sort of percentage of how far into the work the consumers are. Implementing a counter is easy, since the work is done in a loop. However, since this loop repeats a couple of thousand times, maybe even more than a million times, I don't want to use a mutex for this part. So I went looking into atomic options for writing to an int.

As far as I understand I can use the builtin atomic functions from gcc: https://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html

However, it doesn't have a function for just reading the variable I want to work on.

So basically my question is:

  1. Can I read from the variable safely in my producer, as long as I use the atomic builtins for writing to that same variable in the consumers?

or

  2. Do I need some sort of different function to read from the variable, and if so, what function is that?
  • 1
    Couldn't you use `__sync_fetch_and_add(ptr, 0)` to read `ptr`? – i_am_jorf Jun 10 '14 at 19:19
  • 1
    You can safely read a 32 bit integer variable on x86 without using atomics. You might need to declare the variable as `volatile` so that the compiler does not optimize the reads away. – markgz Jun 10 '14 at 19:26
  • 5
    @markgz That is so wrong. – nwp Jun 10 '14 at 19:37
  • Atomicity is an overloaded term. One thing is reading a value in one go, guaranteeing it is not, for example, cacheline split or page split, which could otherwise fetch partial bytes from different observation times. Another thing is making sure these reads are synchronized with other reads/writes using some memory ordering model. – Leeor Jun 10 '14 at 19:46
  • 2
    @markgz To elaborate a bit on nwp's comment... If you had said you could safely read an 8-bit integer variable without using atomics, it would be a lot easier to believe (although I'm still not 100% convinced of that in the completely general case). But 32-bit values could easily be split across cacheline or page boundaries, or simply misaligned, as hinted at by Leeor, which makes that safety very much not true for any variable involving more than one byte. – twalberg Jun 10 '14 at 19:52
  • 3
    @twalberg That is not quite the point. x86 guarantees atomic reads and writes of integers unless they are located on different cache lines. But C and compilers do not. They will assume no data race will happen and optimize based on that assumption making the optimizations wrong. Same with signed integer overflows. As far as I know every single platform in existence will just loop around. C says it is undefined behavior which will break the code. See [this paper](http://usenix.org/event/hotpar11/tech/final_files/Boehm.pdf) for why data races break code even if two threads write the same value. – nwp Jun 10 '14 at 20:56
  • The Intel Software Developer's Manual, vol. 3, section 8.1.1 on page 8-2 says "the 486 and newer processors guarantee that reading [...] a [naturally aligned] 32-bit integer will be atomic." In addition, P6 and newer processors guarantee atomicity for a misaligned 32-bit integer that does not cross a cache line boundary. – markgz Jun 10 '14 at 21:13
  • @markgz That's what I wrote. The code will still not work even on those platforms if the compiler optimizes the code. – nwp Jun 10 '14 at 21:19

4 Answers

3

Define "safely".

If you just use a regular read, on x86, for naturally aligned 32-bit or smaller data, the read is atomic, so you will always read a valid value rather than one containing some bytes written by one thread and some by another. If any of those things are not true (not x86, not naturally aligned, larger than 32 bits...) all bets are off.

That said, you have no guarantee whatsoever that the value read will be particularly fresh, or that the sequence of values seen over multiple reads will be in any particular order. I have seen naive code that used volatile (to stop the compiler optimising the read away entirely) but no other synchronisation mechanism, and it literally never saw an updated value due to CPU caching.

If any of these things matter to you, and they really should, you should explicitly make the read atomic and use the appropriate memory barriers. The intrinsics you refer to take care of both of these things for you: you could call one of the atomic intrinsics in such a way that there is no side effect other than returning the value:

__sync_val_compare_and_swap(ptr, 0, 0)

or

__sync_add_and_fetch(ptr, 0)

or

__sync_sub_and_fetch(ptr, 0)

or whatever
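
For example, a minimal sketch of this approach (the counter name, pthread scaffolding and work count below are made up for illustration): the consumer bumps the counter with `__sync_fetch_and_add`, and the producer reads it via a no-op `__sync_add_and_fetch`.

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static long progress = 0;                /* shared counter, written by the consumer(s) */
    static const long total_work = 1000000;  /* illustrative amount of work */

    static void *consumer(void *arg)
    {
        (void)arg;
        for (long i = 0; i < total_work; ++i) {
            /* ... one unit of work ... */
            __sync_fetch_and_add(&progress, 1);   /* atomic increment, full barrier */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, consumer, NULL);

        long seen;
        do {
            /* atomic read with no net side effect: add 0 and fetch the result */
            seen = __sync_add_and_fetch(&progress, 0);
            printf("progress: %ld%%\n", seen * 100 / total_work);
            usleep(100 * 1000);
        } while (seen < total_work);

        pthread_join(t, NULL);
        return 0;
    }

(Compile with something like `gcc -pthread`. Whether the no-op read-modify-write is worth its cost compared to a plain load plus a barrier is exactly the trade-off discussed in the comments.)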

moonshadow
  • Cache coherency on Intel processors guarantees that if any processor writes to a shared line, that line will be invalidated in the sharing processor's caches and the updated value will be read the next time a sharing processor reads that line. – markgz Jun 10 '14 at 21:19
  • 1
    @markgz x86 guarantees will not translate to guarantees for C. – nwp Jun 10 '14 at 21:20
  • @nwp could you explain why not? – markgz Jun 10 '14 at 21:22
  • 3
    @markgz [Example](http://ideone.com/3z0RaK): `int i = INT_MAX; if (i+1 < i) printf("overflow");` does not necessarily print "overflow". The condition `i+1 < i` can never be true in C, because an overflow is UB. x86 guarantees that INT_MAX +1 == INT_MIN. C does not. C on x86 does not. If you try the same thing with an `unsigned int` and `UINT_MAX` you get the overflow because that is defined. Same with thread safety of ints. x86 guarantees it. C on x86 does not. See [this paper](http://usenix.org/event/hotpar11/tech/final_files/Boehm.pdf) for details of what compilers can do with data races. – nwp Jun 10 '14 at 21:36
2

If your compiler supports it, you can use C11 atomic types. They are introduced in section 7.17 of the standard, but they are unfortunately optional, so you will have to check whether `__STDC_NO_ATOMICS__` is defined in order to at least throw a meaningful error if they are not supported.

With gcc, you apparently need at least version 4.9, because otherwise the `<stdatomic.h>` header is missing (there is an SO question about this, but I can't verify because I don't have GCC 4.9).
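
A minimal sketch of what that looks like, assuming `<stdatomic.h>` is available (the variable and function names are only illustrative):

    #ifdef __STDC_NO_ATOMICS__
    #error "this implementation does not provide C11 atomics"
    #endif

    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_long progress;          /* zero-initialised shared progress counter */

    /* consumer side: one atomic increment per unit of work */
    void work_done(void)
    {
        atomic_fetch_add(&progress, 1);
    }

    /* producer side: a plain atomic load is enough to read the counter */
    long read_progress(void)
    {
        return atomic_load(&progress);
    }

    int main(void)
    {
        work_done();
        printf("progress so far: %ld\n", read_progress());
        return 0;
    }

(Needs `-std=c11` and, as noted above, GCC 4.9 or later.)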

cmaster - reinstate monica
0

I'll answer your question, but you should know upfront that atomics aren't cheap. The CPU has to synchronize between cores every time you use atomics, and you won't like the performance results if you use atomics in a tight loop.

The page you linked to lists atomic operations for the writer, but says nothing about how such variables should be read. The answer is that your other CPU cores will "see" the updated values, but your compiler may "cache" the old value in a register or on the stack. To prevent this behavior, I suggest you declare the variable volatile to force your compiler not to cache the old value.

The only safety issue you will encounter is stale data, as described above.

If you try to do anything more complex with atomics, you may run into subtle and random issues with the order atomics are written to by one thread versus the order you see those changes in another thread. Unfortunately you're not using a built-in language feature, and the compiler builtins aren't designed perfectly. If you choose to use these builtins, I suggest you keep your logic very simple.
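
As a rough sketch of what this answer suggests (the names are illustrative, and see the comments below for objections to relying on `volatile`):

    #include <stdio.h>

    /* volatile only stops the compiler from caching the value in a register;
       the atomicity of the update comes from the __sync builtin the writer uses */
    static volatile int progress = 0;

    /* consumer: atomic increment via the GCC builtin */
    void work_done(void)
    {
        __sync_fetch_and_add(&progress, 1);
    }

    /* producer: plain read of the volatile variable (may be slightly stale) */
    int read_progress(void)
    {
        return progress;
    }

    int main(void)
    {
        work_done();
        printf("%d\n", read_progress());
        return 0;
    }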

Sophit
  • `volatile` is not sufficient: it affects only the compiler, which in a modern multicore system with weakly consistent memory is the least of your problems. As a side effect of dealing with the other problems, `volatile` becomes superfluous. http://stackoverflow.com/questions/2484980/why-is-volatile-not-considered-useful-in-multithreaded-c-or-c-programming – moonshadow Jun 10 '14 at 19:43
  • 1
    Yes, I covered that. The GCC atomic builtins are poorly designed. The guarantees provided by `volatile` are sufficient for his very simple use case when combined with the GCC atomics. For more complicated use cases he'll need to use a properly designed system. If you have a better alternative to `volatile` for use with GCC builtins, please share. – Sophit Jun 10 '14 at 19:46
  • Hmm, for my use case this should be enough, but since I want to use this in a production application I want to use the 'best' way possible. Is there some alternative you can suggest to me? –  Jun 10 '14 at 20:19
-1

If I understood the problem correctly, I would not use any atomic variable for the counters. Each worker thread can have a separate counter that it updates locally; the master thread can read the whole array of counters for an approximate snapshot value, so each counter becomes a 1-producer, 1-consumer problem. The memory can be made visible to the master thread, for example every 5 seconds, by using __sync_synchronize() or similar.
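
A rough sketch of that idea (the worker count, work amount and polling interval below are made up for illustration):

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    #define NUM_WORKERS 4
    #define WORK_PER_THREAD 1000000L

    static long counters[NUM_WORKERS];   /* one counter per worker, written only by its owner */

    static void *worker(void *arg)
    {
        long id = (long)arg;
        for (long i = 0; i < WORK_PER_THREAD; ++i) {
            /* ... one unit of work ... */
            counters[id]++;              /* plain write; only this thread writes this slot */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[NUM_WORKERS];
        for (long i = 0; i < NUM_WORKERS; ++i)
            pthread_create(&threads[i], NULL, worker, (void *)i);

        for (int poll = 0; poll < 10; ++poll) {
            sleep(1);
            __sync_synchronize();        /* full barrier before taking the snapshot */
            long total = 0;
            for (int i = 0; i < NUM_WORKERS; ++i)
                total += counters[i];    /* approximate snapshot, as described above */
            printf("roughly %ld of %ld units done\n",
                   total, (long)NUM_WORKERS * WORK_PER_THREAD);
        }

        for (int i = 0; i < NUM_WORKERS; ++i)
            pthread_join(threads[i], NULL);
        return 0;
    }

(The master's reads are still races in the strict C sense, which is why the snapshot is only approximate, as the answer says.)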

hdante