
I am using a few atomic variables, all `unsigned int`s, and I wanted to collect them into a structure, effectively a POD. However, I also want a constructor, because my compiler is not quite C++11 (so I have to define my own constructor to create the struct with initial values).

So originally I had:

// Names are not the real names - this is just for example
std::atomic<int> counter1;
std::atomic<int> counter2;
std::atomic<int> counter3;

And then I was happy to just increment/decrement them as I needed. But then I decided I wanted a few more counters and therefore to put them into a structure:

struct my_counters {
    int counter1;
    int counter2;
    int counter3;
    // Constructor so that I can init the values I want.
    my_counters(int c1, int c2, int c3) : counter1(c1), counter2(c2), counter3(c3) {}
};

But since I have added a custom constructor, this is technically no longer a POD. I was reading other questions on this: some were saying that to use std::atomic I need a POD, but others suggested that the struct just needs to be copyable or some such... anyway, I got confused, and I want to know if I can safely use my struct my_counters as an atomic type:

std::atomic<my_counters> counters;

And then within various threads:

// Are these operations now still atomic (and therefore safe to use across threads):
counters.counter1++;
counters.counter2--;
counters.counter3 += 4;
code_fodder
  • An atomic struct is different from a struct with atomic members. With an atomic struct, you have to copy the **whole** struct for each modification. – Jarod42 May 30 '18 at 10:00
  • What do you mean by "is it ok"? Yes, you can perfectly well create a structure containing several atomic ints. It will behave differently than an int (non-copyable constructor, use of load / store functions, lack of ++ / -- operators). – Clonk May 30 '18 at 10:02
  • 1
    `atomic` doesn't have a `.counter1` member, so `counters.counter1++;` won't compile. You could implement all 3 modifications with a `cmpxchg` loop, but a 3-`int` struct would only be lock-free on a few platforms (like some compilers for x86-64 with `lock cmpxchg16b`) – Peter Cordes May 30 '18 at 10:03
  • 1
    If you have 3 separate atomic objects, don't put them all in the same struct if you want to use them from different threads. Having them all in one cache line will cause false sharing, so threads using `counter1` will contend with threads using `counter2`. (If they're usually all used at the same time, then in the same cache line is *good*, though.) – Peter Cordes May 30 '18 at 10:05
  • If you want to update 3 counters "simultaneously", I see only two ways: 1. mutex-guard the access to (and counting of) the counters (and drop atomic) 2. encode all three counters in one atomic _and_ do the counting in one atomic operation. What the latter could look like, I wrote in the past as an answer for [SO: Atomic operations - C](https://stackoverflow.com/a/44692164/7478597). (It has, of course, certain limitations.) – Scheff's Cat May 30 '18 at 10:14
  • 3
    The requirements for the primary template of `std::atomic` is that `T` is [TriviallyCopyable](http://en.cppreference.com/w/cpp/concept/TriviallyCopyable), not [POD](http://en.cppreference.com/w/cpp/concept/PODType). `my_counters` is TriviallyCopyable – Caleth May 30 '18 at 10:38
  • Ok, so going by the comments and answer - it does feel easier to make the counters individually atomic within a normal struct.. I had not even thought of that : ) – code_fodder May 30 '18 at 10:41
  • @Caleth TriviallyCopyable - that is the word I was looking for : ) – code_fodder May 30 '18 at 10:43
  • 1
    @code_fodder all PODs are TriviallyCopyable, and colloquially people may say POD when they mean "safe to memcpy" – Caleth May 30 '18 at 10:47

2 Answers


Others have said it, but just for clarity, I think you need this:

struct my_counters {
    std::atomic<int> counter1;
    std::atomic<int> counter2;
    std::atomic<int> counter3;
    // Constructor so that I can init the values I want.
    my_counters(int c1, int c2, int c3) : counter1(c1), counter2(c2), counter3(c3) {}
};

And then simply:

my_counters counters;

To put it another way, it's the counters that are atomic, not the struct. The struct just serves to group them together and initialise them.

Edit by Peter

If you use these counters from different threads at the same time, you may want to avoid false-sharing contention between threads by putting each counter in a separate cache line (typically 64 bytes). You can use C++11 `alignas` on the members to get your compiler to pad the struct layout, or manually insert some dummy `char padding[60]` members between the atomics.

Edit by me

Good link about understanding the cache in general here. Worth reading. Intel cache lines seem to be 64 bytes these days, from just a quick bit of googling, but don't quote me.

Another edit by me

A lot has been said in the comments below about the ins and outs of using std::atomic to look after an (arbitrary) class or struct, e.g.

struct MyStruct
{
    int a;
    int b;
};

std::atomic<MyStruct> foo = { };

But the question I have is this: when is this ever useful? Specifically, as ivaigult points out, you can't use std::atomic to mutate individual members of MyStruct in a threadsafe way. You can only use it to load, store or exchange the entire thing and wanting to do that is not that common.

The only legitimate use case I can think of is when you want to be able to share something like (for example) a struct tm between threads in such a way that a thread doesn't ever see it in an inconsistent state. Then, if the struct is small, you might get away without a lock on your particular platform and that is useful. Just be aware of the implications (priority inversion being the most serious, for realtime code) if you can't.

If you do want to share a struct between threads and be able to update individual members in a threadsafe way, then std::atomic doesn't cut it (and nor was it designed to). Then, you have to fall back on a mutex, and in order to do this it is convenient to derive your struct from std::mutex like so:

struct AnotherStruct : public std::mutex
{
    int a;
    int b;
};

And now I can do (for example):

AnotherStruct bar = { };

bar.lock ();
bar.a++;
bar.b++;
bar.unlock ();

This lets you update two (presumably in some way linked) variables in a threadsafe way.

I'm sorry if all this is obvious to the more seasoned campaigners out there but I wanted to clarify things in my own mind. It actually has nothing to do with the OP's question.

Paul Sanders
  • Actually I had not inferred that I could use each counter as an atomic within a normal struct, nice : ) ... Also I did not quite understand Peter's comment exactly with reference to your answer. Do I still need to use padding if I am using the solution that you propose? +1 btw for making this clear to me – code_fodder May 30 '18 at 10:29
  • The answer to that question depends, actually, on how you are going to access data in shared structs and classes across threads in general. It doesn't apply only to your `atomic`s, although it may be more relevant there. There's an article [here](https://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads) about an issue called **false sharing**. The gist of it is that you can optimise flushing of cache lines by adapting the layout and / or padding of your structs to suit the pattern of access to member variables in your various threads. No more space, sorry – Paul Sanders May 30 '18 at 10:42
  • You don't necessarily need to leave the bytes between counters unused, if you want them in separate cache lines. Using it for things that are typically written by the same thread at the same time as the counter is good. (Anything else is potentially bad, though.) Interesting idea to use `alignas()`; that's great for static (i.e. global) vars, but not free for automatic storage (over-aligning the stack costs a few instructions in the function that creates the object.) – Peter Cordes May 30 '18 at 11:06
  • And BTW, no you can't have an atomic struct with atomic members. `atomic` is a non-POD type (no copy constructor), and `atomic` only works for trivially-copyable `T`. As a hack, you could have a *union* between an `atomic` of the whole struct and a struct with `atomic` members. But it's a real hack and not portably safe. [How can I implement ABA counter with c++11 CAS?](https://stackoverflow.com/a/38991835) – Peter Cordes May 30 '18 at 11:08
  • Sorry, stepped on your edit. Yes, 64-byte lines are widespread. It's not a coincidence that's the max burst-transfer size of DDR1/2/3/4 SDRAM. – Peter Cordes May 30 '18 at 11:16
  • @PeterCordes NP, tidied it up. – Paul Sanders May 30 '18 at 11:17
  • @PaulSanders & Peter: So after reading the false sharing article, it sounds like this falls under "optimisation" of code, which as you mention is good practice. I am thinking that for low volume (as in a few increments here and there), this is probably not an issue. If this was work that needs to do lots of increments/decrements per second then I should definitely consider padding this out... is that about right? (or just don't be so lazy and do it anyway!) - the reason I am reluctant is that this code would go on various architectures (ARM, Intel, etc...).... – code_fodder May 30 '18 at 11:56
  • If you're happy with the performance you're getting then I wouldn't worry too much. It [the cache] is just a good thing to know about in general - it can have a big effect on tight loops as the link I posted in my edit makes clear. And the cost of a thread-switch is much higher than the cost of flushing a cache line, so bear that in mind too. Also, if you're interested in all this, read up on [locality of reference](https://stackoverflow.com/questions/7638932/what-is-locality-of-reference). That's one reason why placement `new` exists. – Paul Sanders May 30 '18 at 12:09
  • @code_fodder: `struct foo { alignas(64) atomic_int a; alignas(64) atomic_int b; };` is fully portable C++11 and puts the vars in separate cache lines on everything with a line size up to 64. Skipping a cache line is not a problem. This will definitely improve scalability. If you're only using atomics because they're more convenient than lock/unlock then having extra contention between threads that write the values is probably fine. But if you're doing it for performance reasons, then you will get better scalability to more threads / cores if you only put related things in a cache line. – Peter Cordes Jun 01 '18 at 06:51
  • @PaulSanders: You're right that `atomic` is usually the wrong choice if you want to update a single member separately. But you're slightly overstating the case that it's impossible to update a single member: Using a `do { tmp.a += 1; } while(!shared.compare_exchange_weak(old, tmp));` retry loop (which will almost always succeed without a retry) you can atomically RMW or set a member of an object while atomically storing back the other values unchanged. This is obviously horrible for large non-lock-free objects, but if the members are often read together it may be worth it if lock-free – Peter Cordes Jun 01 '18 at 06:56
  • i.e. a CAS retry loop can implement any modification you want to an atomic object. e.g. implement [atomic `+=` or atomic `sqrt` for `atomic`](//stackoverflow.com/q/45055402). But it's not efficient for non-lock-free atomic objects, because current implementations don't recognize the pattern and turn it into an acquire the lock / do the mutation / release the lock. Instead you'd get multiple acquire/releases of the lock for load and CAS. – Peter Cordes Jun 01 '18 at 07:00
  • @PeterCordes - I wouldn't say the alignas approach above is exactly portable because whether extended alignments are supported is implementation defined (or unspecified or something like that). The implementations I've checked largely don't fully support alignments greater than 16 (eg via new). Although I suppose the object will in any case still be padded the way you want, so it should work to keep the two members on separate lines (although possible sharing lines with objects on either side). – BeeOnRope Jun 01 '18 at 16:55
  • Sometimes the L2 adjacent line-prefetch makes it seem like the cache line size is 128 bytes, at least with respect to L2 and higher levels. For example, some cache line detection tools detect the line size to be 128 bytes. I'm not sure if this causes additional false sharing though: perhaps there is some predictor that turns off this behavior under contention or something, or perhaps the prefetch is "weak": ignored at the L3 is the line is already owned elsewhere. – BeeOnRope Jun 01 '18 at 16:57
  • @BeeOnRope: Oh right, I forgot that alignments larger than `alignof(maxalign_t)` were an optional feature, but hopefully it will still give the desired object layout. I was picturing a `static` / global use-case, though, where it's trivial to support the actual alignment, too. Having dynamic allocation in `std::vector` or whatever respect `alignof()` is a C++17 feature. And with more than 2 vars spacing them each by 64B means no more than 2 per 128B, so vastly less contention than with 16 per 64B even if there's some effect wider than 64B. – Peter Cordes Jun 01 '18 at 23:53
  • Yeah @PeterCordes it's really good that C++17 makes supporting it required or at least gives a way to query it or something IIRC. Anyways, I checked around on gcc, clang, icc and MSVC and it seems all do the correct alignment for stack variables (I'm just going to assume it works for globals too, at least up to the file format/dynamic loader limitations, even though that's a completely different mechanism). – BeeOnRope Jun 02 '18 at 00:05

In most cases std::atomic is pointless for structures, because you will end up copying the entire structure for every change:

std::atomic<my_counters> var(my_counters(1, 2, 3));
my_counters another_var = var.load(); // atomic copying
another_var.counter1++;
var.store(another_var); // atomic copying

Moreover, load and store are separate operations, so we can't guarantee that var.counter1 is 3 for two threads executing the code above.

Also, if your target CPU doesn't support atomic operations for structures of this size, std::atomic will fall back to using a mutex:

#include <atomic>
#include <iostream>

struct counters {
    int a;
    int b;
    int c;
};

int main() {
    std::atomic<counters> c;
    std::atomic<int> a;

    std::cout << std::boolalpha << c.is_lock_free() << std::endl;
    std::cout << std::boolalpha << a.is_lock_free() << std::endl;
    return 0;
}

Demo

You can see in the demo that std::atomic<counters> uses a mutex internally.

So you would do better to have std::atomic<int> as class members, as Paul suggests.

ivaigult
  • 2
    not-so-fun fact: gcc7 and later on x86-64 report false for `is_lock_free` on 16-byte atomic objects, but don't use a mutex. `libatomic` uses `lock cmpxchg16b` if it's available. [is\_lock\_free() returned false after upgrading to MacPorts gcc 7.3](https://stackoverflow.com/q/49816855). But yes, there are huge downsides to atomic structs larger than 8 bytes on x86, or larger than the register width on most other architectures. (On x86 in 32-bit mode, 8-byte atomic objects are reasonably efficient with gcc, but not clang.) – Peter Cordes May 30 '18 at 13:39
  • @PeterCordes Yes they're broken, basically. – Paul Sanders May 30 '18 at 13:50
  • 1
    @PaulSanders: What's broken? atomic structs? Not if you use them correctly for the rare cases where they're useful. If you mean gcc7/8, no, there are valid reasons for report 16-byte structs as not lock-free. e.g. pure-load and pure-store are slow, and multiple readers contend with each other. So it's lock free but not *fast*, which is what you usually really want to know. See also [Genuinely test std::atomic is lock-free or not](https://stackoverflow.com/q/49848793). (Using `lock cmpxchg16b` for a 16-byte load requires having the cache line in M state, not shared). – Peter Cordes May 30 '18 at 14:12
  • @PeterCordes I mean the API is broken. Much more useful (in the cases where locking is needed anyway) would be `my_atomic_struct.lock();` ... mutate the object in place ... `my_atomic_struct.unlock();`, but you can't do that. Instead, you have to use a mutex. And I see that ivaigult says all that in his post anyway. – Paul Sanders May 30 '18 at 17:30
  • 1
    @PaulSanders: So you wish there was an API for using the hash-table of locks that non-lock-free atomic structs use internally? But how would your `.lock()` API work for lock-free atomic objects? It wouldn't only be possible to compile it into a `lock cmpxchg` loop (or Intel TSX transaction) if the compiler can see the `.lock()` and the `.unlock()` start and end of the transaction. So maybe the API could take a lambda to be done as an atomic transaction, or something. That might be a good way to let the compiler choose between a mutex vs. a `compare_exchange` loop depending on the target. – Peter Cordes May 30 '18 at 20:30
  • @PeterCordes Actually, my first comment was a bit hasty. I have since realised that you can just derive your struct (or class) from `std::mutex`. Then can do what I describe in my second comment, no problem. But I think the fact remains that `std::atomic` is only really useful for primitive types. I think everyone posting to this thread holds that view. Shall we move all this to chat? – Paul Sanders May 30 '18 at 21:29
  • @PaulSanders: `std::atomic` is useful for a pair of `int`s that go together (like `int max,idx;`), or a pointer + counter, on targets where it's lock-free. Non-lock-free atomic objects are rarely useful, except to let the same source compile on other targets. (See [How can I implement ABA counter with c++11 CAS?](https://stackoverflow.com/a/38991835) for a pointer+counter hack. It can compile very efficiently on 64-bit ILP32 ABIs, like AArch64 ILP32 or x86-64 x32: 32-bit pointers with 64-bit registers. It needs a union hack to be useful in x86-64 with 64-bit pointers.) – Peter Cordes May 30 '18 at 23:37
  • @PaulSanders: Or for a timestamp with separate seconds and nanoseconds members. – Peter Cordes May 30 '18 at 23:39
  • 1
    @PeterCordes Maybe 16-byte structures have "huge" downsides compared to 8-byte for some operations, but the gap between "16 byte lock-free structures that sometimes need cmpxchg16 to read them" and "mutex-using structures" can be absolutely huge for algorithms that care. For some algorithms with 16-byte structures that need to sometimes be modified atomically, fast paths may actually only read an 8-byte part of the structure which can be done quickly (current compilers too stupid, [however](https://godbolt.org/g/DurStt) - and wtf is icc doing there?). – BeeOnRope May 31 '18 at 02:43
  • In other algorithms, two consecutive 8-byte reads (with a load-load barrier) in the right order might be enough (or just do a 16-byte read at this point). I don't think you can even express that last case with `std::atomic<>` - once the structure itself is `std::atomic<>` the members can't be, so you can't safely read the members in a race-free way in C++, but at the assembly level this would be fine. – BeeOnRope May 31 '18 at 02:48
  • @BeeOnRope: right, on x86 (and I assume most ISAs) it's safe to read only the 8 bytes you need of a 16-byte lock-free object, but no current compilers know that (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80835), which is why `std::atomic` is pretty crap for some uses. I think it's safe even if you need more than relaxed memory ordering. Paul is not wrong that pure C++11 isn't very good here, and you need to do non-portable stuff unions that take advantage of knowledge of implementation details, or write asm. – Peter Cordes May 31 '18 at 03:23
  • @PeterCordes I think it's safe on _every_ arch that does atomic aligned loads, which I think is pretty much "all of them" (you can't implement most runtimes with a memory model like Java efficiently without them). I agree also that it's safe if you need more than a relaxed ordering: basically I think you just need whatever barriers are needed for the same `std::mo` on an `std::atomic` of whatever the smaller size is. Indeed, it's exactly like you loaded the whole 16 bytes and then just threw away the other bytes without looking at them. – BeeOnRope May 31 '18 at 03:32
  • @BeeOnRope: The only case that worried me was store-forwarding to a reload within the same thread, but I guess `lock cmpxchg16b` doesn't just go in the store buffer anyway. And we can assume 16-byte aligned so store-forwarding should always work to either aligned half. I guess if anything I should have been *more* worried about store-forwarding from normal 8-byte stores to 8-byte reloads, because the store buffer does weird things to memory ordering of loads. ([Globally Invisible load instructions](https://stackoverflow.com/q/50609934) / http://yarchive.net/comp/linux/store_buffer.html) – Peter Cordes May 31 '18 at 03:40
  • Let's say forwarding does something terrible (one of the cases you are thinking of and arguing may not occur) - I don't see how this can be problematic? Can you give an example? Note that direct small reads like this don't work if the underlying object is being managed by a lock. So I guess the compilers that are using library calls like `__atomic_load16` are not going to start inlining 8-byte reads even if they got "smart" if they wanted to keep open the option of changing the underlying methods in the runtime shared library. – BeeOnRope May 31 '18 at 03:55
  • @PeterCordes Yes, understood. I can see that in certain cases a `std::atomic` has value. But one needs to know the implications for larger structs, I think that's what I was trying to say. Once `std::atomic ` needs to use a mutex, its behaviour changes radically. – Paul Sanders May 31 '18 at 08:30
  • @PaulSanders: totally agreed; non-lockfree `atomic` is very different and mostly useful as a fallback for portability of code written with one platform in mind (where the struct is lock-free). The existence of `is_always_lock_free` (C++17 IIRC) looks like it's designed to allow compile-time choice between std::atomic vs. a different strategy. – Peter Cordes May 31 '18 at 19:18
  • @BeeOnRope: I'm not currently worried about the correctness of partial loads of atomic objects. I think a load-acquire of the top half of an object will still synchronize-with a release-store of the whole object on any sane ISA. My past worries were based on not understanding memory ordering well enough to be sure there was no problem (and a misunderstanding about x86 store/reload). – Peter Cordes May 31 '18 at 19:24
  • @PeterCordes - this is only true if the object is managed in a lock-free manner by the implementation, right? If it decides to use locks, the direct read of members may break the ordering guarantees (even with relaxed, IMO, but it would take an odd implementation: consider one that decided to use 2x 4-byte writes under the lock). – BeeOnRope May 31 '18 at 19:27
  • `atomic` is pretty cool for at least two reasons: for small structs, like 8 bytes or less that will be lock-free, it lets you write some algorithms in a more natural way when the struct is logically composed of a few smaller values that you want to name and access directly, rather than "stuffing bitwise" everything into a `uint64_t` or whatever (and the compiler is sometimes better at struct access than the equivalent bit-wise shuffling). The other case is if the shared lock implementation is good, it's actually nice to use that rather than rolling your own! – BeeOnRope May 31 '18 at 19:29
  • 1
    People are used to using a mutex per protected object model, but this "global set of shared locks" has a lot of value too, especially for small, common objects. – BeeOnRope May 31 '18 at 19:30
  • @BeeOnRope: Yes, I meant to say "lock-free atomic objects", of course. Semi-related: `atomic` is just bad on current implementations: reading one `int` of a *large* atomic struct would currently take the lock and memcpy the whole struct (in `__atomic_load` with a variable size arg), instead of just reading or writing the one `int` while the lock is held. A compiler *could* optimize a load / modify-one-member / store into modifying that single member while the lock was held. But they don't, so that use-case for `atomic` is not currently viable. – Peter Cordes May 31 '18 at 20:05
  • @PeterCordes - yeah, and the compiler vendors are kind of locking themselves into that model by doing everything in a separate function call. E.g., for large structures they all seem to call an `__atomic_load` call which copies the whole thing, and then they extract the one they want. I guess that lets them hide the details of the locking, and possibly implement it in a hardware-specific way and/or change the mechanism only by updating the library, but that makes it harder to inline smart stuff. I guess they could always have a partial load implementation of `__atomic_load` that takes ... – BeeOnRope May 31 '18 at 20:16
  • ... an offset and returns part of the object. ( BTW, `icc` generates some [awful code](https://godbolt.org/g/Vhs4B3) for this. A whole inlined memcpy implementation doing something pointless. ) It's still early days yet though, maybe the implementations will get better. – BeeOnRope May 31 '18 at 20:16
  • @BeeOnRope: I just single-stepped through `__atomic_store` for a 32-byte object (https://godbolt.org/g/9gCfsR), as compiled by `g++` 7.3 on Arch Linux. It's pretty depressing how much extra work it does. So much vzeroupper, and hashing the address twice (in lock and unlock). Then `pthread_mutex_lock()` has to decide whether to use lock-ellision or not by branching on a static variable. At least two copies of the whole object are done; `__atomic_store` seems to take the object by value? Or at least gcc and clang decide to make an extra copy on the stack. Not as bad as ICC, but still. – Peter Cordes May 31 '18 at 20:34
  • @PeterCordes - oh right, they are using an array of `pthread_mutex_lock()`? Boo. It hardly seems like they need the generality of those locks. Yeah the codegen [is weird](https://godbolt.org/g/a8q7mw). Indeed `store(T desired)` takes its argument by value. Seems weird? In principle perhaps the compilers could elide the copy, but they don't. In `store8` which puts a temp on the stack they mostly make two copies (gcc is a bit better, though it's hard to tell exactly what it's doing). – BeeOnRope May 31 '18 at 20:58
  • @BeeOnRope: looks very little effort has been put into making non-lock-free atomic objects good on gcc's `libatomic`, especially large ones. LLVM's own implementation apparently falls back to a hand-rolled spinlock on platforms other than Apple / FreeBSD (links to code from [Where is the lock for a std::atomic?](https://stackoverflow.com/q/50298358)), but normally on Linux clang uses the system-wide `libatomic` provided by gcc. – Peter Cordes May 31 '18 at 21:03