
I am new to atomics in C++ and am trying to make a tuple from atomic objects. I am getting a compile-time error and I do not understand why. How can I resolve it?

I created this test program:

#include <atomic>
#include <tuple>

int main()
{
    std::atomic<double> a1{0};
    std::atomic<double> a2{0};
    std::atomic<double> a3{0};

    // Parallel processing (pseudocode)
    ParallelFor(...) {
       // update atomic variables.
    }
    std::make_tuple(a1, a2, a3);
    return 0;
}

Compile-time error:

In instantiation of 'constexpr std::tuple<typename std::__decay_and_strip<_Elements>::__type ...> std::make_tuple(_Elements&& ...) [with _Elements = {std::atomic<double>&, std::atomic<double>&, std::atomic<double>&}]':
progatomic.cpp:17:26:   required from here
error: no matching function for call to 'std::tuple<std::atomic<double>, std::atomic<double>, std::atomic<double> >::tuple(std::atomic<double>&, std::atomic<double>&, std::atomic<double>&)'
       return __result_type(std::forward<_Elements>(__args)...);

Thanks

Pirate
    You can't copy atomics, did you perhaps want `std::tie` instead to store references in your tuple? – Alan Birtles Jan 13 '23 at 18:36
  • @AlanBirtles: Or if they just want the current values, adding `.load()` (optionally passing a `std::memory_order`) to each use within `std::make_tuple`, making it `std::make_tuple(a1.load(), a2.load(), a3.load())` (assuming the slow sequential consistency load is okay), but that would allow for races (observing each value at a slightly different point in time, which may or may not be a problem). – ShadowRanger Jan 13 '23 at 18:42
  • @ShadowRanger yep, depends what they want which is why the standard library doesn't let this code compile because it doesn't know what they want either – Alan Birtles Jan 13 '23 at 18:43
  • Though now that I think about it, sequential consistency would probably be useless (because the arguments aren't evaluated in a specified order, so imposing sequential consistency is pointless, you couldn't even say which is guaranteed to be read first), so using acquire semantics would get all the features you can actually rely on, I think. – ShadowRanger Jan 13 '23 at 18:46
  • If they are not copyable, why is this OK? `std::tuple t1(a1, a2, a3);` – Pirate Jan 13 '23 at 19:01
  • @Pirate: They're not copyable, but they do define `operator T` (the cast operator to let them convert to their underlying non-atomic type via an implicit `load`). So if it's unambiguous that they must become a `double`, you're fine, but since you didn't explicitly template `std::make_tuple`, it *derives* the templated types as `std::atomic`, and therefore tries to copy construct. Any situation in which it *knows* it's copying to a plain `double` will be fine. My answer shows both ways to fix this (explicit loads with implicit templating, or implicit loads with explicit templating). – ShadowRanger Jan 13 '23 at 19:03
  • @Pirate: See [std::atomic passed as a const reference to a non-atomic type](https://stackoverflow.com/q/71943060/364696) for an explanation of the implicit casting rules. – ShadowRanger Jan 13 '23 at 19:21

1 Answer


The problem arises from the combination of two things:

  1. Type deduction: You're calling std::make_tuple without explicit template arguments, so it deduces the tuple's element types from the arguments (std::tuple<std::atomic<double>, std::atomic<double>, std::atomic<double>>) and tries to copy-construct those elements from the arguments
  2. Non-copyable types: std::atomic is neither copyable nor movable

There are at least three different ways to fix this:

  1. If you want a tuple of references to the original std::atomics, use std::tie, e.g. std::tie(a1, a2, a3). No actual data is read from the atomics, so you won't get any complaints, but the tuple will now contain references to std::atomic<double>s that may keep changing.

  2. If you want a tuple of the values currently in the atomics, do one of two things so std::make_tuple knows it's making copies of the plain double values, not the atomics themselves. (Note the values are extracted in an unspecified order, since C++ makes no guarantees about the order in which function arguments are evaluated; if another thread is still modifying the atomics, you can't rely on any particular ordering of the reads.) Either:

    1. Explicitly load from the variables: std::make_tuple(a1.load(), a2.load(), a3.load()). For efficiency, since sequential consistency can't actually guarantee the order the arguments are loaded in, you might want to relax the memory ordering requirements explicitly, with std::make_tuple(a1.load(std::memory_order_acquire), a2.load(std::memory_order_acquire), a3.load(std::memory_order_acquire)), or even std::memory_order_relaxed instead. Technically, if the values were stored without std::memory_order_release or stronger, dropping to acquire might allow you to see inconsistent state for non-atomics, but if that's an issue, you could use:

      std::make_tuple(a1.load(std::memory_order_relaxed), a2.load(std::memory_order_relaxed), a3.load(std::memory_order_relaxed));
      std::atomic_thread_fence(std::memory_order_seq_cst);
      

      to get the best of both worlds; no wasted work per load, just a single fence to guarantee nothing written prior to those loads is missed when non-atomics are read afterwards.

    2. Explicitly template make_tuple so the atomics implicitly convert (via load) to the underlying value type, rather than being deduced as std::atomic<double>: std::make_tuple<double, double, double>(a1, a2, a3) (downside: since the loads are now implicit, you can't relax the memory ordering; each one defaults to the most expensive ordering, std::memory_order_seq_cst)

    Each of the two approaches fixes one of the two issues from above: either it removes type deduction through explicit template arguments (preserving the implicit loads), or it removes the copy of a non-copyable type through explicit loads that yield a copyable double (preserving the type deduction). Either one (or both) solves the problem, because the problem only occurs when type deduction and a non-copyable type are involved together.
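    Putting the options together, here's a minimal standalone sketch (just the tuple construction, without the original ParallelFor loop) showing all three fixes compiling side by side:

    ```cpp
    #include <atomic>
    #include <cassert>
    #include <tuple>

    int main() {
        std::atomic<double> a1{1.0}, a2{2.0}, a3{3.0};

        // Fix 1: tuple of references (std::tuple<std::atomic<double>&, ...>);
        // nothing is loaded until you access an element.
        auto refs = std::tie(a1, a2, a3);

        // Fix 2a: explicit loads; make_tuple deduces std::tuple<double, double, double>,
        // and the memory ordering can be relaxed per load.
        auto vals = std::make_tuple(a1.load(std::memory_order_acquire),
                                    a2.load(std::memory_order_acquire),
                                    a3.load(std::memory_order_acquire));

        // Fix 2b: explicit template arguments; each atomic implicitly converts to
        // double via its operator double() (a seq_cst load).
        auto vals2 = std::make_tuple<double, double, double>(a1, a2, a3);

        assert(&std::get<2>(refs) == &a3);     // refs aliases the original atomics
        assert(std::get<0>(vals) == 1.0);      // vals holds plain double snapshots
        assert(std::get<1>(vals2) == 2.0);     // so does vals2
        return 0;
    }
    ```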

ShadowRanger
  • Thank you for the great explanation. I am doing make_tuple outside the parallelization loop. Can you please help me understand the difference between std::make_tuple(a1.load(std::memory_order_acquire), a2.load(std::memory_order_acquire), a3.load(std::memory_order_acquire)) and std::tuple t1(a1,a2,a3); – Pirate Jan 13 '23 at 19:24
  • @Pirate: The latter implicitly does the equivalent of calling `load` with `std::memory_order_seq_cst`, which means a full memory fence occurs three times, once for each `load`. On a modern x86-64 chip, any load ordering below sequential consistency is effectively free thanks to the strongly-ordered memory model, but a full memory fence is at least a few dozen cycles each (Agner Fog's tables are in the 30-40 range). By comparison, a plain load from memory is 0.5-1 cycles on modern chips, so three unfenced loads cost about 1.5-3 cycles, while three fenced loads would be over 100 cycles. – ShadowRanger Jan 13 '23 at 20:33
  • I'm fairly sure the real costs of `mfence` are higher than the tables make it seem (I seem to recall it locks the bus and effectively causes *all* cores to stutter a bit to ensure the data is consistent), but don't quote me on that. Even if it's exactly as cheap as the tables say, it's still many times more expensive than an unfenced load. For non-x86, the costs will differ for even relaxed vs. acquire; the safest solution does involve fencing, but it's definitionally cheaper to do it just once, not three times, so three relaxed loads followed by a single fence is best of both worlds. – ShadowRanger Jan 13 '23 at 20:35
  • In any event, addressing the way memory orders work is a *huge* topic, well beyond the scope of this question. There've been many questions asked on the topic already you can look up, or look up outside resources for what atomic memory orderings mean. The main important thing is that relaxed is generally cheaper than acquire, acquire is generally (much) cheaper than sequential consistency, and *anything* is cheaper than *three* sequentially consistency loads, which is what `std::tuple t1(a1, a2, a3);` involves. – ShadowRanger Jan 13 '23 at 20:39
  • 3 relaxed loads and then a `seq_cst` *fence* is a lot slower than 3 acquire loads or even 3 seq_cst loads on many common ISAs, notably x86 and AArch64 where `seq_cst` loads are cheap. (On AArch64, cheap as long as you didn't do a release or seq_cst store soon before that; `ldar` has to wait for earlier `stlr` to drain from the store buffer, but not earlier plain / relaxed stores). I don't see what problem you could avoid (or synchronization you could create) with a seq_cst fence if the store side only used `relaxed`. There's nothing to sync with in that case, if it didn't use a fence either – Peter Cordes Jan 14 '23 at 02:46
  • x86 `mfence` (or more normally `lock add dword [rsp], 0` because that's faster on real CPUs) does not lock the bus or disturb other cores, it "just" pauses this core until the store buffer drains. That's still quite slow. (Ideally just blocks later loads, but `mfence` on Skylake and later blocks OoO exec of all later instructions after a microcode update: [Which is a better write barrier on x86: lock+addl or xchgl?](https://stackoverflow.com/q/4232660)) – Peter Cordes Jan 14 '23 at 02:51
  • @PeterCordes: I'll admit to being out of practice on atomics, so you're almost certainly correct. In the code shown, I doubt there are non-atomics involved, but yeah, if there are, and there was no effort made to store the atomics with greater than relaxed consistency it might cause a serious problem. If you have a good replacement, I'll trust you to edit it in. I thought *all* uses of sequential consistency caused a fence under the hood on x86 (it was the only level of ordering that did), but it's possible I memorized the general rule and load is special. – ShadowRanger Jan 14 '23 at 02:58
  • One way to think about it is that recovering seq_cst on top of x86's strongly ordered memory model which only allows StoreLoad reordering (and store-buffer / store-forward effects) is that you only need to prevent seq_cst stores from reordering with later seq_cst loads. Only one or the other needs a full barrier, and the sensible choice is to make loads fast; thus we do SC stores with `xchg` (or `mov`+`mfence`), and loads with just plain `mov`. . Weaker operations aren't part of the single total order of SC operations, so e.g. an earlier release store can be allowed to reorder w. an SC load. – Peter Cordes Jan 14 '23 at 03:02
  • Or as https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html shows: on x86 the only thing that needs extra ordering beyond what you'd do anyway is SC store, or of course SC thread_fence. All atomic RMWs need a `lock` prefix so are strong enough for SC. That page also discusses the fact that cheap loads are better than cheap stores. – Peter Cordes Jan 14 '23 at 03:05
  • Anyway, like I said, IDK what problem you're avoiding. If the storing thread stored some other values and then did `a1.store(1.2, relaxed)`, there's no amount of fencing you can do in the reader to make it safe to read non-atomic variables assigned before that `a1.store`. If the writer did do a release or SC store, then 3 relaxed loads and one `fence(acquire)` could be better as an alternative to 3 acquire loads (on some ISAs like PowerPC or maybe 32-bit ARM; break even on x86, worse on AArch64 I think), but I can't think of a case where a `seq_cst` fence would make sense here. – Peter Cordes Jan 14 '23 at 03:10