10

Using relaxed memory order, e.g. for a reference-counting pointer, would the compiler be allowed to optimize away a subsequent increment and decrement?

std::atomic_int32_t ai;
for (size_t i = 0; i < 10000; i++)
{
    ai.fetch_add(1, std::memory_order_relaxed);
    ai.fetch_sub(1, std::memory_order_relaxed);
}

Looking at the disassembly, it doesn't look like it. But since reordering is allowed, and the atomic behaves like a counter that merely happens to be thread-safe, one could argue that the compiler may optimize it as if it were a plain int.

Alex Lop.
  • 6,810
  • 1
  • 26
  • 45
Borph
  • 842
  • 1
  • 6
  • 17
  • Don't you already have your answer if you see that it is not optimized? – rustyx Aug 04 '16 at 11:29
  • 7
    @rustyx: That's not how it works. Just because the OP's implementation did not optimise it _this time_, doesn't mean such an optimisation is prohibited. – Lightness Races in Orbit Aug 04 '16 at 11:32
  • Tried to answer this, but I'm not sufficiently familiar with the terminology in §29.3. Looking forward to the answer! – Lightness Races in Orbit Aug 04 '16 at 11:42
  • I think the compiler can or could optimize this away unless you bring in explicit memory fencing. But again, I wouldn't say anything with full certainty when it comes to atomics :) – Arunmu Aug 04 '16 at 11:56
  • Perhaps this link would give some more insight. http://www.playingwithpointers.com/ipo-and-derefinement.html – Arunmu Aug 04 '16 at 11:57
  • 29.3 has *Implementations must still guarantee that any given atomic access to a particular atomic object be indivisible with respect to all other atomic accesses to that object.* but I am unsure what that actually means. – NathanOliver Aug 04 '16 at 12:03
  • I think that the compiler cannot optimize it out. The reason is that other threads can access `ai` asynchronously. In this case, asynchronous access to `ai` before, between and after the `add..sub` will have a visible effect. Thus the compiler cannot optimize out such code. – Alex Lop. Aug 04 '16 at 12:07
  • @AlexLop. https://www.reddit.com/r/cpp/comments/4t9sj1/interprocedural_optimization_and_derefinement/ – Arunmu Aug 04 '16 at 12:10
  • @AlexLop.: Incorrect. Yes, there can be asynchronous access, but the fact that `ai` is atomic merely means that the asynchronous access is legal and returns _a_ past value of `ai`. (This is of course a weaker guarantee than a full-fledged mutex, but that's the point: you trade off speed for a weaker guarantee). The hypothesized optimization does not break this guarantee. In fact, the optimization would even be possible under a full mutex lock. – MSalters Aug 04 '16 at 12:27
  • @MSalters You are probably correct if the statement about the "weaker guarantee" is true by definition. One has to be a language lawyer in order to prove whether the compiler is (or is not) allowed to optimize in such a case... – Alex Lop. Aug 04 '16 at 12:31
  • @AlexLop. : Not true by definition, but still fairly obvious. Locking and unlocking a mutex gives you a memory barrier, plus it causes the second thread to be sequenced-after the first thread. So that means the second thread sees the **last** value written by the first thread. With an atomic variable, there's the chance of seeing an older value, but it still has to be a value that the variable once held. – MSalters Aug 04 '16 at 12:37

2 Answers

5

I believe it can be optimized away, unless the variable is declared volatile. The reason is that for any schedule that interleaves some other thread in between, there exists a valid schedule that does not. I believe this also holds under the DRF-SC memory model.

This would not be the case if this thread read something in between.

Elazar
  • 20,415
  • 4
  • 46
  • 67
  • 1
    I agree. In addition, with relaxed memory ordering there are no guarantees when modification will became visible for other threads and you cannot prove that you don't just never caught moment between two changes. If there was aquire-release semantic or other synchronization, then it is another case. – Revolver_Ocelot Aug 04 '16 at 12:16
  • 2
    @Revolver_Ocelot: Even acquire-release wouldn't be sufficient. There is a trivial thread scheduling where all 10,000 increments and 10,000 decrements happen on this thread before any other thread runs a single instruction. (That's even a realistic thread scheduling on a single-core processor.) – MSalters Aug 04 '16 at 12:29
  • ISO C++ as written allows it, but compilers currently *won't*. If this is the count for a hand-written lock, it would be bad if a compiler destroyed fairness by removing the chance for another thread to take the lock between an unlock and re-lock in one thread. See also [Why don't compilers merge redundant std::atomic writes?](https://stackoverflow.com/q/45960387) - compilers treat all atomic accesses pretty much like volatile, and will probably keep doing so until there's a better-defined way to control this in the cases that are harder. (Here `volatile atomic` is certainly enough.) – Peter Cordes Jul 10 '22 at 10:17
-3

The compiler cannot optimize away atomics, since that would violate what they are there for. It has to assume another thread might touch the value as well, so removing the operations is not allowed.

It also cannot optimize/reorder operations that are 'visible' to the C++ code from before to after (or the other way around), since the atomics act as memory barriers.

oreubens
  • 329
  • 1
  • 9
  • 1
    Nope, you're confusing them with `volatile`. Those are non-removable. And as for the memory barriers, note that the two operations are back-to-back. That is to say, there's no synchronization point in the middle where another thread is guaranteed able to see or touch the intermediate result. In order to see an intermediate result of thread X in thread Y, thread Y must be sequenced-after the calculation of the intermediate in X, but sequenced-before the overwrite in X. – MSalters Aug 04 '16 at 12:23