
The following pattern is commonplace in lots of software that wants to tell its user how many times it has done various things:

#include <cstdio>

int num_times_done_it; // global, unsynchronised

void doit() {
  ++num_times_done_it;
  // do something
}

void report_stats() {
  printf("called doit %i times\n", num_times_done_it);
  // and probably some other stuff too
}

Unfortunately, if multiple threads can call doit without some sort of synchronisation, the concurrent read-modify-writes to num_times_done_it constitute a data race, and hence the entire program's behaviour is undefined. Further, if report_stats can be called concurrently with doit, absent any synchronisation, there is another data race between the thread modifying num_times_done_it and the thread reporting its value.

Often, the programmer just wants a mostly-right count of the number of times doit has been called with as little overhead as possible.

(If you consider this example trivial, Hogwild! gains a significant speed advantage over a data-race-free stochastic gradient descent using essentially this trick. Also, I believe the Hotspot JVM does exactly this sort of unguarded, multithreaded access to a shared counter for method invocation counts---though it's in the clear since it generates assembly code instead of C++11.)

Apparent non-solutions:

  • Atomics, with any memory order I know of, fail "as little overhead as possible" here (an atomic increment can be considerably more expensive than an ordinary increment) while overdelivering on "mostly-right" (by being exactly right).
  • I don't believe tossing volatile into the mix makes data races OK, so replacing the declaration of num_times_done_it by volatile int num_times_done_it doesn't fix anything.
  • There's the awkward solution of having a separate counter per thread and adding them all up in report_stats, but that doesn't solve the data race between doit and report_stats. Also, it's messy, it assumes the updates are associative, and doesn't really fit Hogwild!'s usage.

Is it possible to implement invocation counters with well-defined semantics in a nontrivial, multithreaded C++11 program without some form of synchronisation?

EDIT: It seems that we can do this in a slightly indirect way using memory_order_relaxed:

#include <atomic>
using std::atomic;
using std::memory_order_relaxed;

atomic<int> num_times_done_it;

void doit() {
  num_times_done_it.store(1 + num_times_done_it.load(memory_order_relaxed),
                          memory_order_relaxed);
  // as before
}

However, gcc 4.8.2 generates this code on x86_64 (with -O3):

   0:   8b 05 00 00 00 00       mov    0x0(%rip),%eax
   6:   83 c0 01                add    $0x1,%eax
   9:   89 05 00 00 00 00       mov    %eax,0x0(%rip)

and clang 3.4 generates this code on x86_64 (again with -O3):

   0:   8b 05 00 00 00 00       mov    0x0(%rip),%eax
   6:   ff c0                   inc    %eax
   8:   89 05 00 00 00 00       mov    %eax,0x0(%rip)

My understanding of x86-TSO is that both of these code sequences are, barring interrupts and funny page protection flags, entirely equivalent to the one-instruction memory inc and the one-instruction memory add generated by the straightforward code. Does this use of memory_order_relaxed constitute a data race?

tmyklebu
    If you don't need the solution to be correct, it is of course possible to optimise it to be arbitrarily quick (with diminishing likelihood of producing useful results). Fast, Correct, Easy; pick two. – BenPope Apr 11 '14 at 07:06
  • @BenPope: That's an amusing soundbite, but it's not particularly relevant here. I'm asking how to handle what's at the time of writing an entirely theoretical problem. No compiler I know of will generate unacceptable code for the pattern in my post. However, this may change in the future... – tmyklebu Apr 11 '14 at 15:00
    I'm not sure I understand how having separate counters per thread is awkward. The data race with separate counters between `doit` and `report_stats` is mostly-right -- it is asymptotic to the correct answer. The data race with `doit` between threads is disastrous, giving the wrong answer in the end. – Mark Lakata Apr 11 '14 at 20:10
  • @MarkLakata: Yeah. Like I said, this is an entirely theoretical problem; I've never seen a current compiler blow this. However, the C++ standard says that *any* data race constitutes undefined behaviour, even the one between `doit` and `report_stats`. Regarding "disastrous," that depends on how the count is used! The JVM's invocation counts only need to be approximate; Hogwild! can be proven to "repair" a slightly wrong intermediate results; and the end user in the original example might only be interested in an approximation to the number of times a function has been called. – tmyklebu Apr 11 '14 at 21:19
  • @MarkLakata: Answering your direct question, I find it awkward in a couple of ways. First, it's a space blowup (perhaps a trivial one for a single counter, but less trivial if for some reason we have lots of them). Second, you'd *still* need to update them atomically to avoid the race between the threads calling `doit` and the thread calling `report_stats`. Fundamentally, though, it relies on an associative operation for combining threads' counters; it's somehow bothersome that a solution to a pure concurrency problem should rely on nontrivial properties of the operations being performed. – tmyklebu Apr 11 '14 at 21:27
  • @tmyklebu - the race condition between `doit` and `report_stats` is just one that is writing and one that is reading. Both these operations are atomic for integer (register sized) arithmetic on any architecture I know, so you don't need to use a lock to make them atomic. `report_stats` is either going to report the latest value or an older value, but it will never lie and give a totally wrong value. You do have to have a way of sync'ing pipelines and memory after all the threads complete, but that is a O(1) operation, so I'm not worried how costly that is. – Mark Lakata Apr 14 '14 at 00:48
    @MarkLakata: The C++ memory model is weaker than any machine memory model I know of (except Itanium), though; the data race between `doit` and `report_stats` ruins absolutely everything per the standard. It's the C++ memory model I'm working with here, not any particular machine's. (In particular, C++ explicitly states that *any* data race at all, even the one between `doit` and `report_stats`, results in undefined behaviour.) – tmyklebu Apr 14 '14 at 01:30

2 Answers


Count for each thread separately and sum up after the threads have joined. For intermediate results, you may also sum up in between, though your result might be off. This pattern is also faster. You might embed it into a basic helper class for your threads so you have it everywhere if you use it often.

And, depending on compiler and platform, atomics aren't that expensive (see Herb Sutter's "atomic Weapons" talk: http://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-1-of-2), but in your case they'll create cache-contention problems, so they're not advisable.

Tobias Langner

It seems that the memory_order_relaxed trick is the right way to do this.

This blog post by Dmitry Vyukov at Intel begins by answering exactly my question, and proceeds to list the memory_order_relaxed store and load as the proper alternative.

I am still unsure of whether this is really OK; in particular, N3710 makes me doubt that I ever understood memory_order_relaxed in the first place.

tmyklebu
  • Separately-atomic load and store, like `store( 1+load(relaxed), relaxed)`, will lose counts if you have a single shared counter; you'll still get asm like you show in the question. See [Can num++ be atomic for 'int num'?](https://stackoverflow.com/a/39396999) for that. You need `count.fetch_add(1, relaxed)` if you don't want per-thread counters. Or separately atomic load/store for per-thread counters if you want it to be remotely efficient, without bouncing a cache line between cores every time (and a full memory barrier on x86 which can't do relaxed RMWs). – Peter Cordes Jan 05 '23 at 12:27