Given a shared array of integer counters, can a thread atomically fetch-and-add an array element without locking the entire array?

Here's an illustration of a working model that uses a mutex to lock access to the entire array:

// thread-shared class members
std::mutex count_array_mutex_;
std::vector<int> counter_array_(100);  // ~100 elements

// Thread critical section
int counter_index = ... // unpredictable index
int current_count;
{
  std::lock_guard<std::mutex> lock(count_array_mutex_);
  current_count = counter_array_[counter_index]++;
}
// ... do stuff using current_count.

I'd like multiple threads to be able to fetch-add separate array elements simultaneously.

So far, in my research of std::atomic<int>, I'm thrown off that constructing the atomic object also constructs the protected member. (And there are plenty of answers explaining why you can't make a std::vector<std::atomic<int>>.)

NoahR
  • Separate array elements are already separate memory locations for the purposes of the Standard language on data races. (Data races only occur when there are conflicting accesses to the "same memory location") – Ben Voigt Feb 05 '19 at 19:29
  • @BenVoigt I think I understand your point. My use case is trying to illustrate that multiple threads are in this critical section with the same value of `counter_index`. Thus, atomic access to each element is needed. In general, the likelihood of simultaneous matching `counter_index` is much, much less than that of multiple threads in the critical section simultaneously. I conjecture that atomic access to elements should be faster than locking the entire array data structure down to single-thread access. – NoahR Feb 05 '19 at 22:30
  • I guess I don't understand why the concept of a placement-atomic doesn't exist. e.g. a std::atomic constructor that takes an element reference. Access through the atomic wrappers is atomic, access through the original reference is not. Maybe I'm stumbling upon the reason. Are there technical reasons that multiple threads must access through the same atomic *instance*? Rather than one atomic instance per thread, protecting the (potentially) same protected element. – NoahR Feb 05 '19 at 22:39
  • Yes, it's a crossroads of technical and portability reasons. Many popular platforms guarantee atomicity via a machine-global memory bus lock. These don't require the atomic object to store any supporting state. But atomic access *to objects larger than the platform atomic primitives support* is implemented in software using locks. That requires that all handlers of the atomic data use the same atomic accessor object, so they use the same lock. I had a related question: https://stackoverflow.com/q/18321244/103167 – Ben Voigt Feb 05 '19 at 22:52
  • Do note that for reducing contention, it's perfectly viable to do `std::lock_guard lock{count_array_mutex_[somehash(counter_index)]};` where `somehash` could be as simple as (for example) `counter_index >> 4` or `counter_index & 7`; see the sketch after these comments. – Ben Voigt Feb 05 '19 at 23:05
  • Finally, whether you use a single lock, atomic access instructions, or a lock-per-subset, beware of false sharing, which can trash performance by causing cache thrashing. – Ben Voigt Feb 05 '19 at 23:07
  • @BenVoigt: It's not *actually* a machine-global bus lock. You can have N cores each doing an atomic fetch_add on different cache lines in parallel without competing with each other at all. They each just keep the relevant cache line in MESI "Modified" state so they're sure that no other CPU or device in the system can read or write it during the RMW. See my comments under the answer on [Is C++11 atomic usable with mmap?](//stackoverflow.com/posts/comments/95895272), and [Can num++ be atomic for 'int num'?](//stackoverflow.com/a/39414316) – Peter Cordes Feb 06 '19 at 06:09
  • @PeterCordes: I guess you're talking about x86 alone now. That requires machine-wide synchronization to get to the `E` state (prerequisite for `M`). But the programming model is a machine-global lock, even if the cache coherency protocol enables such optimizations, and some x86 processors did use a global bus lock. The global bus lock was very common for single-core (but multiple bus masters) computing, and x86 still has to use the bus lock for (non-cacheable) regions which can be touched by bus masters not participating in cache coherency. – Ben Voigt Feb 06 '19 at 13:58
  • @BenVoigt: I actually phrased that to be true for LL/SC machines as well, including ARM, PowerPC, MIPS, etc., where instead of delaying the response to a request to share (cache lock), the store-conditional would fail if they lost access to the line since the LL. Or not, if it stays in M state. But yes, I'm assuming cacheable memory and aligned objects, because that's what you get from any normal C++ implementation for `atomic`. Cache-line-split `lock`ed instructions on x86 do need to take a bus lock (very very slow), I think similar to the legacy `#LOCK` signal the manuals mention. – Peter Cordes Feb 06 '19 at 14:06
  • Re: getting a line into `E` / `M` state: yes of course, but CPUs can already do that very efficiently like they do for normal stores. (e.g. the tags of the shared L3 cache in Intel CPUs acts as a snoop filter for all the cores in that package, because it's inclusive. Skylake-X changed this somewhat.) – Peter Cordes Feb 06 '19 at 14:10
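
A minimal sketch of the lock-striping idea from Ben Voigt's comment above, assuming a fixed stripe count of 8 and the `counter_index & 7` hash he mentions; the function name is illustrative:

#include <array>
#include <cstddef>
#include <mutex>
#include <vector>

// One mutex guards a subset ("stripe") of counters, so threads that hit
// different stripes never contend. Stripe count and hash are arbitrary here.
std::array<std::mutex, 8> count_array_mutex_;
std::vector<int> counter_array_(100);

int fetch_add_striped(std::size_t counter_index)
{
    std::lock_guard<std::mutex> lock(count_array_mutex_[counter_index & 7]);
    return counter_array_[counter_index]++;  // fetch-and-add under this stripe's lock
}

Per the false-sharing warning above: adjacent counters still share cache lines, so heavily contended counters may want padding or alignment regardless of the locking scheme.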

2 Answers


C++20 / C++2a (or whatever you want to call it) will add std::atomic_ref<T> which lets you do atomic operations on an object that wasn't atomic<T> to start with.

It's not available yet as part of the standard library for most compilers, but there is a working implementation for gcc/clang/ICC / other compilers with GNU extensions.

Previously, atomic access to "plain" data was only available through platform-specific functions, like Microsoft's LONG InterlockedExchange(LONG volatile *Target, LONG Value); or the GNU C / C++ builtin
type __atomic_add_fetch (type *ptr, type val, int memorder) (the same builtins that C++ standard libraries for GNU compilers use to implement std::atomic<T>).
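
For example, with GNU compilers the builtin can operate directly on an element of a plain int array; a minimal sketch (the array and function names are illustrative):

// Atomic RMW on a plain int, with no std::atomic<T> object involved.
// __atomic_fetch_add returns the old value, like std::atomic::fetch_add.
int counters[100];

int fetch_add_element(int idx)
{
    return __atomic_fetch_add(&counters[idx], 1, __ATOMIC_RELAXED);
}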

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0019r8.html includes some intro stuff about the motivation. CPUs can easily do this, compilers can already do this, and it's been annoying that C++ didn't expose this capability portably.

So instead of having to wrestle with C++ to get all the non-atomic allocation and init done in a constructor, you can just have every access create an atomic_ref to the element you want to access. (It's free to instantiate as a local, at least when it's lock-free, on any "normal" C++ implementation.)

This will even let you do things like resize the std::vector<int> after you've ensured no other threads are accessing the vector elements or the vector control block itself. And then you can signal the other threads to resume.
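
A hedged sketch of that pattern, using a std::shared_mutex to separate "many threads doing atomic element access" from "one thread resizing" (the guard and function names are mine, not part of the proposal):

#include <atomic>
#include <cstddef>
#include <shared_mutex>
#include <vector>

std::shared_mutex vec_guard;   // protects the vector *structure*, not its elements
std::vector<int> v(100);

void worker_increment(std::size_t idx)
{
    std::shared_lock lock(vec_guard);    // many workers may hold this at once
    std::atomic_ref<int> elem(v[idx]);   // element-level atomicity
    elem.fetch_add(1, std::memory_order_relaxed);
}

void grow(std::size_t new_size)
{
    std::unique_lock lock(vec_guard);    // excludes all workers during reallocation
    v.resize(new_size);                  // plain non-atomic access is fine here
}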

It's not yet implemented in libstdc++ or libc++ for gcc/clang.

#include <vector>
#include <atomic>

#define Foo std   // this atomic_ref.hpp puts it in namespace Foo, not std.
// current raw url for https://github.com/ORNL/cpp-proposals-pub/blob/master/P0019/atomic_ref.hpp
#include "https://raw.githubusercontent.com/ORNL/cpp-proposals-pub/580934e3b8cf886e09accedbb25e8be2d83304ae/P0019/atomic_ref.hpp"


void inc_element(std::vector<int> &v, size_t idx)
{
    v[idx]++;
}

void atomic_inc_element(std::vector<int> &v, size_t idx)
{
    std::atomic_ref<int> elem(v[idx]);
    static_assert(decltype(elem)::is_always_lock_free,
           "performance is going to suck without lock-free atomic_ref<T>");

    elem.fetch_add(1, std::memory_order_relaxed);  // take your pick of memory order here
}

For x86-64, these compile exactly the way we'd hope with GCC, using the sample implementation (for compilers implementing GNU extensions) linked in the C++ working-group proposal. https://github.com/ORNL/cpp-proposals-pub/blob/master/P0019/atomic_ref.hpp

From the Godbolt compiler explorer with g++8.2 -Wall -O3 -std=gnu++2a:

inc_element(std::vector<int, std::allocator<int> >&, unsigned long):
    mov       rax, QWORD PTR [rdi]          # load the pointer member of std::vector
    add       DWORD PTR [rax+rsi*4], 1      # and index it as a memory destination
    ret

atomic_inc_element(std::vector<int, std::allocator<int> >&, unsigned long):
    mov       rax, QWORD PTR [rdi]
    lock add  DWORD PTR [rax+rsi*4], 1     # same but atomic RMW
    ret

The atomic version is identical except it uses a lock prefix to make the read-modify-write atomic, by making sure no other core can read or write the cache line while this core is in the middle of atomically modifying it. Just in case you were curious how atomics work in asm.

Most non-x86 ISAs, like AArch64, of course require an LL/SC retry loop to implement an atomic RMW, even with relaxed memory order.

The point here is that constructing / destructing the atomic_ref doesn't cost anything. Its member pointer fully optimizes away. So this is exactly as cheap as a vector<atomic<int>>, but without the headache.

That holds as long as you're careful not to create data-race UB by resizing the vector, or by accessing an element without going through atomic_ref. (It would potentially manifest as a use-after-free on many real implementations if std::vector reallocated the memory in parallel with another thread indexing into it; and of course you'd be atomically modifying a stale copy.)

This definitely gives you rope to hang yourself if you don't carefully respect the fact that the std::vector object itself is not atomic, and also that the compiler won't stop you from doing non-atomic access to the underlying v[idx] after other threads have started using it.

Peter Cordes
  • Wow. I'm looking forward to `std::atomic_ref`. Thank you for the very thorough tutorial. Above and beyond. "Just in case you were curious how atomics work in asm" — of course I am! :-P – NoahR Feb 06 '19 at 14:33
  • @NoahR: Half the motivation for writing this was to find out for myself how `atomic_ref` worked, and to find out if any compilers implemented it yet. I was hoping this is how it would work. :) – Peter Cordes Feb 06 '19 at 14:35
  • Just to let you know - MSVC v19.28 (since November 2020) now supports atomic_ref, and when compiled with /O2 the assembly is optimized relatively well too. There are a few extra instructions but no memory overhead: https://godbolt.org/z/ePf3oWa43 – CygnusX1 Jul 05 '21 at 14:03
  • @CygnusX1: Yeah, looks like it doesn't know it can optimize to `lock add` when the return value from fetch-add isn't needed. And for some reason it uses LEA instead of a complex addressing mode in `lock xadd`. clang and gcc manage both those optimizations: https://godbolt.org/z/W5Ms6vzTj. But that's still minor, only a bit of code-size and a few extra uops, when the front-end usually isn't the bottleneck for code doing atomic RMWs. – Peter Cordes Jul 05 '21 at 16:49

One way:

#include <atomic>
#include <cstdlib>   // std::rand
#include <vector>

// Create.
std::vector<std::atomic<int>> v(100);
// Initialize.
for(auto& e : v)
    e.store(0, std::memory_order_relaxed);

// Atomically increment.
auto unpredictable_index = std::rand() % v.size();
int old = v[unpredictable_index].fetch_add(1, std::memory_order_relaxed);

Note that the std::atomic<> copy constructor is deleted, so the vector cannot be resized and must be constructed with the final count of elements.
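
A quick way to see this (the static_assert is my illustration, not from the original answer):

#include <atomic>
#include <type_traits>

// std::atomic's copy constructor is deleted, so vector can't relocate elements.
static_assert(!std::is_copy_constructible_v<std::atomic<int>>);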

Since the resize functionality of std::vector is lost anyway, instead of std::vector you may as well use std::unique_ptr<std::atomic<int>[]>, e.g.:

#include <atomic>
#include <cstdlib>   // std::rand
#include <memory>

// Create.
unsigned const N = 100;
std::unique_ptr<std::atomic<int>[]> p(new std::atomic<int>[N]);
// Initialize.
for(unsigned i = 0; i < N; ++i)
    p[i].store(0, std::memory_order_relaxed);

// Atomically increment.
auto unpredictable_index = std::rand() % N;
int old = p[unpredictable_index].fetch_add(1, std::memory_order_relaxed);
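
A minor variation (my suggestion, not part of the original answer): std::make_unique<T[]>(N) value-initializes the array, which zero-initializes every element, combining the allocation and initialization steps. Strictly portable pre-C++20 code may still prefer the explicit store loop above, since atomic initialization was only fully pinned down in C++20 (P0883):

// Does new std::atomic<int>[N]() under the hood: every element starts at 0.
auto p = std::make_unique<std::atomic<int>[]>(N);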
Maxim Egorushkin
  • I reckon the second approach without vector is better, as resize() is broken. It is hard to understand what resizing and reallocating an array of atomics would mean, in terms of atomic operations, anyway. You could possibly enable casting to a span to allow the STL container API to be used? – Gem Taylor Feb 05 '19 at 17:09
  • @GemTaylor I do not think one can resize a vector while simultaneously reading/updating its elements; hence, no atomicity is necessary when resizing it. – Maxim Egorushkin Feb 05 '19 at 17:32
  • I guess @MaximEgorushkin's answer appears to work. Thank you. This is a bit yucky, since I have to work with atomic instances everywhere else, in non-threaded regions where it would be more natural to work with the simple integer array[] elements, pending any other approaches. – NoahR Feb 05 '19 at 22:34
  • @NoahR: [C++2a adds `std::atomic_ref`](https://en.cppreference.com/w/cpp/atomic/atomic_ref), which lets you do lockless stuff in one part of your program, and plain non-atomic one-thread-at-a-time access in other parts of your program. E.g., after you've ensured no other threads are accessing the vector, you could resize it and then let other threads resume, as long as all their accesses are through `std::atomic_ref elem(vec[i]);` (which *should* be free to instantiate, at least when it's lock-free.) – Peter Cordes Feb 06 '19 at 05:54