
I have this code in test.cpp:

#include <atomic>
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <thread>

static const int N_ITEMS = 11;
static const int N_WORKERS = 4;

int main(void)
{
    int* const items = (int*)std::malloc(N_ITEMS * sizeof(*items));
    for (int i = 0; i < N_ITEMS; ++i) {
        items[i] = i;
    }

    std::thread* const workers = (std::thread*)std::malloc(N_WORKERS * sizeof(*workers));
    std::atomic<int> place(0);
    for (int w = 0; w < N_WORKERS; ++w) {
        new (&workers[w]) std::thread([items, &place]() {
            int i;
            while ((i = place.fetch_add(1, std::memory_order_relaxed)) < N_ITEMS) {
                items[i] *= items[i];
                std::this_thread::sleep_for(std::chrono::seconds(1));
            }
        });
    }
    for (int w = 0; w < N_WORKERS; ++w) {
        workers[w].join();
        workers[w].~thread();
    }
    std::free(workers);

    for (int i = 0; i < N_ITEMS; ++i) {
        std::cout << items[i] << '\n';
    }

    std::free(items);
}

I compile like so on Linux:

c++ -std=c++11 -Wall -Wextra -pedantic test.cpp -pthread

When run, the program should print this:

0
1
4
9
16
25
36
49
64
81
100

I don't have much experience with C++, nor do I really understand atomic operations. According to the standard, will the worker threads always square item values correctly, and will the main thread always print the correct final values? I'm worried that the atomic variable will be updated out-of-sync with the item values, or something. If this is the case, can I change the memory order used with fetch_add to fix the code?

JudeMH
  • It appears that you are creating 4 threads. Each thread will attempt to square all the elements of the array items. Why do you want each element squared four times? That will certainly not produce your expected output. – Jim Rogers Apr 06 '21 at 19:48
  • @JimRogers The code does produce the desired output when I test it. I create one atomic variable which all the threads increment. This should prevent multiple threads from squaring the same element. – JudeMH Apr 06 '21 at 19:51

1 Answer


Looks safe to me. Your `i = place.fetch_add(1, ...)` allocates each array index exactly once, each one to exactly one of the threads. So any given array element is only touched by a single thread, and that's guaranteed to be safe for all types other than bit-fields of a struct¹.

Footnote 1: Or elements of std::vector<bool>, which the standard unfortunately requires to be a packed bit-vector, breaking some of the usual guarantees for std::vector.


There's no need for any ordering of these accesses while the worker threads are working; the main thread join()s the workers before reading the array, so everything done by the workers "happens before" (in ISO C++ standardese) the main thread's std::cout << items[i] accesses.

Of course, the array elements are all written by the main thread before the worker threads are started, but that's also safe, because the std::thread constructor makes sure everything earlier in the parent thread happens-before anything in the new thread:

The completion of the invocation of the constructor synchronizes-with (as defined in std::memory_order) the beginning of the invocation of the copy of f on the new thread of execution.
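
For example, here's a minimal standalone sketch (illustrative only, not your program) that relies on exactly those two guarantees: a plain non-atomic int is written before the thread is created, modified by the worker, and read after join(), with no atomics anywhere.

#include <iostream>
#include <thread>

int main()
{
    int value = 1; // written by main before the worker starts

    std::thread t([&value]() {
        // Safe: the std::thread constructor makes main's write of 1
        // happen-before anything this lambda does.
        value *= 10;
    });

    t.join(); // join() makes the worker's write happen-before this point

    std::cout << value << '\n'; // guaranteed to print 10
}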


There's also no need for any order stronger than std::memory_order_relaxed on the increment: it's the only atomic variable in your program, and you don't need any ordering between your operations other than the overall thread creation and join.

It's still atomic, so it's guaranteed that, say, 100 increments will hand out the numbers 0..99 exactly once each; there's just no guarantee about which thread gets which. (But there is a guarantee that each thread will see monotonically increasing values: for every atomic object separately, a modification order exists, and that order is consistent with some interleaving of the program-order of modifications to it.)
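
As a standalone illustration of that guarantee (names and sizes are just made up for the example), this sketch hands out 100 tickets with a relaxed fetch_add across 4 threads, then checks that every ticket was claimed exactly once:

#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

int main()
{
    const int n_tickets = 100;
    const int n_threads = 4;

    std::atomic<int> next(0);
    std::vector<int> claims(n_tickets, 0); // how many times each ticket was handed out

    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t) {
        workers.emplace_back([&next, &claims, n_tickets]() {
            int i;
            while ((i = next.fetch_add(1, std::memory_order_relaxed)) < n_tickets) {
                ++claims[i]; // only the thread that was handed ticket i touches claims[i]
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }

    bool ok = true;
    for (int i = 0; i < n_tickets; ++i) {
        if (claims[i] != 1) {
            ok = false;
        }
    }
    std::cout << (ok ? "every ticket claimed exactly once\n" : "bug!\n");
}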


Just for the record, this is hilariously inefficient compared to having each worker pick a contiguous range of indices to square. That would only take 1 atomic access per thread, or none at all if the main thread just passes each worker its range up front.

And it would avoid all the false sharing effects of having 4 threads loading and storing into the same cache line at the same time as they move through the array.

Contiguous ranges would also let the compiler auto-vectorize with SIMD to load/square/store multiple elements at once.
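
A sketch of that contiguous-range version of your program (same output, no atomics; the exact split arithmetic is just one way to do it):

#include <iostream>
#include <thread>
#include <vector>

static const int N_ITEMS = 11;
static const int N_WORKERS = 4;

int main()
{
    std::vector<int> items(N_ITEMS);
    for (int i = 0; i < N_ITEMS; ++i) {
        items[i] = i;
    }

    std::vector<std::thread> workers;
    for (int w = 0; w < N_WORKERS; ++w) {
        // Each worker gets one contiguous slice [begin, end); no atomics needed.
        const int begin = N_ITEMS * w / N_WORKERS;
        const int end = N_ITEMS * (w + 1) / N_WORKERS;
        workers.emplace_back([&items, begin, end]() {
            for (int i = begin; i < end; ++i) {
                items[i] *= items[i];
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }

    for (int i = 0; i < N_ITEMS; ++i) {
        std::cout << items[i] << '\n';
    }
}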

Peter Cordes
  • Thanks. Using a rough benchmark on the actual code the example is based on, I have found that incrementing the atomic is consistently a bit faster than dividing up the array. I think this is because in the actual code, processing an element takes an unpredictable amount of time. If I were to divide up the array, I would have to do some kind of work stealing to make it more efficient, which I don't want to bother with. – JudeMH Apr 06 '21 at 21:11
  • @JudeMH: That seems surprising. Are you sure you have optimization enabled (`g++ -O3 -march=native`)? And are you actually using a `sleep(1)` inside each worker? If you have a *huge* amount of work to do for each element, I guess that could mostly dwarf the cost of doing many extra atomic increments. Still, having a worker grab itself a *range* of work to do, with some chunk-size greater than 1, seems like a good idea. So for N threads, you might divide the work up into 8N chunks to account for speed differences. (Like OpenMP "dynamic" scheduling.) – Peter Cordes Apr 06 '21 at 21:18
  • I do have optimization on. The workers don't call `sleep(1)`. I'll take your advice and try splitting the work into larger chunks. (I should mention that each element is aligned to prevent false sharing and is over a kilobyte in size.) – JudeMH Apr 06 '21 at 21:57
  • @JudeMH: Oh, that's vastly different, already a significant chunk of work for each atomic increment, then. But still, probably something to gain from grabbing multiple chunks at once, stalling each core less often for access to the atomic element. (In x86 asm, the only way to do an atomic RMW is a `lock`ed operation, which includes a full memory barrier; i.e. it's equivalent to seq_cst except for compile-time reordering. So the atomic increment can't overlap with other memory operations, and your work can't start until you get an index anyway.) – Peter Cordes Apr 06 '21 at 22:43
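
To make the chunked scheduling suggested in the comments above concrete, here is a rough sketch (chunk size and item count are illustrative, not tuned) where each relaxed fetch_add claims a block of indices instead of a single one:

#include <algorithm>
#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

static const int N_ITEMS = 10000;
static const int N_WORKERS = 4;
static const int CHUNK = 64; // each atomic RMW hands out 64 indices, not 1

int main()
{
    std::vector<int> items(N_ITEMS);
    for (int i = 0; i < N_ITEMS; ++i) {
        items[i] = i;
    }

    std::atomic<int> place(0);
    std::vector<std::thread> workers;
    for (int w = 0; w < N_WORKERS; ++w) {
        workers.emplace_back([&items, &place]() {
            int begin;
            while ((begin = place.fetch_add(CHUNK, std::memory_order_relaxed)) < N_ITEMS) {
                const int end = std::min(begin + CHUNK, N_ITEMS);
                for (int i = begin; i < end; ++i) {
                    items[i] *= items[i]; // each index is still processed by exactly one thread
                }
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }

    std::cout << items[N_ITEMS - 1] << '\n'; // prints (N_ITEMS - 1) squared
}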