This came up while thinking about ThreadSanitizer warnings after using parallel std::for_each. Algorithms like std::for_each with parallel execution policies may execute code in worker threads created by the implementation. Do those worker threads synchronize with the call to, and the return from, for_each in the calling thread? Common sense suggests that they should, but I can't find such a guarantee in the C++20 standard.
Consider the following simple example (try on godbolt):
#include <algorithm>
#include <execution>
#include <iostream>

void increment(int &a) {
    a++;
}

int main() {
    constexpr size_t n = 1000;
    static int arr[n];
    arr[0] = 3;
    std::for_each(std::execution::par, arr, arr + n, increment);
    std::cout << arr[0] << std::endl;
    return 0;
}
This is intended to always output 4.
The implementation may call increment(arr[0]) in another thread, which does arr[0]++. Does the store arr[0] = 3 in the main thread happen before that arr[0]++, in the sense of [intro.races] p10? Likewise, does arr[0]++ happen before the load of arr[0] in std::cout << arr[0]? I would naively expect that both do, but I can't see any way to prove it: [algorithms.parallel] doesn't seem to contain anything about synchronization with the surrounding code.
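For comparison, here is a minimal sketch of the kind of implementation I would naively expect, where both of those happens-before edges clearly exist. (This is purely my own illustration; naive_parallel_for_each and the one-thread-per-element strategy are assumptions for the sake of the sketch, not anything the standard or a real implementation promises.)

#include <thread>
#include <vector>

// Hypothetical stand-in for std::for_each(std::execution::par, first, last, f)
// that spawns one worker thread per element and joins them all before returning.
template <class It, class F>
void naive_parallel_for_each(It first, It last, F f) {
    std::vector<std::thread> workers;
    for (It it = first; it != last; ++it) {
        // The completion of the std::thread constructor synchronizes with the start
        // of the thread function, so everything sequenced before this call in the
        // caller (e.g. arr[0] = 3) happens before f(*it).
        workers.emplace_back([&f, it] { f(*it); });
    }
    for (std::thread &t : workers) {
        // The completion of each thread synchronizes with the return from join(),
        // so f(*it) happens before everything the caller does after the algorithm
        // returns (e.g. std::cout << arr[0]).
        t.join();
    }
}

With an implementation along these lines the example necessarily prints 4; the question is whether the standard actually requires equivalent synchronization.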
If not, then the example contains data races and its behavior is undefined. That would make it rather difficult to use std::execution::par correctly, and I would wonder whether this is a defect in the standard.
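As an illustration of the difficulty (again my own sketch, not a recommendation): under that pessimistic reading, even switching the element type to std::atomic<int> only removes the undefined behavior, not the uncertainty. There is no longer a data race, but without a guaranteed happens-before edge between the worker's increment and the final load, the program could still legitimately print 3 instead of 4.

#include <algorithm>
#include <atomic>
#include <cstddef>
#include <execution>
#include <iostream>

int main() {
    constexpr std::size_t n = 1000;
    static std::atomic<int> arr[n];   // atomic elements: accesses no longer race
    arr[0].store(3, std::memory_order_relaxed);
    std::for_each(std::execution::par, arr, arr + n,
                  [](std::atomic<int> &a) { a.fetch_add(1, std::memory_order_relaxed); });
    // Without a happens-before edge, this load is not required to observe the
    // worker's fetch_add: printing 3 would still be a conforming outcome.
    std::cout << arr[0].load(std::memory_order_relaxed) << std::endl;
}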
Without such a guarantee, the implementation could conceivably do something like the following:
#include <atomic>
#include <thread>

static int arr[1000];   // the array from the example above

std::atomic<int *> work = nullptr;

void do_work() {
    int *p;
    while (!(p = work.load(std::memory_order_relaxed)))
        std::this_thread::yield();
    (*p)++;
}

// started at program startup
std::thread worker_thread(do_work);

int main() {
    // ...
    arr[0] = 3;
    // for_each does the following:
    work.store(&arr[0], std::memory_order_relaxed);
    worker_thread.join();
    // ...
}
If it did, then we really would have a data race: the relaxed store and load of work do not establish a synchronizes-with relationship, so nothing orders the plain store arr[0] = 3 before the worker's (*p)++.
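For completeness, and as my own sketch rather than anything the standard mandates: the race in this hypothetical comes entirely from the relaxed ordering on work. A release store paired with an acquire load would make arr[0] = 3 happen before (*p)++, and worker_thread.join() already makes (*p)++ happen before the later read of arr[0], so this variant of the same hypothetical implementation would be fine.

#include <atomic>
#include <thread>

static int arr[1000];

std::atomic<int *> work = nullptr;

void do_work() {
    int *p;
    // acquire: reading the pointer synchronizes with the release store below
    while (!(p = work.load(std::memory_order_acquire)))
        std::this_thread::yield();
    (*p)++;   // now happens after arr[0] = 3 in main
}

std::thread worker_thread(do_work);

int main() {
    arr[0] = 3;
    // release: publishes arr[0] = 3 to the worker that acquires `work`
    work.store(&arr[0], std::memory_order_release);
    worker_thread.join();   // (*p)++ happens before anything sequenced after this
}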