This came up while thinking about ThreadSanitizer warnings after using parallel std::for_each. Algorithms like std::for_each with parallel execution policies may execute code in worker threads created by the implementation. Do those worker threads synchronize with the call to, and the return from, for_each in the calling thread? Common sense suggests that they should, but I can't find such a guarantee in the C++20 standard.
Consider the following simple example (try on godbolt):
#include <algorithm>
#include <execution>
#include <iostream>

void increment(int &a) {
    a++;
}

int main() {
    constexpr size_t n = 1000;
    static int arr[n];
    arr[0] = 3;
    std::for_each(std::execution::par, arr, arr + n, increment);
    std::cout << arr[0] << std::endl;
    return 0;
}
This is intended to always output 4.
The implementation may call increment(arr[0]) in another thread, which does arr[0]++. Does the store arr[0] = 3 in the main thread happen before that arr[0]++, in the sense of [intro.races] p10? Likewise, does arr[0]++ happen before the load of arr[0] in std::cout << arr[0]? I would naively expect that both do, but I can't see any way to prove it: [algorithms.parallel] doesn't seem to contain anything about synchronization with the surrounding code.
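For comparison, here is a minimal sketch of the kind of implementation I would naively expect, where both of those happens-before edges clearly exist. (This is purely my own illustration; naive_parallel_for_each and the one-thread-per-element strategy are assumptions for the sake of the sketch, not anything the standard or a real implementation promises.)

#include <thread>
#include <vector>

// Hypothetical stand-in for std::for_each(std::execution::par, first, last, f)
// that spawns one worker thread per element and joins them all before returning.
template <class It, class F>
void naive_parallel_for_each(It first, It last, F f) {
    std::vector<std::thread> workers;
    for (It it = first; it != last; ++it) {
        // The completion of the std::thread constructor synchronizes with the start
        // of the thread function, so everything sequenced before this call in the
        // caller (e.g. arr[0] = 3) happens before f(*it).
        workers.emplace_back([&f, it] { f(*it); });
    }
    for (std::thread &t : workers) {
        // The completion of each thread synchronizes with the return from join(),
        // so f(*it) happens before everything the caller does after the algorithm
        // returns (e.g. std::cout << arr[0]).
        t.join();
    }
}

With an implementation along these lines the example necessarily prints 4; the question is whether the standard actually requires equivalent synchronization.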
If not, then the example contains data races and its behavior is undefined. That would make it rather difficult to use std::execution::par correctly, and I would wonder whether this is a defect in the standard.
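As an illustration of the difficulty (again my own sketch, not a recommendation): under that pessimistic reading, even switching the element type to std::atomic<int> only removes the undefined behavior, not the uncertainty. There is no longer a data race, but without a guaranteed happens-before edge between the worker's increment and the final load, the program could still legitimately print 3 instead of 4.

#include <algorithm>
#include <atomic>
#include <cstddef>
#include <execution>
#include <iostream>

int main() {
    constexpr std::size_t n = 1000;
    static std::atomic<int> arr[n];   // atomic elements: accesses no longer race
    arr[0].store(3, std::memory_order_relaxed);
    std::for_each(std::execution::par, arr, arr + n,
                  [](std::atomic<int> &a) { a.fetch_add(1, std::memory_order_relaxed); });
    // Without a happens-before edge, this load is not required to observe the
    // worker's fetch_add: printing 3 would still be a conforming outcome.
    std::cout << arr[0].load(std::memory_order_relaxed) << std::endl;
}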
Without such a guarantee, the implementation could conceivably do something like the following:
#include <atomic>
#include <thread>

static int arr[1000];   // the array from the example above

std::atomic<int *> work = nullptr;

void do_work() {
    int *p;
    while (!(p = work.load(std::memory_order_relaxed)))
        std::this_thread::yield();
    (*p)++;
}

// started at program startup
std::thread worker_thread(do_work);

int main() {
    // ...
    arr[0] = 3;
    // for_each does the following:
    work.store(&arr[0], std::memory_order_relaxed);
    worker_thread.join();
    // ...
}
If it did, then we really would have a data race: the relaxed store and load of work do not establish a synchronizes-with relationship, so nothing orders the plain store arr[0] = 3 before the worker's (*p)++.
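For completeness, and as my own sketch rather than anything the standard mandates: the race in this hypothetical comes entirely from the relaxed ordering on work. A release store paired with an acquire load would make arr[0] = 3 happen before (*p)++, and worker_thread.join() already makes (*p)++ happen before the later read of arr[0], so this variant of the same hypothetical implementation would be fine.

#include <atomic>
#include <thread>

static int arr[1000];

std::atomic<int *> work = nullptr;

void do_work() {
    int *p;
    // acquire: reading the pointer synchronizes with the release store below
    while (!(p = work.load(std::memory_order_acquire)))
        std::this_thread::yield();
    (*p)++;   // now happens after arr[0] = 3 in main
}

std::thread worker_thread(do_work);

int main() {
    arr[0] = 3;
    // release: publishes arr[0] = 3 to the worker that acquires `work`
    work.store(&arr[0], std::memory_order_release);
    worker_thread.join();   // (*p)++ happens before anything sequenced after this
}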