I have a code block that looks like this
std::vector<uint32_t> flags(n, 0);
#pragma omp parallel for
for(int i = 0; i <v; i++) {
// If any thread finds it true, its true.
// Max value of j is n.
for(auto& j : vec[i])
flags[j] = true;
}
Work based upon the flags.
Is there any need for a mutex ? I understand cache coherence is going to make sure all the write buffers are synchronized, and conflicting buffers will not be written to memory. Secondly the overhead of cache coherence can be avoided by simply changing
flags[j] = true;
to
if(!flags[j]) flags[j] = true;
The check if flags[j] is already set will reduce the write frequency thus need for cache coherency updates. And even if by any chance flags[j] is read to be false it will only end up in one extra write to flags[j] which is okay.
EDIT :
Yes multiple threads may and will try to write to the same index in flags[j]. Hence the question.
uint32_t has intentionally been used and bool is not used since writing to a boolean in parallel can malfunction as the neighboring booleans share the same byte. But writing to the same uint32_t in parallel from different threads will not malfunction in the same manner as booleans even without mutex.
FWIW , to comply with the standards, I ended up keeping this code which more or less complies with the standards not 100% though. But the non-standard code shown above did not fail in tests. I thought for once that it would fail in multi socket machines but turns out x86 also provide multi socket level cache coherence.
#pragma omp parallel
{
std::vector<uint32_t> flags_local(n, 0);
#pragma omp parallel for
for(int i = 0; i <v; i++) {
for(auto& j : vec[i])
flags_local[j] = true;
}
// No omp directive here, as all threads
// need to traverse their full arrays.
for(int j = 0; j <n; i++) {
if(flags_local[j] && !flags[j]) {
#pragma omp critical
{ flags[j] = true; }
}
}
}