Is mutex needed when all threads set flags in an array based on search result

Question

I have a code block that looks like this

std::vector<uint32_t> flags(n, 0);
#pragma omp parallel for 
for(int i = 0; i <v; i++) {
   // If any thread finds it true, its true. 
   // Max value of j is n.
   for(auto& j : vec[i]) 
      flags[j] = true; 
}

Work based upon the flags.

Is there any need for a mutex ? I understand cache coherence is going to make sure all the write buffers are synchronized, and conflicting buffers will not be written to memory. Secondly the overhead of cache coherence can be avoided by simply changing

flags[j] = true; 

to 

if(!flags[j]) flags[j] = true;

The check if flags[j] is already set will reduce the write frequency thus need for cache coherency updates. And even if by any chance flags[j] is read to be false it will only end up in one extra write to flags[j] which is okay.

EDIT :

Yes multiple threads may and will try to write to the same index in flags[j]. Hence the question.

uint32_t has intentionally been used and bool is not used since writing to a boolean in parallel can malfunction as the neighboring booleans share the same byte. But writing to the same uint32_t in parallel from different threads will not malfunction in the same manner as booleans even without mutex.

FWIW , to comply with the standards, I ended up keeping this code which more or less complies with the standards not 100% though. But the non-standard code shown above did not fail in tests. I thought for once that it would fail in multi socket machines but turns out x86 also provide multi socket level cache coherence.

#pragma omp parallel 
{

std::vector<uint32_t> flags_local(n, 0);
#pragma omp parallel for
for(int i = 0; i <v; i++) {
   for(auto& j : vec[i]) 
     flags_local[j] = true; 
}

// No omp directive here, as all threads 
// need to traverse their full arrays.
for(int j = 0; j <n; i++) {
   if(flags_local[j] && !flags[j]) {
      #pragma omp critical 
      { flags[j] = true; }
   }
}

}

Is there a possible `j` belong to both `vec[i1]` and `vec[i2]` (so concurrent write to `flags[j]`)? — Jarod42, Jan 17 '22 at 15:25
@YSC Let us assume there is no relation between n and v, but the value of an element of vec[i], where i goes from 0 to v, can at most be equal to n. — Oner, Jan 17 '22 at 15:32
@Jarod42, exactly that that is why I am confused whether I should use a mutex or not, but the thing is in such a case both threads would want to write true, and it is okay if only one succeeds, its not like one is going to cancel the other. — Oner, Jan 17 '22 at 15:33
It is not as simple than that; concurrent write is a race condition, so UB. `std::vector>` would be an alternative to mutex. — Jarod42, Jan 17 '22 at 15:37
@Jarod42, this is specifically my question, what would be the problem if we write the same value to an uint32_t in parallel. Since the write operations to 32 bit integers are atomic for all CPUs. — Oner, Jan 17 '22 at 15:54
@Jarod42 Just note that you cannot reallocate a vector of atomics, since atomic types are neither movable nor copyable. Since C++20, this problem may be solved with `std::atomic_ref`. — Daniel Langr, Jan 17 '22 at 15:55

463035818_is_not_an_ai · Answer 1 · 2022-01-17T15:56:11.913

0

Thread safety in C++ is such that you need not worry about cache coherency and such hardware related issues. What matters is what is specified in the C++ standard, and I don't think it mentions cache coherency. Thats for the implementers to worry about.

For writing to elements of a std::vector the rules are actually rather simple: Writing to distinct elements of a vector is thread-safe. Only if two threads write to the same index you need to synchronize the access (and for that it does not matter whether both threads write the same value or not).

As pointed out by Evg, I made a rough simplification. What counts is that all threads access different memory locations. Hence, with a std::vector<bool> things wouldn't be that simple, because typically several elements of a std::vector<bool> are stored in a single byte.

Yes multiple treads may and will try to write to the same index in flags[j].

Then you need to synrchonize the access. The fact that all elements are false initially and all writes do write true is not relevant. What counts is that you have multiple threads that access the same memory and at least one writes to it. When this is the case you need to synchronize the access or you have a data race.

edited Jan 17 '22 at 15:56

answered Jan 17 '22 at 15:35

463035818_is_not_an_ai

109,796
11
89
185

1

Unless it's a `std::vector`. – Evg Jan 17 '22 at 15:40
Thanks @Evg, yes that is the reason I specifically used uint32_t – Oner Jan 17 '22 at 15:47
@Oner Any reason why not to use just `uint8_t` instead of `uint32_t`? – Daniel Langr Jan 17 '22 at 15:53
@DanielLangr, in other experiments I found that uint8_t introduces more overhead due to cache coherency than uint32_t, and performs much slower. – Oner Jan 17 '22 at 15:59
@Oner Ok, that's likely due to _false sharing_ issues. – Daniel Langr Jan 17 '22 at 16:00

Jarod42 · Answer 2 · 2022-01-17T16:05:06.057

0

Accessing a variable concurrently to a write is a race condition, and so it is undefined behavior.

flags[j] = true;

should be protected.

Alternatively you might use atomic types (but see how-to-declare-a-vector-of-atomic-in-c++).

or even simpler using std::atomic_ref (c++20)

std::vector<uint32_t> flags(n, 0);
#pragma omp parallel for 
for (int i = 0; i < v; i++) {
   for (auto& flag : vec[i]) {
       auto atom_flag = std::atomic_ref<std::uint32_t>(flag);
       atom_flag = true; 
   }
}

edited Jan 17 '22 at 16:05

answered Jan 17 '22 at 15:51

Jarod42

203,559
14
181
302

Write to 32 bit integers are atomic. Even if it was not atomic both threads are trying to write the same value. So the race condition should be okay I think. – Oner Jan 17 '22 at 15:57
@Oner That may be true for some particular architecture, but not generally. Note that even non-aligned 32-bit writes may not be atomic on x86_64. Moreover, without atomics, effects of the as-if rule may, for example, optimize out the reads completely. – Daniel Langr Jan 17 '22 at 15:59
@Oner the C++ standard is hardware agnostic and it says that `uint32_t` is not atomic. – 463035818_is_not_an_ai Jan 17 '22 at 16:02
With OpenMP, you don't even need `std::atomic_ref`. We have `#pragma omp atomic write` (possibly relaxed here). – Daniel Langr Jan 17 '22 at 16:09
@463035818_is_not_a_number I myself have used a different block of code which is safe, I posted this question more out of curiosity. Now assume writing to uint32_t is not atomic, one thread writes the first 16 bits, the other writes the last 16 bits. But they are writing the same value. Yes there is a race condition but its effects are bypassed, as no thread is going to write false. So in the end if its meant to be true, it should be true. Or am I missing anything else. – Oner Jan 17 '22 at 16:10

Is mutex needed when all threads set flags in an array based on search result

2 Answers2