
I have a std::vector of std::pair&lt;Object, bool&gt;, filled before the parallel loop. The bools are all initialised to true. The loop is approximately as follows:

for (int x = 0; x < xMax; ++x) // can parallelising the loop in x cause a data race?
    for (int y = 0; y < yMax; ++y)
        for (auto& i : vector)
            if (i.first.ConstantFunctionDependingOnlyOnInput(x, y))
                i.second = false;

Since we're only ever setting the bool to false, I don't see this causing a data race, but I don't trust my gut on multi-threading. Operations on the result of this bool are done in a single thread afterwards (erasing all elements where bool == true from the vector using standard algorithms).
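That single-threaded cleanup step can be sketched with the standard erase/remove idiom (a minimal sketch; `eraseFlagged` is a made-up name and `int` stands in for `Object`):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Erase every element whose flag is still true, single-threaded,
// after the parallel loop has finished. `int` stands in for Object.
std::size_t eraseFlagged(std::vector<std::pair<int, bool>>& v) {
    auto it = std::remove_if(v.begin(), v.end(),
                             [](const std::pair<int, bool>& p) { return p.second; });
    std::size_t removed = static_cast<std::size_t>(v.end() - it);
    v.erase(it, v.end());
    return removed;
}
```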

Advice here would be appreciated. I was going to use std::atomics, but of course, they can't be used in a std::vector since they're not copy-constructible.

Cheers!

curiousguy
Goobley
  • My final solution was to use a wrapper around atomics as described [here](http://stackoverflow.com/questions/13193484/how-to-declare-a-vector-of-atomic-in-c) and then just have the bools as atomic. It performed imperceptibly slower in single threaded and got a decent, if not perfect parallel speed-up. Results are the same and I can't see there being a data-race as no elements are added or removed. – Goobley Aug 26 '15 at 12:52
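The wrapper approach that comment describes can be sketched roughly as follows (a sketch only; `AtomicFlag` and `demo` are made-up names, and the non-atomic copy is safe here only because copies happen while the vector is being filled, before any worker threads start):

```cpp
#include <atomic>
#include <cassert>
#include <vector>

// Copyable wrapper around std::atomic<bool>, in the spirit of the linked
// answer. The copy operations are NOT atomic; they are only used during
// single-threaded setup, never while worker threads are running.
struct AtomicFlag {
    std::atomic<bool> value;
    AtomicFlag(bool b = true) : value(b) {}
    AtomicFlag(const AtomicFlag& other) : value(other.value.load()) {}
    AtomicFlag& operator=(const AtomicFlag& other) {
        value.store(other.value.load());
        return *this;
    }
};

bool demo() {
    std::vector<AtomicFlag> flags(4);                       // compiles: wrapper is copyable
    flags[2].value.store(false, std::memory_order_relaxed); // safe from any thread
    return flags[2].value.load(std::memory_order_relaxed);
}
```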

2 Answers


Here's an example of a way this can fail, and real-world code has failed precisely this way.

    for (auto& i : vector)
        if (i.first.ConstantFunctionDependingOnlyOnInput(x, y))
            i.second = false;

The compiler might optimize this code as follows:

    for (auto& i : vector)
    {
        bool j = i.second;
        bool k = i.first.ConstantFunctionDependingOnlyOnInput(x, y);
        i.second = k ? false : j;
    }

This can cause one thread to overwrite the result written by another thread. It can be a legitimate optimization because an unconditional write can be cheaper than a conditional one, since it can't be mispredicted.
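The failure mode can be re-enacted deterministically by interleaving the steps by hand (a sequential simulation of the race, not an actual multithreaded run; `simulateLostUpdate` is a made-up name):

```cpp
#include <cassert>

// Re-enacts, sequentially, the interleaving the transformed loop allows:
// thread A caches i.second, thread B stores false, then thread A writes the
// stale value back. Deterministic simulation, not a real data race.
bool simulateLostUpdate() {
    bool second = true;      // i.second before the loop
    bool j = second;         // thread A: reads the old value
    bool k = false;          // thread A: its condition is false, so keep j
    second = false;          // thread B: its write lands in between
    second = k ? false : j;  // thread A: unconditionally writes true back
    return second;           // true means thread B's update was lost
}
```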

David Schwartz
  • I doubt there is any combination of platform/compiler where this would be of any issue. – SergeyA Aug 25 '15 at 20:09
  • I do too. But that's a horrible way to write software and I have decades of experience seeing that kind of thinking cause extreme pain. – David Schwartz Aug 25 '15 at 20:10
  • To be honest, I am not even sure if this is undefined. Being one object doesn't mean they are the same memory location. – SergeyA Aug 25 '15 at 20:11
  • How are they the same object? They are members of the same structure, yes, but so are individual entries in an array. i.first and i.second are their own memory locations. – Eclipse Aug 25 '15 at 20:11
  • Actually, I take that back. I can see a way it can fail. If one thread caches `i.second` in a register while another thread sets it to `false`. – David Schwartz Aug 25 '15 at 20:12
  • There are no reads in this loop. So this caching is of no issue. – SergeyA Aug 25 '15 at 20:12
  • @Eclipse Look up the C++ definition of a "memory location". The `std::pair` is the memory location in this case. – David Schwartz Aug 25 '15 at 20:13
  • @SergeyA You have coded no reads. But the compiler may insert reads *and writebacks* as an optimization. This has actually happened and has broken real code. – David Schwartz Aug 25 '15 at 20:13
  • I haven't coded anything here. :) There might be reads and writebacks, of course. But since there are no simultaneous modifications of .second, it is not an issue. Of course, I assume there will be some sort of synchronization after the loops are done. – SergeyA Aug 25 '15 at 20:15
  • "A memory location is either an object of scalar type or a maximal sequence of adjacent bit-fields all having non-zero width." `i.first` and `i.second` are different memory locations. `i.first` may even consist of multiple memory locations. – Jonathan Wakely Aug 25 '15 at 20:17
  • @SergeyA The optimizer is free to insert a write even where you haven't coded one if it thinks the write is likely and decides an unconditional write is cheaper than a conditional write. This exact optimization has broken real code. – David Schwartz Aug 25 '15 at 20:17
  • @DavidSchwartz @SergeyA If I were to write a wrapper around `std::atomic` for the bool that could be copy constructed (in a non-atomic way), that would definitely be defined behaviour, would it not? Since the vector length is never modified during the parallel loop. – Goobley Aug 25 '15 at 20:17
  • I am aware of what optimizers can do. But in this case, I do not see the issue. Even if one thread overrides the result of another, this should be OK by design, since there is no check to ensure the method runs only once. – SergeyA Aug 25 '15 at 20:20
  • I am starting to think I am missing something. How exactly do you want to parallelize the loop? Will there be multiple threads accessing the same element there? If yes, how would you know which result is important? – SergeyA Aug 25 '15 at 20:21
  • @SergeyA I would just parallelise the loop with an openMP directive on the outer loop. So for each pair of (x,y) the ConstantFunction(x,y) gets evaluated. Depending on its return value it may set the bool to false. The ConstantFunction is a pure function, depending only on the values of x and y, hence it parallelises very nicely. – Goobley Aug 25 '15 at 20:24
  • I think the first paragraph about memory locations is incorrect, but it doesn't matter, because the rest of the answer describes the real problem and is unaffected by whether they are the same memory location or not. The non-atomic write to `i.second` tells the compiler that no other threads will be accessing it (because it would be undefined behaviour if they were accessing it, so anything goes), so it can make the transformation David shows, and end up writing `true` back to the value. – Jonathan Wakely Aug 25 '15 at 20:25
  • I do not understand. Given the vector to be some constant independent of x and y, you will run your function on the same element x * y times, and you will only have the result of the last run. What's the point of it? And given the definition of 'last' being undefined in multithreading, I believe the other problem of possible overrides is moot - you simply have no idea about the result... – SergeyA Aug 25 '15 at 20:26
  • @SergeyA Each object in the vector has its function run with the pair (x,y). If any of those calls "succeeds", its corresponding bool is set to false. If the bool is never set to false then it is left as true (the initial state); during the loop the bool is never (explicitly) set to true for this reason. Then, after this loop (in single-threaded code) the object is deleted if the bool == true. – Goobley Aug 25 '15 at 20:32
  • Ok, then yes, this is a problem. I didn't understand the whole idea at first. One thread can easily override the results of another, as David has shown. Now, to solve this easily you might simply use OS-level atomic functions to access the bool. This would ensure that no compiler optimizations will take place here. – SergeyA Aug 25 '15 at 20:37
  • @JonathanWakely is correct... The "same memory location" issue has nothing to do with this. – Nemo Aug 25 '15 at 21:36
  • @Nemo Both issues exist. By C++ rules, a `std::pair` is a single memory location because it's allocated as a contiguous unit, has a `sizeof`, and so on. It is not permitted to modify a memory location in one thread while another thread is, or might be, accessing it. – David Schwartz Aug 25 '15 at 21:38
  • @JonathanWakely Can you explain why you think it's incorrect? Do you think `i.first` and `i.second` are not the same memory location, even though they're allocated with a single allocation, must be freed as a unit, and you can call `sizeof` on the allocated type? Or do you think it's not a rule that you can't access a memory location in one thread while another thread is, or might be, modifying it? Or do you think his code doesn't access `i.first` while possibly modifying `i.second`? Why exactly do you think I'm wrong? – David Schwartz Aug 25 '15 at 21:49
  • @DavidSchwartz: See http://stackoverflow.com/q/18303212/. Is the answer there wrong? Or do you think that struct is somehow fundamentally different from `std::pair`? – Nemo Aug 26 '15 at 00:06
  • @DavidSchwartz, see my first comment above, quoting the standard. "A memory location is either an object of scalar type or a maximal sequence of adjacent bit-fields all having non-zero width." `i.second` is clearly a scalar object. The complete pair is clearly not a sequence of adjacent bit-fields. It is not a single memory location. It is entirely possible (and required by the standard) to be able to concurrently access individual sub-objects in an array, or a `std::pair`, or a struct, without data races. Reading from `i.first` **does not conflict** with writing to `i.second`. – Jonathan Wakely Aug 26 '15 at 09:50
  • If memory location was determined by the chunks you allocate in then if I have `char a[4097]` and one thread writes to `a[0]` and another writes to `a[4096]` you'd get a data race. Among other problems that would make it very difficult for an implementation to meet the requirements of [container.requirements.dataraces]/2. Individual sub-objects are only part of the same memory location if they are adjacent bit-fields of non-zero width. – Jonathan Wakely Aug 26 '15 at 09:56
  • @JonathanWakely Isn't that for C++ 11 only? – David Schwartz Aug 26 '15 at 16:47
  • C++03 doesn't define "memory location" at all, nor data race, nor what it means for two accesses to conflict. POSIX doesn't define what it means by "memory location" (http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap04.html and http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html talk about it) – Jonathan Wakely Aug 26 '15 at 17:38
  • To repeat @JonathanWakely, the last sentence of your first paragraph is either meaningless (C++03) or flat-out wrong (C++11 and beyond). Any thoughts of fixing it? – Nemo Aug 27 '15 at 12:13
  • @DavidSchwartz "_The optimizer is free to insert a write even where you haven't coded one if it thinks the write is likely_" Why? – curiousguy Jun 28 '18 at 03:28
  • @curiousguy Because that is a legitimate optimization that only causes issues for code that's broken anyway and often results in significant performance improvements. That's the same rationale for permitting any other optimization. Multithreaded code that relies on anything other than semantics guaranteed by that particular platform or threading standard is broken. (This exact optimization has in fact been observed in real world systems and has caused code that relied on the optimizer not adding writes to break. A conditional write may be more expensive than an extra unconditional write.) – David Schwartz Jun 28 '18 at 04:03
  • @DavidSchwartz You keep saying that the code is broken anyway and that the compiler is free to do that. We understand that this is your thesis. Where is your demonstration? – curiousguy Jun 28 '18 at 04:56
  • @curiousguy I don't understand what you're asking for. Are you arguing that some rule prohibits an optimizer from adding writes? If so, what is the source of this rule and where does it come from? Are you claiming no real world optimizer has ever added a write? Or are you claiming that any optimizer that does so is broken? Any code that relies on an optimizer not to add a write is broken because nothing prevents an optimizer from adding writes and that is in fact a legitimate optimization that optimizers actually do. – David Schwartz Jun 28 '18 at 05:04
  • @curiousguy Oh, do you want the demonstration that the code is broken anyway? That's simple -- the code can modify an object while it is, or might be, modified in another thread. Most threading standards define that as UB, so the code is broken. (If you happen to have a threading standard that defines what happens in this case, then of course what it says must happens must happen. But most don't and provide atomic objects or the like for precisely to permit code like this to have well-defined behavior when you use them.) Optimizations that only break code that's broken anyway are always okay. – David Schwartz Jun 28 '18 at 05:07
  • @DavidSchwartz Are you saying that according to the std, compilers can, in general, turn a conditional write to an unconditional write? And that they do? What allows compilers to add a write that wasn't present in the source code? – curiousguy Jun 28 '18 at 05:25
  • @DavidSchwartz "_Oh, do you want the demonstration that the code is broken anyway?_" Exactly. "_the code can modify_" Then prove it. And then, prove it locally, without being super smart and doing advanced logic. Show that the compiler can possibly prove it while doing mostly local analysis and trivial global analysis. – curiousguy Jun 28 '18 at 05:28
  • @curiousguy Yes, I'm saying that compilers can (and some do) turn a conditional write into an unconditional write. What allows them to is the general rule that they can perform any optimization that can't change the behavior of code that isn't broken, the "as-if rule". This code is broken on platforms that prohibit accessing an object in one thread while another thread might be modifying it, so a compiler can make optimizations that change its behavior. And real world compilers do because conditional writes are expensive. – David Schwartz Jun 29 '18 at 00:32
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/173997/discussion-between-curiousguy-and-david-schwartz). – curiousguy Jun 29 '18 at 01:09
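Jonathan Wakely's memory-location point from the thread above can be illustrated directly: `first` and `second` of one pair are distinct memory locations, so two threads may write the two members concurrently without a data race (a sketch; `concurrentMemberWrites` is a made-up name):

```cpp
#include <cassert>
#include <thread>
#include <utility>

// Distinct members of one pair are distinct memory locations, so one thread
// may write p.first while another writes p.second without conflicting.
std::pair<int, bool> concurrentMemberWrites() {
    std::pair<int, bool> p{0, true};
    std::thread a([&p] { p.first = 42; });     // touches only p.first
    std::thread b([&p] { p.second = false; }); // touches only p.second
    a.join();
    b.join();
    return p;
}
```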

You're correct - this will behave exactly as you expect (no data race) on any real-world system. While it is officially undefined behavior according to the C++ standard, real-world systems don't work that way. Here's an answer to a broader question that includes this one.

Here's the text from the standard that says this is officially undefined, though:

> Two expression evaluations conflict if one of them modifies a memory location (1.7) and the other one accesses or modifies the same memory location.

If you want standard-guaranteed safety, you could consider atomic memory accesses.
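One way to get those atomic accesses without writing a wrapper, assuming the number of flags is known up front, is a separate `std::vector<std::atomic<bool>>` that is sized once at construction and never resized (a sketch; `countAlive` is a made-up name):

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// A vector of atomics can be constructed with a size (no copies or moves
// are needed for that), as long as it is never resized afterwards.
std::size_t countAlive(std::size_t n) {
    std::vector<std::atomic<bool>> keep(n);  // default-constructed, never resized
    for (auto& f : keep) f.store(true, std::memory_order_relaxed);
    if (n > 0)
        keep[0].store(false, std::memory_order_relaxed);  // e.g. from a worker thread
    std::size_t alive = 0;
    for (auto& f : keep)
        if (f.load(std::memory_order_relaxed)) ++alive;
    return alive;
}
```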

IanPudney
  • You mean the real-world systems that exist today. But why would you want to write code that might fail horribly on the next CPU, compiler version, or STL implementation? – David Schwartz Aug 25 '15 at 20:08
  • Because it might be more efficient. People often sacrifice portability for efficiency - in the case of the OP, his algorithm may become several times slower if he puts a lock around the bool access, depending on how frequently he writes the bool. – IanPudney Aug 25 '15 at 20:11
  • I've updated my answer to include the middle-ground option of atomic memory access. – IanPudney Aug 25 '15 at 20:12
  • Atomics are out of the question here; they cannot be part of a vector. – SergeyA Aug 25 '15 at 20:13
  • @IanPudney I'm not even sure how to lock this algorithm without putting a mutex in each tuple, since there are a lot of objects in that vector and a global bool lock would probably have worse performance than single threading – Goobley Aug 25 '15 at 20:14
  • @Goobley that is my point exactly - he'd have to use a mutex for each bool or a global mutex. Either way, pretty slow. – IanPudney Aug 25 '15 at 20:15
  • @SergeyA it's more complicated but possible: http://stackoverflow.com/questions/13193484/how-to-declare-a-vector-of-atomic-in-c – IanPudney Aug 25 '15 at 20:15
  • In the question you linked there is only one writer; in this one there are potentially many writers. – Chris Drew Aug 25 '15 at 20:38