
If I write something like this:

std::atomic<bool> *p = new std::atomic<bool>(false); // At the beginning of the program  

//...  

void thread1()
{
    while (!(*p)) {
        // Do something
    }
}

//...  

void thread2()  
{  
    //...  
    *p = true;  
    //...
}  

thread1 and thread2 will run simultaneously. The value of p itself never changes after it is initialized. Is the dereference operation safe in this case? I want to avoid using an atomic pointer for performance reasons.

Zizheng Tai
  • If the value of p never changes, why do you not use a plain bool variable? The operation is as safe as that, so yes, it is, at least when p is int-aligned. – PMF Feb 01 '14 at 07:34
  • @PMF p is just a pointer, the value of p doesn't change but *p does change. – Zizheng Tai Feb 01 '14 at 07:41
  • @ZizhengTai: Yeah, but why don't you just use a bool? It makes your code quite a bit simpler, because no dereferencing is needed. – PMF Feb 01 '14 at 07:47
  • @PMF Sorry, my bad, but the code I posted is just a simplification. Actually I need to put a lot of atomic bools into a std::list, but since std::list needs copy construction of the object (which is deleted for any std::atomic), it seems achievable only through pointers. – Zizheng Tai Feb 01 '14 at 07:53
  • @ZizhengTai: Yeah putting atomics inside containers is a pain. I suggest inheriting from std::atomic and making the copy-constructor do what you expect it to do, and using that class instead. It's much better than this. – user541686 Feb 01 '14 at 08:18
  • [Putting `std::atomic` in a container is just fine if you do nothing that requires them to be copied or moved](http://coliru.stacked-crooked.com/a/cfc15044d41fbaf7). If you are having a problem using an `atomic` in a container, post a question. – Casey Feb 02 '14 at 00:53
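
To make the container workaround from the last two comments concrete, here is a minimal sketch, not a definitive implementation; the wrapper name copyable_atomic is invented for illustration, and its copy operations are not themselves atomic, so objects must only be copied while no other thread is accessing them:

#include <atomic>
#include <list>

// Sketch: a std::atomic<T> that can live in standard containers.
// Copying loads from the source and stores into the destination; the copy
// as a whole is NOT atomic, so only copy during single-threaded phases.
template <typename T>
struct copyable_atomic : std::atomic<T> {
    copyable_atomic() : std::atomic<T>(T{}) {}
    copyable_atomic(T v) : std::atomic<T>(v) {}
    copyable_atomic(const copyable_atomic& other) : std::atomic<T>(other.load()) {}
    copyable_atomic& operator=(const copyable_atomic& other) {
        this->store(other.load());
        return *this;
    }
    using std::atomic<T>::operator=;   // keep assignment from a plain T
};

int main() {
    std::list<copyable_atomic<bool>> flags(10);  // ten flags, no pointers needed
    flags.front() = true;                        // ordinary atomic store
}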

4 Answers

4

Yes, it is safe. You can't have a data race without at least one thread modifying the shared variable. Since neither thread modifies p, there is no race on the pointer itself; the object it points to is a std::atomic<bool>, and concurrent accesses to atomics are well-defined.

Casey
  • @kuroineko You are mistaken. `*p` ("the bool") is atomic, so the memory model guarantees that any write sequenced before the write to `*p` in `thread2` happens-before any read in another thread that is sequenced after a read of `*p` that synchronizes with that write. If the "slave" observes the write of `true` to `*p`, it is guaranteed to see any other updates the "master" made to "another variable" before the corresponding write. – Casey Feb 02 '14 at 00:45
  • I'm sorry, I misinterpreted the OP's question as "can you poll this variable without atomic access", as you can see in my answer. Of course as long as the bool is atomic you can poll it, though I still think it's a poor way to present multitasking synchronization to people who don't intend to specialize in multitasking. Edit your question and I'll cancel my downvote right away. – kuroi neko Feb 02 '14 at 08:42
  • @kuroineko I wholeheartedly agree with you that atomics are an extremely low-level tool and that the OP would likely be better served with some kind of task-based parallelism. – Casey Feb 02 '14 at 09:13
0

The code you posted and the question are two different things.

The code will work because you do not dereference a non-atomic pointer to the data you share: what *p reaches is a std::atomic<bool>, so (through operator overloading) the write and the read are sequentially consistent atomic stores and loads. This is probably stronger than necessary (most of the time such a flag only needs a release store paired with an acquire load), but it is safe.

Otherwise, dereferencing a valid non-atomic pointer to anything (including an atomic variable) is safe as long as no other thread modifies the pointed-to data.

Dereferencing a non-atomic pointer while another thread writes to the pointed-to object is still "safe" insofar as it will not crash. There are, however, no formal guarantees that the memory is not garbled (for aligned PODs there is a very practical guarantee due to how processors access memory, though). More importantly, it is unsafe insofar as there are no memory ordering guarantees. When using such a flag, one normally does something like this:

do_work(&buf); // writes data to buf
done = true;   // synchronize

This works as intended with one thread, but it is not guaranteed to work properly in the presence of concurrency. For that, you need a happens-before guarantee; otherwise, the other thread may pick up the update to the flag before the write to the data has become visible.
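
For illustration, here is a hedged sketch of that pattern with explicit release/acquire ordering (the names buf, done, producer and consumer are invented for this example). The release store pairs with the acquire load, which gives the happens-before edge that makes the write to buf visible to the reader:

#include <atomic>
#include <iostream>
#include <thread>

int buf = 0;                    // ordinary, non-atomic data
std::atomic<bool> done(false);  // the flag that publishes it

void producer() {
    buf = 42;                                     // write the data first...
    done.store(true, std::memory_order_release);  // ...then publish it
}

void consumer() {
    while (!done.load(std::memory_order_acquire)) {
        // spin until the flag is observed
    }
    std::cout << buf << '\n';  // guaranteed to print 42
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}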

Damon
  • I'm sorry but I don't quite understand...you say "The code will work because you **do not** dereference a non-atomic pointer", but I thought `std::atomic *` is just a pointer (as any pointer is). Do you mean it overloads the `*` operator? And reading a shared memory block in one thread while writing it in another simultaneously **will not** cause the program to crash, but might produce dirty data, right? – Zizheng Tai Feb 01 '14 at 16:57
  • `std::atomic` is a class template with specializations for many types (among these `bool`). It has overloads for `operator=` and `operator T`, which means if you assign to it (directly or via a pointer that you dereference) you are calling one of these overloaded functions. These, in turn, call functions like e.g. `atomic_load` which is a shorthand for `atomic_load_explicit(memory_order_seq_cst)`. So yes, this is safe. – Damon Feb 01 '14 at 20:23
  • Dereferencing a pointer that is concurrently modified will not crash (presumed that the pointer is valid, of course). But it _may_ (at least in theory) return a garbled result. In practice, all modern CPUs operate on cache lines, it is not possible to write something smaller than a complete cache line. Also, in practice, on all mainstream CPUs, `bool` is either 1 or 0, so it's not possible to have some other "garbage value" at all. However, note that without proper atomic access, you do not have happens-before guarantee. That means that things may not be _seen_ in the order that they happened. – Damon Feb 01 '14 at 20:26
0

Dereferencing (that is, reading the pointer value) is atomic on Intel architectures. Furthermore, since the pointer is constant, I guess it is going to be correct not only on Intel/AMD. However, look at this post for more information.

Clarification: on other architectures it is possible that a thread is switched out while writing a pointer, when only part of the pointer has been modified, so the pointer read by the other thread would be invalid.

On Intel this cannot happen if the pointer is aligned in memory.

Furthermore, since *p is a std::atomic<bool>, it already implements all that is needed (native atomic instructions, memory fences).

Sigi
  • In this example, the single write to `p` happens before the threads are created and therefore synchronizes with their constructors. There are no concurrent writes to `p` while the threads are running, so accesses to `p` do not need to be atomic regardless of the CPU architecture. – Casey Feb 02 '14 at 00:57
-1

It depends on what is around your two accesses. If the master writes some data just before setting the boolean, the slave needs a memory barrier to make sure it does not read said data before it reads the boolean.

Maybe for now your thread is just waiting on this boolean to exit, but if one day you decide the master should, for instance, pass a termination status to the slaves, your code might break.
If you come back 6 months later and modify this piece of code, are you certain you will remember that the area beyond your slave loop is a no-shared-read zone and the one before your master's boolean update a no-shared-write zone?

At any rate, your boolean would need to be volatile, or else the compiler might optimize it away. Or worse, your coworker's compiler might, while you'll be off laying another piece of unreliable code.

It is a well-known fact that volatile variables are usually not good enough for thread synchronization, because they don't implement memory barriers, as in this simple example:

master :

// previous value of x = 123
x = 42;
*p = true;

bus logic on the slave's processor (order in which the master's writes arrive):

write *p = true

slave:

while (!*p) { /* whatever */ }
the_answer = x; // <-- boom ! the_answer = 123

bus logic on the slave's processor (continued):

write x = 42 // too late...

(a symmetric problem arises if the master's bus writes are scheduled out of order)

Of course, chances are you will never witness such a rare occurrence on your particular desktop computer, just as a program that vandalizes its own memory can, by chance, run without ever crashing.

Nevertheless, software written with such leaky synchronization is a ticking time bomb. Compile and run it long enough on a variety of bus architectures and one day... Ka-boom!


As a matter of fact, C++11 is hurting multiprocessor programming a lot by letting you create tasks as if there were nothing to it, while at the same time offering nothing but crappy atomics, mutexes and condition variables to handle the synchronization (and the bloody awkward futures, of course).

The simplest and most efficient way to synchronize tasks (especially worker threads) is to have them process messages on a queue. That is how drivers and real-time software work, and so should any multiprocessor application unless some extraordinary performance requirements show up.
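
As a rough illustration of that idea, here is a minimal blocking message queue built on a mutex and a condition variable; the class name and interface are invented for this sketch, not taken from any library:

#include <condition_variable>
#include <mutex>
#include <queue>
#include <utility>

// Minimal blocking queue: a worker sleeps in pop() until a message arrives.
template <typename T>
class message_queue {
public:
    void push(T msg) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(msg));
        }
        cv_.notify_one();
    }
    T pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !queue_.empty(); });
        T msg = std::move(queue_.front());
        queue_.pop();
        return msg;
    }
private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<T> queue_;
};

A worker then simply loops on pop() and exits when it receives a designated "quit" message; no busy waiting and no hand-rolled memory ordering.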

Forcing programmers to control multitasking with glorified flags is stupid. You need to understand very clearly how the hardware works to play around with atomic counters.
The pedantic clique of C++ is again forcing every man and his dog to become experts in yet another field just to avoid writing crappy, unreliable code.

And as usual, you will have the gurus spouting their "good practices" with an indulgent smile, while people burn megajoules of CPU power in stupid spinning loops inside broken homebrewed queues in the belief that "no-wait" synchronization is the alpha and omega of efficiency.

And this performance obsession is a non-issue. "Blocking" calls consume nothing but crumbs of the available computational power, and there are a number of other factors that hurt performance by a couple of orders of magnitude more than operating system synchronization primitives (the absence of a standard way to pin tasks to a given processor, for a start).

Consider your thread1 slave. Accessing an atomic bool will throw a handful of sand into the bus and cache cogwheels, slowing down this particular access by a factor of about 20. That is a few dozen cycles wasted. Unless your slave is just twiddling its virtual thumbs inside the loop, this handful of cycles will be dwarfed by the thousands or millions a single loop iteration will last. Also, what will happen if your slave is done working while its brother slaves are not? Will it spin uselessly on this flag and waste CPU, or block on some mutex?
It is exactly to address these problems that message queues were invented.

A proper OS call like a message queue read would maybe consume a couple of hundred cycles. So what?
If your slave thread is just there to increment 3 counters, then it is your design that is at fault. You don't launch a thread to move a couple of matchsticks, just as you don't allocate your memory byte by byte, even in such a high-level language as C++.

Provided you don't use threads to munch breadcrumbs, you should rely on simple and proven mechanisms like waiting queues, semaphores or events (picking the POSIX or Microsoft ones for lack of a portable solution), and you will not notice any impact on performance whatsoever.

EDIT: more on system call overhead

Basically, a call to a waiting queue will cost a few microseconds.

Assuming your average worker crunches numbers for 10 to 100 ms, the system call overhead will be indiscernible from background noise, and thread termination responsiveness will stay within acceptable limits (< 0.1 s).

I recently implemented a Mandelbrot set explorer as a test case for parallel processing. It is in no way representative of all parallel processing cases, but still I noticed a few interesting things.

On my Intel i3 (2 cores / 4 hardware threads @ 3.1 GHz), using one worker per CPU, I measured the gain factor (i.e. the ratio of execution times using 1 core vs. 4 cores) of parallelizing pure computation (i.e. with no data dependency whatsoever between workers).

  • pinning the threads to one core each (instead of letting the OS scheduler move them from one core to another) boosted the ratio from 3.2 to 3.5 (out of a theoretical max of 4); see the affinity sketch after this list

  • besides pinning threads to distinct cores, the most notable improvements were due to optimizations of the algorithm itself (more efficient computations and better load balancing).

  • the cost of about 1000 C++11 mutex locks used to let the 4 workers draw from a common queue amounted to 7 ms, i.e. 7 µs per call.
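
The affinity sketch referenced in the first bullet, assuming Linux and the pthread-based std::thread of GCC/Clang (pthread_setaffinity_np is non-portable; Windows uses SetThreadAffinityMask instead, and the code may require -pthread and, on some setups, -D_GNU_SOURCE):

#include <pthread.h>  // Linux-specific affinity API
#include <sched.h>
#include <thread>

// Pin a std::thread to a single core so the scheduler stops migrating it.
// Returns 0 on success.
int pin_to_core(std::thread& t, int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
}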

I can hardly imagine a high performance design doing more than 1000 synchronizations per second (or else your time might be better spent working to improve the design), so basically your "blocking" calls would cost well under 1% of the power available on a rather low-cost PC.
The choice is yours, but I am not sure implementing raw atomic objects right from the start will be the decisive factor in performance.

I would advise starting with simple queues and doing some benchmarking. You can use the POSIX pthread interface, or take for instance this pretty good sample as a base for a conversion to C++11.

You can then debug your program and evaluate the performance of your algorithms in a synchronization-bug-free environment.

If the queues prove to be the real CPU hogs and your algorithm cannot be refactored to avoid excessive synchronization calls, it should be relatively easy to switch to whatever spinlocks you assume to be more efficient, especially if your computations have been streamlined and data dependencies sorted out beforehand.
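
If it ever comes to that, a minimal spinlock can be built from std::atomic_flag with acquire/release ordering; this is a sketch of the general technique, not a recommendation:

#include <atomic>

// Busy-waiting lock: only worthwhile when the critical section is tiny and
// contention is low; otherwise a mutex (which sleeps) is the better citizen.
class spinlock {
public:
    void lock() {
        while (flag_.test_and_set(std::memory_order_acquire)) {
            // spin
        }
    }
    void unlock() {
        flag_.clear(std::memory_order_release);
    }
private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};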

P.S.: if that's not a trade secret, I would be glad to hear more about this algorithm of yours.

kuroi neko
  • Thank you! Actually I'm stuck in a predicament...the whole story is: The flags (yeah, there are more than one of them) are used to tell some worker threads to exit. Truth be told, part of my code is a bit nasty about memory, because this actually is an EXTREMELY PERFORMANCE-SENSITIVE program used to do scientific computation. Its nature requires that the worker threads use mutexes and atomic things as little as possible, but user experience requires a way to terminate worker threads responsively from within the main thread. I wonder if there is any way to efficiently achieve this. – Zizheng Tai Feb 01 '14 at 12:06
  • So I think waiting queues and the other mechanisms you mentioned may be a bit too expensive for this particular purpose. – Zizheng Tai Feb 01 '14 at 12:09
  • I can hardly be a judge of that without having a look at your algorithms or at least an idea of the kind of computations they do. Anyway, you might want to see my edit for a rough performances estimation on a working test case. – kuroi neko Feb 01 '14 at 18:51
  • XD Yeah, it's not a trade secret. Simply put, the algorithm uses depth-first search to generate ordered number combinations, and then generates (mathematical) graphs magically out of the numbers with some theorems applied. After a graph is generated at the leaf level, the algorithm tests some properties of the graph. What I'm doing is to tell the algorithm to check the atomic flag at several nodes of the DFS tree, and once it finds the flag is set, it ends the search. – Zizheng Tai Feb 01 '14 at 19:14
  • Though it is hard to describe the whole picture clearly here, I guess you can see that the deeper the checking of the flag is located, the less responsive the program will be when the user presses a "Pause" button. And to be honest, the whole algorithm is a closely combined, intricate mess; also, the nature of DFS makes it bloody tricky to try to insert a queue, event loop, etc. into the whole structure, since the tree is not going to quit to the "outer space" where the event loop and its buddies live before all the computation is finished. – Zizheng Tai Feb 01 '14 at 19:23
  • You could kill the process(es) out of their mathematical slumber in no time. That might be a safer way out than a host of leaky flags ;). Good luck wielding the atomic power, then. – kuroi neko Feb 01 '14 at 19:31
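
Following up on that exchange, here is a hedged sketch of the cancellation idea discussed above; dfs, its branching factor and the depth parameter are invented placeholders. The stop flag is polled with memory_order_relaxed, which is just an ordinary load on x86, since the search only needs to notice the request eventually and does not synchronize any data through the flag:

#include <atomic>

std::atomic<bool> stop_requested(false);

// Depth-first search that polls the flag at each node and unwinds
// as soon as a stop has been requested.
void dfs(int depth) {
    if (stop_requested.load(std::memory_order_relaxed))
        return;                      // user pressed "Pause": unwind quickly
    if (depth == 0)
        return;                      // leaf: test the graph's properties here
    for (int i = 0; i < 3; ++i)      // placeholder branching factor
        dfs(depth - 1);
}

int main() {
    // The GUI / main thread would do:  stop_requested.store(true);
    dfs(10);
}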