-1

Problem

  • Assume `static std::atomic<double> dataX = 0.0;` is defined in a .cpp file (module).
  • In that module, two separate functions are defined.
  • Each function is invoked and run by its own thread; inside each function is a continuous loop that keeps running until a break statement is triggered by a predicate.
  • While the two functions run, one function is responsible for writing data to dataX and the other for reading from dataX and then posting the value to a container.
  • These two functions execute on two threads as described, and each thread sleeps for a very short duration (1 ms) per iteration.
  • As a result, the read and write operations on dataX are tightly coupled and occur within a very small window of CPU time.

How does the C/C++ runtime behave under this condition for std::atomic<double>?

Can we have any guarantee about the values written to and read from std::atomic<double> dataX under this condition?
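
A minimal sketch of the setup described above (everything except dataX — the function names, the loop bodies, the stop flag, and main — is illustrative, not taken from the real code):

```cpp
// Sketch only: one writer thread and one reader thread sharing a single
// std::atomic<double>. All identifiers other than dataX are made up.
#include <atomic>
#include <chrono>
#include <functional>
#include <thread>
#include <vector>

static std::atomic<double> dataX{0.0};
static std::atomic<bool> stopRequested{false};  // stands in for the break predicate

void writerLoop() {
    double value = 0.0;
    while (!stopRequested.load()) {
        dataX.store(value);   // atomic store: the reader can never see a torn value
        value += 1.0;
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
}

void readerLoop(std::vector<double>& container) {
    while (!stopRequested.load()) {
        container.push_back(dataX.load());  // atomic load: always some value a store wrote
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
}

int main() {
    std::vector<double> samples;
    std::thread writer(writerLoop);
    std::thread reader(readerLoop, std::ref(samples));
    std::this_thread::sleep_for(std::chrono::seconds(1));
    stopRequested.store(true);
    writer.join();
    reader.join();
}
```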

  • What kind of guarantee do you want? It's always atomic; the reader will always see a value that was written by the writer. You will never see a "garbage" value that has a mix of bytes from two separate stores. Other than that, there is no synchronization so you might as well use `memory_order_relaxed` to make the stores cheaper. As far as performance, for some experiments on x86 with having one thread spam writes as fast as it can, and another thread read the same location as fast as it can, see [this Q&A](https://stackoverflow.com/q/45602699/224132). – Peter Cordes Oct 30 '17 at 06:40
  • Specifically for `double` as opposed to `uint64_t`, gcc and clang often make sub-optimal code for `atomic`, but it's only a few extra ALU operations moving data between integer and FP registers. https://stackoverflow.com/questions/45055402/atomic-double-floating-point-or-sse-avx-vector-load-store-on-x86-64. – Peter Cordes Oct 30 '17 at 06:42
  • @PeterCordes I got unexpected values from that double `dataX`, therefore I used std::try_lock and unlock blocks when the store and load happen .. also used `memory_order_release` and `memory_order_acquire` respectively in order to prevent reordering .. are locks unnecessary here? –  Oct 30 '17 at 06:44
  • 1
    Reordering with what? Do you mean that you "lose" some updates because the writer overwrites a value before the reader has seen it? Yes, of course that happens. If that's not OK, a single-producer single-consumer queue (circular buffer) is probably the best choice here; that can be wait-free with no locking, and doesn't block the writing thread if the reader isn't ready yet, but the reader can catch up. With a power-of-2 size, it's very fast. Fall back to a condition var or something when it's full or empty. See https://stackoverflow.com/questions/990627/ for an example. – Peter Cordes Oct 30 '17 at 06:51
  • @PeterCordes that's what I expected.. thank you.. One more thing: is using try_lock with an atomic bad? –  Oct 30 '17 at 06:57
  • If you're using a lock to protect the shared variable, there's no point making it `atomic<>`. I'd suggest using a lock-free queue, though. With a fixed-size queue, the writer only has to wait for a lock if the reader doesn't keep up, and should be *very* low overhead. There are some good implementations. – Peter Cordes Oct 30 '17 at 07:02
  • @PeterCordes if you don't mind can you suggest a link to a better implementation for Windows/Visual C++ ? This link better? https://software.intel.com/en-us/articles/single-producer-single-consumer-queue –  Oct 30 '17 at 07:04
  • Boost has one. I assume the implementation is decent. http://www.boost.org/doc/libs/1_65_1/doc/html/boost/lockfree/spsc_queue.html. Use `boost::lockfree::capacity<>` to set the size at compile-time. – Peter Cordes Oct 30 '17 at 07:06
  • @PeterCordes cannot use boost .. :( how about that Intel link ? –  Oct 30 '17 at 07:07
  • IDK what implementations are most efficient. Look at them yourself to see if they're implemented efficiently. – Peter Cordes Oct 30 '17 at 07:08
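
For reference, a minimal single-producer/single-consumer ring buffer along the lines Peter Cordes suggests above might look like the following sketch (an illustration under the stated assumptions, not code from the thread: exactly one thread calls push, exactly one thread calls pop, and the capacity is a power of two):

```cpp
#include <atomic>
#include <cstddef>

// Wait-free SPSC ring buffer sketch. Only valid with exactly one producer
// thread and exactly one consumer thread. Power-of-two capacity makes the
// index wrap a cheap bit-mask.
template <typename T, std::size_t CapacityPow2>
class SpscQueue {
    static_assert(CapacityPow2 != 0 && (CapacityPow2 & (CapacityPow2 - 1)) == 0,
                  "capacity must be a power of two");
public:
    bool push(const T& value) {  // producer thread only
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t tail = tail_.load(std::memory_order_acquire);
        if (head - tail == CapacityPow2) return false;      // full
        buffer_[head & (CapacityPow2 - 1)] = value;
        head_.store(head + 1, std::memory_order_release);   // publish the slot
        return true;
    }

    bool pop(T& out) {           // consumer thread only
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t head = head_.load(std::memory_order_acquire);
        if (head == tail) return false;                      // empty
        out = buffer_[tail & (CapacityPow2 - 1)];
        tail_.store(tail + 1, std::memory_order_release);    // free the slot
        return true;
    }

private:
    T buffer_[CapacityPow2];
    std::atomic<std::size_t> head_{0};  // written only by the producer
    std::atomic<std::size_t> tail_{0};  // written only by the consumer
};
```

With this shape, the writer publishes every sample instead of overwriting a single dataX, so no updates are lost unless the buffer fills; the caller decides what to do when push returns false (spin, sleep, or fall back to a condition variable as mentioned above).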

2 Answers

1

It depends. When you perform atomic operations with the correct acquire-release semantics, or if they are sequentially consistent, every write that happened before (and including) this atomic write operation will be visible to any thread that reads that value from the atomic variable.

What is not guaranteed is that your reader thread sees every change to your atomic. It may well be that the writer thread writes to the atomic multiple times before the reader thread gets any chance to read it.

In general, what you are trying to accomplish sounds suspicious. You should consider using one of the usual, well-tested standard synchronization primitives, such as std::condition_variable.
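
As an illustration of that visibility guarantee (a small sketch, with made-up names and values):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

double payload = 0.0;            // plain, non-atomic data
std::atomic<bool> ready{false};

void writer() {
    payload = 42.0;                                 // (1) ordinary write
    ready.store(true, std::memory_order_release);   // (2) publish
}

void reader() {
    while (!ready.load(std::memory_order_acquire)) { /* spin */ }
    // The acquire load saw the release store, so the write (1) is
    // guaranteed to be visible here; no data race, no torn value.
    assert(payload == 42.0);
}

int main() {
    std::thread t1(writer), t2(reader);
    t1.join();
    t2.join();
}
```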

Jodocus
  • Thanks for the quick reply.. If I use a mutex lock/unlock, is it better? Or is `std::condition_variable` better than locking? –  Oct 30 '17 at 06:34
  • Also, another thing: these two executions should be independent. I used `std::try_lock` with the atomic double, only try and unlock .. if I used a std::condition_variable, shouldn't I use a wait procedure? –  Oct 30 '17 at 06:38
0

Writing an atomic means your thread must have sole ownership of the cache line where your data resides. So the CPU (assuming Intel/AMD) sends out a request for ownership of the line and must wait for every other part of the computer to release its copies.

This could be very slow, since the cache line could currently reside in a peripheral such as a graphics card. In this case your writer and reader are the only two processes involved, which makes it a bit more likely that the line will be somewhere in the cache, so the cost is only a delay of roughly 12 to 150 cycles every 1 ms.
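
One rough way to observe that cache-line transfer cost is a core-to-core "ping-pong" micro-benchmark like the sketch below (not from the answer; the numbers vary a lot with CPU model and which cores the threads land on):

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

int main() {
    std::atomic<int> ball{0};
    constexpr int kRounds = 1000000;

    // The second thread waits for the "ball", then throws it back.
    std::thread other([&] {
        for (int i = 0; i < kRounds; ++i) {
            while (ball.load(std::memory_order_acquire) != 1) { /* spin */ }
            ball.store(0, std::memory_order_release);
        }
    });

    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < kRounds; ++i) {
        ball.store(1, std::memory_order_release);
        while (ball.load(std::memory_order_acquire) != 0) { /* spin */ }
    }
    const auto t1 = std::chrono::steady_clock::now();
    other.join();

    const double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    // Each round trip moves the cache line between the two cores twice.
    std::printf("~%.0f ns per round trip\n", ns / kRounds);
}
```

When both cores share a last-level cache, the per-transfer cost this reports is broadly in line with the rough cycle counts mentioned above.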

Surt
  • Are you suggesting a delay between the write and read operations? –  Oct 30 '17 at 06:47
  • GPUs don't participate in cache-coherency protocols. This is why video memory is normally mapped USWC, not WB. It can be very slow with multi-socket CPUs, though, if a core on another socket owns the cache line. – Peter Cordes Oct 30 '17 at 06:56
  • 1
    With a write-only writer and a read-only reader, the writer's cache line will flip between Modified and Shared states. The reader's cache line will flip between Invalid and Shared. It's the reader that has to pay most of the cost waiting for the Invalid cache line. (In Intel CPUs, this requires the writer core to write-back to L2/L3, I think.) Anyway, this is the case with *any* store, not just an atomic one. The issue is the sharing causes evictions and memory-order mis-speculation. – Peter Cordes Oct 30 '17 at 06:59
  • 1
    The scenario you describe usually applies to seq_cst writes only (on a TSO architecture). Those (usually) set a full fence and require the store buffer to be flushed. A write using a weaker ordering does not have to wait for the cache line to be in Exclusive/Modified state. The written value goes into the store buffer and the thread can move on; similar to non-atomic writes. – LWimsey Oct 30 '17 at 13:42
  • @PeterCordes, I was pretty sure that was what the Nvidia guy said at CppCon 2017 about his new architecture; guess I have to watch it again. – Surt Oct 30 '17 at 20:50
  • @PeterCordes I was convinced it was the writer who had to pay most, as he has to be sure there are no copies on any unit in the cache-coherency realm, whereas the reader just has to wait for the first copy of the data he gets. – Surt Oct 30 '17 at 20:52
  • Oh, yes seq_cst writes will suck for the writer too, because later loads can't happen until it commits. As LWimsey says, the store buffer hides most of the penalty of waiting for the RFO replies to let it flip from Shared to Modified. So it's probably more like near-equal cost for reader/writer with seq_cst. For anything weaker (release or relaxed), the writer doesn't have to wait for the store to commit. – Peter Cordes Oct 31 '17 at 03:29