4

Does into_inner() return all the relaxed writes in this example program? If so, which concept guarantees this?

extern crate crossbeam;

use std::sync::atomic::{AtomicUsize, Ordering};

fn main() {
    let thread_count = 10;
    let increments_per_thread = 100000;
    let i = AtomicUsize::new(0);

    crossbeam::scope(|scope| {
        for _ in 0..thread_count {
            scope.spawn(|| {
                for _ in 0..increments_per_thread {
                    i.fetch_add(1, Ordering::Relaxed);
                }
            });
        }
    });

    println!(
        "Result of {}*{} increments: {}",
        thread_count,
        increments_per_thread,
        i.into_inner()
    );
}

(https://play.rust-lang.org/?gist=96f49f8eb31a6788b970cf20ec94f800&version=stable)

I understand that crossbeam guarantees that all threads are finished, and since ownership goes back to the main thread, I also understand that there will be no outstanding borrows. But the way I see it, there could still be outstanding pending writes, if not on the CPUs then in the caches.

Which concept guarantees that all writes are finished and all caches are synced back to the main thread when into_inner() is called? Is it possible to lose writes?

Shepmaster
  • 388,571
  • 95
  • 1,107
  • 1,366
Alexander Torstling
  • 18,552
  • 7
  • 62
  • 74

3 Answers

5

Does into_inner() return all the relaxed writes in this example program? If so, which concept guarantees this?

It's not into_inner that guarantees it, it's join.

What into_inner guarantees is that either some synchronization has been performed since the final concurrent write (join of thread, last Arc having been dropped and unwrapped with try_unwrap, etc.), or the atomic was never sent to another thread in the first place. Either case is sufficient to make the read data-race-free.
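
Not the question's program, but a minimal sketch of the Arc/try_unwrap path mentioned above (the thread count and names are invented for illustration):

use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

fn main() {
    let counter = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                counter.fetch_add(1, Ordering::Relaxed);
            })
        })
        .collect();

    // join synchronizes with the completion of each spawned thread.
    for h in handles {
        h.join().unwrap();
    }

    // Every clone of the Arc has been dropped by a finished thread, so
    // try_unwrap succeeds and into_inner is a plain, data-race-free read.
    let value = Arc::try_unwrap(counter).unwrap().into_inner();
    assert_eq!(value, 4);
}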

Crossbeam documentation is explicit about using join at the end of a scope:

This [the thread being guaranteed to terminate] is ensured by having the parent thread join on the child thread before the scope exits.

Regarding losing writes:

Which concept guarantees that all writes are finished and all caches are synced back to the main thread when into_inner() is called? Is it possible to lose writes?

As stated in various places in the documentation, Rust inherits the C++ memory model for atomics. In C++11 and later, the completion of a thread synchronizes with the corresponding successful return from join. This means that by the time join completes, all actions performed by the joined thread must be visible to the thread that called join, so it is not possible to lose writes in this scenario.

In terms of atomics, you can think of a join as an acquire read of an atomic that the thread performed a release store on just before it finished executing.
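
As a rough model only (a hand-rolled stand-in for join, built from a release store and an acquire load; the DONE flag is invented for this sketch and is not how the standard library actually implements join):

use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::thread;

static COUNTER: AtomicUsize = AtomicUsize::new(0);
static DONE: AtomicBool = AtomicBool::new(false);

fn main() {
    thread::spawn(|| {
        for _ in 0..1000 {
            COUNTER.fetch_add(1, Ordering::Relaxed);
        }
        // The worker "finishes": a release store, standing in for the
        // end of the thread.
        DONE.store(true, Ordering::Release);
    });

    // The "join": spin until an acquire load observes the release store.
    while !DONE.load(Ordering::Acquire) {}

    // Acquire/release establish happens-before, so every relaxed
    // increment above is now visible here.
    assert_eq!(COUNTER.load(Ordering::Relaxed), 1000);
}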

user4815162342
  • 141,790
  • 18
  • 296
  • 355
  • I was trying to answer this question, but was missing the evidence that there is a synchronize-with relation upon a thread join. Can you provide additional sources on this in specific? Am I searching with the wrong key terms? – E_net4 Oct 16 '17 at 20:38
  • @E_net4 It's documented [in C++](http://en.cppreference.com/w/cpp/thread/thread/join), immediately in the second sentence. (Similar wording is also present in the C++ standard.) My reasoning for why this applies to Rust is: a) as noted in the answer, Rust documents that it inherits C++'s memory model, and b) without such a fence, `into_inner` would constitute a data race and would definitely not be safe in the Rust sense. – user4815162342 Oct 16 '17 at 20:43
  • Yeah, the documentation on `join` is nice. I certainly understood a) and b), but if we consider [this version](https://play.rust-lang.org/?gist=65ff0be362120d13bbd735f306a31acd&version=stable) of the program, where the value is not consumed with `into_inner`, there had to be some mechanism that makes the main thread see the outcome of all counter incrementations. – E_net4 Oct 16 '17 at 20:49
  • 1
    Thank you. I also found https://stackoverflow.com/a/43102737/83741 and http://en.cppreference.com/w/cpp/atomic/atomic_thread_fence informative. The synchronizes-with relation is the key detail which I had missed. Your linked article is great. Thanks again. – Alexander Torstling Oct 20 '17 at 06:53
  • I used the ASM view in the playground to look at the specific instructions generated, and it seems as if https://doc.rust-lang.org/core/sync/atomic/fn.fence.html is used, which results in an mfence instruction on x86: http://x86.renejeschke.de/html/file_module_x86_id_170.html – Alexander Torstling Oct 20 '17 at 12:28
  • 1
    @AlexanderTorstling I've now submitted [an issue](https://github.com/rust-lang/rust/issues/45467) for this to be documented explicitly. – user4815162342 Oct 23 '17 at 10:57
1

I will include this answer as a potential complement to the other two.

The kind of inconsistency that was mentioned, namely whether some writes could still be missing before the final read of the counter, is not possible here. It would be undefined behaviour if writes to a value could be postponed until after its consumption with into_inner. However, there are no unexpected race conditions in this program, even without the counter being consumed with into_inner, and even without the help of crossbeam scopes.

Let us write a new version of the program without crossbeam scopes and where the counter is not consumed (Playground):

use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

fn main() {
    let thread_count = 10;
    let increments_per_thread = 100000;
    let i = Arc::new(AtomicUsize::new(0));
    let threads: Vec<_> = (0..thread_count)
        .map(|_| {
            let i = i.clone();
            thread::spawn(move || {
                for _ in 0..increments_per_thread {
                    i.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();

    for t in threads {
        t.join().unwrap();
    }

    println!(
        "Result of {}*{} increments: {}",
        thread_count,
        increments_per_thread,
        i.load(Ordering::Relaxed)
    );
}

This version still works as expected! Why? Because a synchronizes-with relation is established between each ending thread and its corresponding join. And so, as explained in a separate answer, all actions performed by the joined thread must be visible to the calling thread.

One could probably also wonder whether even the relaxed memory ordering constraint is sufficient to guarantee that the full program behaves as expected. This part is addressed by the Rust Nomicon, emphasis mine:

Relaxed accesses are the absolute weakest. They can be freely re-ordered and provide no happens-before relationship. Still, relaxed operations are still atomic. That is, they don't count as data accesses and any read-modify-write operations done to them occur atomically. Relaxed operations are appropriate for things that you definitely want to happen, but don't particularly otherwise care about. For instance, incrementing a counter can be safely done by multiple threads using a relaxed fetch_add if you're not using the counter to synchronize any other accesses.

The mentioned use case is exactly what we are doing here. Each thread is not required to observe the incremented counter in order to make decisions, and yet all operations are atomic. In the end, the thread joins synchronize with the main thread, implying a happens-before relation and guaranteeing that the operations are made visible there. Since Rust adopts the same memory model as C++11 (this is implemented by LLVM internally), we can look at the C++ std::thread::join function, whose documentation states that "The completion of the thread identified by *this synchronizes with the corresponding successful return". In fact, the very same example in C++ is available on cppreference.com as part of the explanation of the relaxed memory ordering:

#include <vector>
#include <iostream>
#include <thread>
#include <atomic>

std::atomic<int> cnt = {0};

void f()
{
    for (int n = 0; n < 1000; ++n) {
        cnt.fetch_add(1, std::memory_order_relaxed);
    }
}

int main()
{
    std::vector<std::thread> v;
    for (int n = 0; n < 10; ++n) {
        v.emplace_back(f);
    }
    for (auto& t : v) {
        t.join();
    }
    std::cout << "Final counter value is " << cnt << '\n';
}
E_net4
  • 27,810
  • 13
  • 101
  • 139
0

The fact that you can call into_inner (which consumes the AtomicUsize) means that there are no more borrows on that backing storage.

Each fetch_add is an atomic operation with Relaxed ordering, so once the threads are complete there shouldn't be anything that changes it (if there were, that would be a bug in crossbeam).
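
As a toy illustration only (not the question's program): once the atomic is owned exclusively, into_inner is just a plain move of the wrapped value:

use std::sync::atomic::{AtomicUsize, Ordering};

fn main() {
    // This atomic is never shared with another thread.
    let a = AtomicUsize::new(0);
    a.fetch_add(1, Ordering::Relaxed);

    // into_inner takes the atomic by value, so the compiler has already
    // proven there are no outstanding borrows; the read cannot race with
    // anything.
    assert_eq!(a.into_inner(), 1);
}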

See the description of into_inner for more info.

Timidger
  • 1,199
  • 11
  • 15
  • This doesn't quite answer where the code which syncs the caches resides. Is there a memory barrier when exiting a crossbeam scope? – Alexander Torstling Oct 16 '17 at 18:17
  • According to [this section of the nomicon](https://doc.rust-lang.org/nomicon/atomics.html#relaxed), these writes are still atomic so there shouldn't be anything in the cache that will cause any issues. – Timidger Oct 16 '17 at 18:45
  • 1
    It also says that Relaxed ordering establishes no happens-before relationships. The fact that writes are atomic doesn't mean that they will be seen by other threads. – Alexander Torstling Oct 16 '17 at 19:00
  • 1
    @AlexanderTorstling Relaxed ordering doesn't automatically establish happens-before, but you can still introduce an explicit fence. For example, `x = bla.load(Relaxed); if !x.is_null() { fence(Acquire); }` is a valid pattern. [This atomic library](http://preshing.com/20130505/introducing-mintomic-a-small-portable-lock-free-api/) is completely based on relaxed reads/writes and explicit fences. – user4815162342 Oct 16 '17 at 19:07