ASIO: getting deadlock when using several strands and threads with one io_context

Question

I am compiling with Mingw64 on Windows the latest version of ASIO.

I have a sandbox code for accepting tcp connections. I use one context, a strand per acceptor and a socket and 2 threads (I have read in the documentation that posting into two different strands does not guarantee concurrent invocation). For some reason I get a deadlock at the end of execution and I don't know why it happens. It does not happen if:

I use 1 thread and one common context
Sometimes when I use 1 context and 2 threads without strands
I use 2 different contexts with 2 different threads without strands
when some time passes between std::future synchronization and a request to stop a server
Sometimes when I post acceptor.cancel() to its executor explicitly

Deadlock also does not happen if I close the acceptor.

I have failed to find any relevant info in the documentation which might explain the reason for such behavior. And I don't want to ignore it since it might result in unpredictable problems.

Here is my sandbox code:


#include <asio.hpp>
#include <iostream>
#include <sstream>
#include <functional>

constexpr const char localhost[] = "127.0.0.1";
constexpr unsigned short port = 12000;

void runContext(asio::io_context &io_context)
{
    std::string threadId{};
    std::stringstream ss;
    ss << std::this_thread::get_id();
    ss >> threadId;
    std::cout << std::string("New thread for asio context: ")
                 + threadId + "\n";
    std::cout.flush();

    io_context.run();

    std::cout << std::string("Stopping thread: ")
                 + threadId + "\n";
    std::cout.flush();
};

class server
{
public:

    template<typename Executor>
    explicit server(Executor &executor)
            : acceptor_(executor)
    {
        using asio::ip::tcp;
        auto endpoint = tcp::endpoint(asio::ip::make_address_v4(localhost),
                                      port);
        acceptor_.open(endpoint.protocol());
        acceptor_.set_option(tcp::acceptor::reuse_address(true));
        acceptor_.bind(endpoint);
        acceptor_.listen();
    }

    void startAccepting()
    {
        acceptor_.async_accept(
                [this](const asio::error_code &errorCode,
                       asio::ip::tcp::socket peer)
                {
                    if (!errorCode)
                    {
                        startAccepting();
//                        std::cout << "Connection accepted\n";
                    }
                    if (errorCode == asio::error::operation_aborted)
                    {
//                        std::cout << "Stopped accepting connections\n";
                        return;
                    }
                });
    }

    void startRejecting()
    {
        // TODO: how to reject?
    }

    void stop()
    {
        // asio::post(acceptor_.get_executor(), [this](){acceptor_.cancel();}); // this line fixes deadlock
        acceptor_.cancel();
        // acceptor_.close(); // this line also fixes deadlock
    }

private:
    asio::ip::tcp::acceptor acceptor_;
};

int main()
{
    setvbuf(stdout, NULL, _IONBF, 0);
    asio::io_context context;

    // run server
    auto serverStrand = asio::make_strand(context);
    server server{serverStrand};
    server.startAccepting();

    // run client
    auto clientStrand = asio::make_strand(context);
    asio::ip::tcp::socket socket{clientStrand};

    size_t attempts = 1;
    auto endpoint = asio::ip::tcp::endpoint(
            asio::ip::make_address_v4(localhost), port);

    std::future<void> res = socket.async_connect(endpoint, asio::use_future);

    std::future<void> runningContexts[] = {
            std::async(std::launch::async, runContext, std::ref(context)),
            std::async(std::launch::async, runContext, std::ref(context))
    };

    res.get();
    server.stop();
    std::cout << "Server has been requested to stop" << std::endl;

    return 0;
}

UPDATE
According to the sehe's answer I am getting a deadlock, because when server.stop() is invoked, completion handler for successful acception has been already posted but due to cancellation is never invoked, which causes a context to have pending work, hence a deadlock at the end (if I understood correctly). The things I still don't understand are:

There is a separate strand for the server which (according to specification) enforces acceptor's commands to be invoked non-concurrently and in FIFO order. Handlers with no provided executors also have to be handled in the same thread. There's nothing about thread safety of acceptor::cancel() method in documentation, though distinct acceptor objects are safe. So I assumed that it is thread safe (no data races possible within one strand). @sehe's code does not cause deadlock in case the cancel is explicitly posted into the acceptor's thread via asio::post. For 500 invokation there were no deadlocks:

test 499
Awaiting client
New thread 3
New thread 2
Completed client
Server stopped
Accept: 127.0.0.1:14475
Accept: The I/O operation has been aborted because of either a thread exit or an application request.
Stopping thread: 2
Stopping thread: 3
Everyting shutdown

However, if I remove printing code before synchronization and stop() which causes a delay, a dead lock is easily achievable:

PS C:\dev\builds\asio_connection_logic\Release-MinGW-w64\bin> for ($i = 0; $i -lt 500; $i++){
>> Write-Output "
>> test $i"
>> .\sb.sf_example.exe}

test 0
New thread 2
New thread 3
Server stopped
Accept: 127.0.0.1:15160
PS C:\dev\builds\asio_connection_logic\Release-MinGW-w64\bin>
PS C:\dev\builds\asio_connection_logic\Release-MinGW-w64\bin> for ($i = 0; $i -lt 500; $i++){
>> Write-Output "
>> test $i"
>> .\sb.sf_example.exe}

test 0
New thread 2New thread 3

Server stopped
Accept: 127.0.0.1:15174
PS C:\dev\builds\asio_connection_logic\Release-MinGW-w64\bin> ^C

So, the conclusion is that no matter how you invoke acceptor.cancel(), you will get a deadlock.

Is there even a way to avoid deadlock for acceptor?

I had to edit your code to actually use boost asio. Did you mean to tag standalone Asio? — sehe, Apr 05 '21 at 15:14
@sehe I used standalone asio but have tagged boost since there are a lot more tags — Sergey Kolesnik, Apr 05 '21 at 15:16
Got it. I don't blame you. Luckily the diagnosis is the same regardless. — sehe, Apr 05 '21 at 15:18

sehe · Accepted Answer · 2021-04-05T20:15:06.563

I did some liposuction on the code and added some tracing:

Live On Wandbox

#include <boost/asio.hpp>
#include <iostream>
#include <functional>

constexpr unsigned short port = 12000;
namespace asio = boost::asio;
using boost::system::error_code;
using asio::ip::tcp;

void runContext(asio::io_context& io_context) {
    std::cout << "New thread " << std::this_thread::get_id() << std::endl;
    io_context.run();
    std::cout << "Stopping thread: " << std::this_thread::get_id() << std::endl;
}

class server {
  public:
    template <typename Executor>
    explicit server(Executor executor) : acceptor_(executor, {{}, port}) {
        acceptor_.set_option(tcp::acceptor::reuse_address(true));
    }

    void startAccepting() {
        acceptor_.listen();
        acceptLoop();
    }

    void stop()
    {
        //asio::post(acceptor_.get_executor(), [this]() {
            //acceptor_.cancel();
        //}); // this line fixes deadlock
        acceptor_.cancel();
        // acceptor_.close(); // this line also fixes deadlock
    }

  private:
    void acceptLoop() {
        acceptor_.async_accept([this](error_code errorCode, tcp::socket peer) {
            if (!errorCode) {
                std::cout << "Accept: " << peer.remote_endpoint() << std::endl;
                acceptLoop();
            } else {
                std::cout << "Accept: " << errorCode.message() << std::endl;
            }
        });
    }

    tcp::acceptor acceptor_;
};

int main() {
    setvbuf(stdout, NULL, _IONBF, 0);
    asio::io_context context;

    // run server
    server server{make_strand(context)};
    server.startAccepting();

    // run client
    tcp::socket socket{make_strand(context)};
    std::future<void> res = socket.async_connect({ {}, port}, asio::use_future);

    std::thread t1(runContext, std::ref(context));
    std::thread t2(runContext, std::ref(context));

    std::cout << "Awaiting client " << std::endl;

    res.get();

    std::cout << "Completed client" << std::endl;

    server.stop();

    std::cout << "Server stopped" << std::endl;

    t1.join();
    t2.join();
    std::cout << "Everyting shutdown" << std::endl;
}

As you can see a "correct" run prints:

Awaiting client 
New thread 140712013797120
Accept: New thread 140712005404416
127.0.0.1:57500
Completed client
Server stopped
Accept: Operation canceled
Stopping thread: 140712013797120
Stopping thread: 140712005404416
Everyting shutdown

However, an "incorrect run" prints:

New thread 140544269350656
Awaiting client 
New thread 140544260957952
Completed client
Server stopped
Accept: 127.0.0.1:48580
^C

The key is here:

Server stopped
Accept: 127.0.0.1:48580

The cancel comes before the accept. Which means that there was a race, where the completion handler for the async_accept was already in flight with non-failed errorCode. (In other words async_connect returned a bit earlier than the server was able to handle its async_accept completion.)

Indeed, posting on the strand is one way to fix it. This is because any handlers in flight will run before the cancellation, and if async-operation is pending it will be canceled.

NOTE: The other approach with acceptor_.close() invokes undefined behavior because of the data race on acceptor_ itself (which is not thread-safe).

Pet Peeve

Aside: a pet peeve:

One "problem" is std::launch::async. I don't use it. I think the behaviour of it is implementation-defined to an extent that makes it not very useful. Perhaps, use std::thread instead, since that's what you're after, here. On recent boost, use asio::thread_pool(2).

The answers here shed some light Why should I use std::async?

UPDATE TO EDITED QUESTION

You're right it has the same race condition - although without the UB, so that's nice.

Sidenote: You should probably stop calling it "deadlock" because there is none. It's a softlock, you're just waiting for something that never happens (async_connect). Deadlock is when multiple parties contend for locks in a way that can never be satisfied. This is just a soft-lock in the sense that a connection or even a network failure will allow the system to proceed.

So I also stripped the output but added BOOST_ASIO_ENABLE_HANDLER_TRACKING. The resulting picture confirms the exact explanation above:

From here the only obvious solutions would seem to be:

don't cancel the server from the client strand. By extension it would mean
- use the same strand for client and server (not feasible in inter-process comms)
- signal closure inside a connection, so that the message is received on a server strand anyways. This is the "make shutdown command part of the protocol" way.
post a close instead of a cancel, this actively makes any async_accept an error
alternatively, have a state machine in the server that manually checks whether we should still be accepting before initiating a new async_accept

Note that

with the .close() method in with code that uses the native handle it can lead to its own race condition (where another thread immediately opens a new file/socket with the reused filedescriptor and the native code doesn't notice it's talking to the wrong socket). To be honest this mostly seems a problem with stream sockets (not acceptors) and there it's very easy to fix with .shutdown(), so just to be aware.
I've never reached this problem in a lot of production use of ASIO. I guess, in practice your exact usecase (.cancel() perfectly timed together with a new accept completion) doesn't come up a lot
A similar use case that does come up quite a lot is with timers, which can also be tricky to handle race-free. There, the disambiguating factor is also extra state, in the form of basic_waitable_timer::expiry(). See e.g. Cancelling boost asio deadline timer safely

I used future synchronization before calling `server.stop()`. How come future returned before a connection has been accepted? — Sergey Kolesnik, Apr 05 '21 at 15:25
I use `std::async` because of RAII - it "joins" on destruction. And I couldn't find a clear distinction between `io_context` launched in several threads and `thread_pool` (does it implicitly use the specified amount of threads?). — Sergey Kolesnik, Apr 05 '21 at 15:29
What about a `strand`'s lifetime? From your post I figure it is not necessary to make it persistent if only one object uses it. — Sergey Kolesnik, Apr 05 '21 at 15:34
Indeed. In the executor model, executors reference a shared strand implementation inside the strand executor service. Likewise, pass executors by value always. — sehe, Apr 05 '21 at 15:38
On `io_context` vs `thread_pool`: see [e.g.](https://stackoverflow.com/questions/61864490/whats-the-difference-between-asioio-context-and-asiothread-pool) - yes it runs them _explicitly_ on n threads (either defaulted or specified). Except it will do so correctly, e.g. in the face of handler exceptions: https://stackoverflow.com/a/44500924/85371 — sehe, Apr 05 '21 at 15:40
"I used future synchronization before calling server.stop(). How come future returned before a connection has been accepted?" - I spelled it out both ways in the answer: you "synchronized" on the client-side accept, not on the server-side. — sehe, Apr 05 '21 at 15:41
I see. I did not get a deadlock if I waited for some time before calling `server.stop()`. But another question: how can a race condition occur if I use 2 strands and only one thread? — Sergey Kolesnik, Apr 05 '21 at 15:44
Same reason: there's a race condition where the completion has already been posted but not yet executed, and then you cancel it before it does get executed, having no effect. Mind you, I forgot to say explicitly but `acceptor_.cancel()` like `acceptor_.stop()` is also data race with only 1 service thread. In fact there's two threads: the service thread and the main thread. — sehe, Apr 05 '21 at 15:55
I updated my question. Explicitly posting `acceptor.cancel()` event does not prevent deadlocks. So I don't really see a solution for this problem, it seems to be a bug in asio itself. Correct me if I missed something. — Sergey Kolesnik, Apr 05 '21 at 18:48
You're right. I've analyzed some more and updated the answer with some thoughts/experience (also why I didn't realize this before). The same thing applies to timer cancellation as in e.g. https://stackoverflow.com/questions/43168199/cancelling-boost-asio-deadline-timer-safely/43169596#43169596 - you will notice the stark similarity. — sehe, Apr 05 '21 at 19:55
With post of `stop` instead of `cancel` I never had the hang manifest anymore, and the graph reflects it: https://i.imgur.com/xtA3paG.png notice that the "spurious" `async_accept` is failed with ec:9 (EBADF) — sehe, Apr 05 '21 at 19:59
I've come around a very dirty workaround which involves `atomic_bool` to stop the loop and asynchronous `cancel` invocation using `std::thread{[this](){acceptor_.cancel();}}.detach()`. It works, but it is rather ugly. I've peek through `cancel` implementation for Windows and found out that it uses `CancelIoEx` which is thread safe. I think the issue is with the Executor queue, but the code base is not easy to read or understand... I have opened an issue https://github.com/chriskohlhoff/asio/issues/806 — Sergey Kolesnik, Apr 05 '21 at 20:02
It's not an issue. Compare to the similar question about timer cancellation. That's just how asynchrony works: you get most of the timing/ordering issues you know from multi-threading. And yes the `atomic_bool` solution is basically what I suggested as "alternatively, have a state machine in the server that manually checks whether we should still be accepting before initiating a new async_accept" (it's now bullet 2. after I fixed the markdown) — sehe, Apr 05 '21 at 20:14

ASIO: getting deadlock when using several strands and threads with one io_context

1 Answers1

Pet Peeve

UPDATE TO EDITED QUESTION