Boost.Asio async_compose gets stuck under load

Question

I inherited a codebase that was working on boost 1.75. This code was working without issues on boost 1.75:

using CompletionTokenType = boost::asio::yield_context;
using FunctionType = void(boost::system::error_code);
using AsyncResultType = boost::asio::async_result<CompletionTokenType, FunctionType>;
using HandlerType = typename AsyncResultType::completion_handler_type;

[[maybe_unused]] ResultOrErrorType
read(CompletionTokenType token, StatementType const& statement)
{
    auto handler = HandlerType{token};
    auto result = AsyncResultType{handler};
    
    auto const future = handle_.get().asyncExecute(statement, [handler](auto const&) mutable {
        boost::asio::post(boost::asio::get_associated_executor(handler), [handler]() mutable {
            handler(boost::system::error_code{});
        });
    });

    result.get(); // suspends coroutine until handler called

    if (auto res = future.get(); res) {
        return res;
    } else {
        // handle error
    }
}

Then we were forced to migrate to boost 1.82 and with that came the necessity to use async_compose/async_initiate (as far as I understand).

The code had to be changed to the following mess.

We now have to make a shared_ptr out of self because the callback given to asyncExecute is a std::function that copies the callback. Sadly, self is non-copyable, unlike handler was before.
"Callback hell" instead of slightly more sequential code often associated with coroutines

[[maybe_unused]] ResultOrErrorType
read(CompletionTokenType token, StatementType const& statement)
{
    auto future = std::optional<FutureWithCallbackType>{};
    
    auto init = [this, &statement, &future]<typename Self>(Self& self) {
        future.emplace(handle_.get().asyncExecute(statement, [sself = std::make_shared<Self>(std::move(self))](auto&& data) mutable {
                auto executor = asio::get_associated_executor(*sself);
                asio::post(executor, [data = std::move(data), sself = std::move(sself)]() mutable {
                    sself->complete(std::move(data));
                });
        }));
    };

    auto res = asio::async_compose<CompletionTokenType, void(ResultOrErrorType)>(init, token, boost::asio::get_associated_executor(token));

    if (res) {
        return res;
    } else {
        // handle error
    }
}

This new code works correctly most of the times but eventually gets stuck after entering async_compose. I don't see any logs from the handle error path and in general the DB code is assumed to work correctly (from multiple other tests). There is always exactly one callback firing from the asyncExecute call.

The code above is slightly simplified. For example, on MacOS the above code works as is but on Linux(gcc) we need to add an explicit work object to the inner post, otherwise the application crashes. Also on MacOS it's ok to call sself->complete without wrapping it with a post but on Linux once again that makes the application crash.

All this suggests that something is not quite right with this code. Please help me to understand what we are doing wrong. Any ideas/pointers will be highly appreciated.

If there is a way to get the old code working again it's also a viable solution. However, i was not able to do that as completion_handler_type appears to be gone even tho the docs for boost 1.82 claim to still expose it.

Can you try to make a more self-contained example? I'm happy to help with the question, but I really think it is more constructive to ask for help with migrating between 1.72 and 1.75 instead of asking "I'm doing XYZ, what's wrong with it" - because at least from the current descriptions XYZ comtains more than a few steps that are... just unnecessary and probably bad ideas. — sehe, Aug 30 '23 at 14:05
Also in the original code I'm trying to understand `result.get(); // suspends coroutine`. For one thing, there is no coroutine in sight, and e.g. `std::future::get()` doesn't suspend anything: it just blocks (and synchronizes the current thread on a promise state). Of course, we don't know that type of `handle_` or `handle_.get()` let alone the type of `future`, so it's possible that the namings of things are just red herrings. — sehe, Aug 30 '23 at 14:05
@sehe i will try to make a more self-contained example, thanks. for now tho, the future type is not a `std::future`, it's a custom type that stores the given lambda as a `std::function` and the underlying implementation calls back when the data is ready. — Alex Kremer, Aug 30 '23 at 14:07
(Apropos of nothing: a problem with using `std::function` to wrap callbacks/completion tokens is that it hides associated executors: compare e.g. https://stackoverflow.com/questions/71475533/boostasiobind-executor-does-not-execute-in-strand/71475865#71475865) — sehe, Aug 30 '23 at 14:16
My understanding is that i'm explicit about which executor to post on so that should not be a problem. Besides, the `asyncExecute` is unaware of asio and should not have to be. Unfortunately, it's a wrapper on top of a C API so i don't really have a choice (do i?) but to use `std::function`. The `std::function` is then wrapped in a `std::unique_ptr` (so the "future" can survive `std::move`) and stored in the `FutureWithCallback`. The actual pointer under the `unique_ptr` is then given to the C API as "user-data" for the callback setter function. Not ideal but don't have a better option afaik. — Alex Kremer, Aug 30 '23 at 21:18
If you're being explicit about the wrong executor, that doesn't solve much (`get_associated_executor` will happily return `system_executor{}`). Of course, we don't see the code that uses `std::function`, which is why I only mention it "apropos of nothing" and asked for more details :) — sehe, Aug 30 '23 at 21:28
@sehe in case you want to see the actual code: https://github.com/XRPLF/clio/blob/develop/src/data/cassandra/impl/Future.h and also there is a discussion on #boost-beast in cppslack if you wanna take a look there. i'd also love to speak to you directly if that helps. Any attempt at creating an example so far lead to a working application that does not produce the same issues :D — Alex Kremer, Aug 30 '23 at 21:47
You can still share the minimized code - perhaps we can spot an issue that only rarely leads to the symptoms your production version does encounter — sehe, Aug 31 '23 at 00:25
@sehe Absolutely! I was just trying to actually reproduce the bug without the DB but I CAN NOT! Here is some code that attempts to fake some DB parts, works for the fake but fails for the real thing. At this point it's looking like somehow the DB (threads?) are doing something that leads to this.. but i really hope that you may see something else that is actually in the way we use `async_compose`. Example is by no means small, sorry about that. https://godbolt.org/z/MY11jfvYv . I included parts of the code that setup the DB for completeness. Also there are links to real code here and there. — Alex Kremer, Aug 31 '23 at 01:58
Another thing i noticed, if i force it to use `asio::system_executor` then it seems to work fine. This is likely only on MacOS tho. as i believe by default it will just call the handler in place and i already seen this crash on linux :) Thank you for looking into this @sehe! — Alex Kremer, Aug 31 '23 at 02:00
`system_executor` is likely just masking the problem because it will use the default thread_pool, making it harder to block all service threads. It does confirm that something blocks the service thread(s) when the problem occurs though. Will have a look tomorrow — sehe, Aug 31 '23 at 02:05
No matter how much i try to reproduce this issue by using a fake thread pool it never fails. The real DB code still locks. I did even more tests to verify that the callback is indeed called by the DB but the inner `post` does not finish and so `complete` is never called. How could a 3rd party thread(pool) interfere with asio? How can it be different from any of the other async operations i'm trying to fake with? It makes no logical sense to make a difference.. or am i missing something? — Alex Kremer, Sep 02 '23 at 10:20
Is there a github issue with a reproducing load on [clio](https://github.com/XRPLF/clio) perhaps? I don't feel "blindly attacking" the code base will be effective, but if we can zoom on the repro (ideally even comparing with the 1.75 behaviour) that could be fairly quick, assuming that `clio` is something anyone can run for the repro [And yes, we're both obviously missing something. That's the good news I guess] — sehe, Sep 02 '23 at 11:54

Boost.Asio async_compose gets stuck under load

0 Answers0