0

I have the following two member functions in a class. The _mtxWrite is a mutex object I use to make the write function thread safe. During heavy load, sometimes the writeHandler doesn't get called. Then _mtxWrite doesn't get released, resulting in a deadlock. What is the best way detect the situation and resolve the deadlock?

template <class TSession>
void write(boost::shared_ptr<TSession> pSession, 
    boost::shared_ptr<CInProtocolBase> pMessage)
{
    std::vector<unsigned char>* pData = new std::vector<unsigned char>;
    pMessage->serialize(*pData); 

    _mtxWrite.lock();
    boost::asio::async_write(_socket,boost::asio::buffer(&pData->at(0), 
        pData->size()),
        boost::bind(&this_type::writeHandler<TSession>,
        shared_from_this(),pSession,pData,boost::asio::placeholders::error));
}

template <class TSession>
void writeHandler(boost::shared_ptr<TSession> pSession, 
    std::vector<unsigned char>* pData,const boost::system::error_code& ec)
{
    delete pData;
    _mtxWrite.unlock();

    if(ec)
    {
        _socket.get_io_service().post(boost::bind(&TSession::errorHandler,
            pSession,ec));
    }
}
Tanner Sansbury
  • 51,153
  • 9
  • 112
  • 169
Sharath
  • 1,627
  • 2
  • 18
  • 34
  • Why does your `writeHandler()` not have a `bytes_transferred` argument? And what is the exact type of `_mtxWrite`? If it's a recursive mutex then a call in the same thread will sail through that `lock()` and post an overlapping (and disallowed) `async_write()`. If it's not a recursive mutex then a call in the same thread will throw an exception if the lock was already acquired. – rhashimoto May 27 '14 at 18:19
  • 2
    Enabling [handler tracking](http://www.boost.org/doc/libs/release/doc/html/boost_asio/overview/core/handler_tracking.html) may provide some insight into what is occurring. – Tanner Sansbury May 27 '14 at 19:41
  • @rhashimoto: It is boost::mutex, not recursive. I am not using bytes_transferred for now, loss of a message is ok, but not deadlock. – Sharath May 28 '14 at 01:18

1 Answers1

2

Your code is locking a mutex where async_write() is called and unlocking it in the handler. If these operations happen in different threads then this will violate the unlock() preconditions (the current thread owns the mutex). This may be causing your deadlock. In general, separated locking and unlocking is usually an indication of suboptimal design.

A better way of handling asynchronous writes would be to use a queue. The logic would look something like:

Write function:

  • Lock mutex.
  • Push buffer onto queue.
  • If queue contains exactly one element, call async_write on the front element.
  • Unlock mutex (ideally via a scoped_lock falling out of scope).

Write handler:

  • Lock mutex.
  • Pop buffer from queue (and free it).
  • If the queue is not empty, call async_write on the front element.
  • Unlock mutex (ideally via a scoped_lock falling out of scope).

With this approach locking is localized and there is no opportunity for deadlock. @TannerSansbury's pointer in the comments to using a strand for synchronization instead of a mutex will work similarly - it implicitly provides exclusion for any function run in the strand.

I've never had a case of a write handler not being called that wasn't eventually determined to be my own bug, but you can use a boost::asio::deadline_timer to watch for this by setting the timer when you call async_write() and canceling it in the write handler.

rhashimoto
  • 15,650
  • 2
  • 52
  • 80
  • 1
    An example of an asynchronous queuing write chain synchronized with a `strand` is shown [here](http://stackoverflow.com/a/7756894/1053968). – Tanner Sansbury May 27 '14 at 19:39
  • Yes, I am unlocking in a different thread. Didn't realise that would be a problem. – Sharath May 28 '14 at 02:08
  • @SharathKShetty Mutex locking and unlocking is generally required to take place on the same thread. Your original code looks more like how a semaphore would be used - perhaps you're more familiar with that synchronization primitive. [This question](http://stackoverflow.com/questions/62814/difference-between-binary-semaphore-and-mutex) might help to explain the difference. – rhashimoto May 28 '14 at 18:09
  • I am not new to threading/mutexes, but relatively new to boost. Since this code worked most of the time, I thought it was okay. I think the io_service was reusing the same worker thread for the handler, so it worked most of the time, hence I didn't think about it. But when different thread was used, it went kaput. So the handler must have been called, but from a different worker thread. Is that a good explanation for what must have happened? – Sharath May 28 '14 at 19:21
  • 1
    @SharathKShetty I'd say it's a conceivable explanation (and the only one I currently see) - I don't think you mention your OS and I wouldn't know the implementation of `boost::mutex` well enough anyway to say for sure. It's something that needed to be fixed, in any case. You also should put the `bytes_transferred` argument in your handler and placeholder in your handler binding. – rhashimoto May 28 '14 at 19:42
  • I implemented the code as suggested in your answer, and it works beautifully. Thanks a lot. Regarding the bytes_transferred parameter, I added it. But I am not sure what is the right thing to do if the bytes_transferred value doesn't match the bytes I tried to write. Retry writing the remaining bytes? How common or rare is this situation? – Sharath May 29 '14 at 15:28
  • For `async_write()`, `bytes_transferred` will [either be the total amount of data or an error will be set](http://www.boost.org/doc/libs/1_55_0/doc/html/boost_asio/reference/async_write/overload1.html). There can be no partial transfer without an error condition. The handler has that argument because it is a general form also used by other write functions that can have partial transfers, like `basic_stream_socket::async_write_some()`. – rhashimoto May 29 '14 at 15:37