Robust mutex in shared memory not so robust

Question

When using a pthread-based robust mutex, placed in shared memory via a boost::interprocess::managed_shared_memory object, to signal from one process to another, I notice two issues: a) behaviour depends on start-up order, and/or b) behaviour changes when processes are restarted. The crux of the problem is that, under certain conditions, the signals (sent via condition variables) in my sample apps are not received.

I have published a (minimal) code sample on GitHub: https://github.com/zerodefect/mutex_example. I have tried to keep the code sample as brief as possible, but it still spans a few files. I'm hoping it is acceptable to link to a GitHub repository in this instance?

I have 2 processes - process_b:

while (true)
{
    // Notify 'Process A' every 2 seconds.
    std::this_thread::sleep_for(std::chrono::seconds(2));
    pthread_cond_signal(pCv);

    std::cout << "Info: Signaled" << std::endl;
}

which merely attempts to signal to process_a:

while (true)
{
    if (!timed_lock_mutex(pMutex, std::chrono::seconds(5)))
    {
        std::cout << "Warning: Mutex wait timeout." << std::endl;
        continue;
    }

    BOOST_SCOPE_EXIT(pMutex)
    {
        unlock_mutex(pMutex);
    } BOOST_SCOPE_EXIT_END

    if (!wait_for_cv(pCv, pMutex, std::chrono::seconds(10)))
    {
        std::cout << "Warning: Wait timeout!" << std::endl;
        continue;
    }

    std::cout << "Info: Received notification." << std::endl;
}

Problem Scenarios

Scenario 1:

Start process A
Start process B (signals not received)

Scenario 2:

Start process B
Start process A (works at this point)
Restart process B (signals stop being received)

Questions:

Am I using boost's managed_shared_memory object correctly?
Have I configured the mutex correctly?

Environment:

Linux via Ubuntu 18.04.3 LTS
GCC v8.3.0
Boost v1.55

Update: @Jorge Bellon identified an issue where the mutex and condition variable were being initialized twice. With that resolved, the program now hangs inside the condition variable calls. When it locks up, the stack traces appear as:

process_a:

futex_wait 0x00007ffff7bc3602
futex_wait_simple 0x00007ffff7bc3602
__condvar_acquire_lock 0x00007ffff7bc3602
__condvar_cancel_waiting 0x00007ffff7bc3602
__pthread_cond_wait_common 0x00007ffff7bc40bd
__pthread_cond_timedwait 0x00007ffff7bc40bd
wait_until cv_utils.cpp:73
wait_for_cv cv_utils.cpp:93
main main_process_a.cpp:85
__libc_start_main 0x00007ffff6fe6b97
_start 0x000055555555734a

process_b:

futex_wait 0x00007ffff7bc44b0
futex_wait_simple 0x00007ffff7bc44b0
__condvar_quiesce_and_switch_g1 0x00007ffff7bc44b0
__pthread_cond_signal 0x00007ffff7bc44b0
main main_process_b.cpp:73
__libc_start_main 0x00007ffff6fe6b97
_start 0x00005555555573aa

Are you sure you need to initialize the mutex in both processes? This does not seem right to me. You can use an `std::atomic_flag::test_and_set` and then initialize the mutex if and only if the flag was not previously set. This can be the reason why restarting process B does not make it work. — Jorge Bellon, Jan 07 '20 at 09:15
Ah, that hadn't even crossed my mind! I'll give that a try. What makes it interesting though is that it does initially work depending on certain startup order, but I would have thought it would _always_ fail. — ZeroDefect, Jan 07 '20 at 10:11
[man page for pthread_mutex_init](https://linux.die.net/man/3/pthread_mutex_init) says that it might return `EBUSY` in that case, but the section _Tradeoff Between Error Checks and Performance Supported_ states that errors resulting from a wrong program may not get reported, so this might be a good explanation of why it does not return an error. — Jorge Bellon, Jan 07 '20 at 13:04
You are correct. The same needs to be done for condition variables too (according to the docs). Made the change and pushed, but it is still locking up depending on start-up order. :( — ZeroDefect, Jan 07 '20 at 16:27
I suggest using `strace` to see what is going on under the hood. If you can't get any progress, give a look to file locks (`flock`) or use pipes/fifos instead, which are more common and reliable. Linux pipes are implemented with shared memory. — Jorge Bellon, Jan 07 '20 at 16:40
If it were me I would not allow *either* process to do the construct - I would limit it to process_a .. and instead retry on the find in process_b .. that sort of eliminates the possibility of a race condition on create. Depending on the semantics, if the object is created persistently (ie not cleaned up when a process dies) you could use a separate process to do the initial create before either of these two are started. The main idea here is to limit create responsibilities to one entity to avoid races entirely - using a separate process to do the initial create would get 2 working reliably. — Andrew Atrens, Jan 08 '20 at 14:10
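The "initialize once" idea from the comments above could be sketched roughly as follows. This is only an illustration under stated assumptions: `SharedSync` and `init_once` are hypothetical names that do not come from the linked repository, the object is assumed to be constructed exactly once in the shared segment, and the atomics are assumed to be lock-free so they work across processes. It also does not handle the case where the initializing process dies halfway through, in which case waiters would spin forever.

```cpp
#include <atomic>
#include <cassert>
#include <pthread.h>
#include <thread>

// Hypothetical layout of the object both processes map from shared memory.
struct SharedSync
{
    std::atomic_flag init_started = ATOMIC_FLAG_INIT; // claimed by the first caller
    std::atomic<bool> init_done{false};               // published once init finishes
    pthread_mutex_t mutex;
    pthread_cond_t cv;
};

// Initializes the mutex and condition variable at most once per object.
// Returns true if this call performed the initialization, false if another
// caller had already claimed it.
bool init_once(SharedSync& s)
{
    if (s.init_started.test_and_set())
    {
        // Someone else claimed the initialization; wait until it is published.
        while (!s.init_done.load(std::memory_order_acquire))
            std::this_thread::yield();
        return false;
    }

    // Robust, process-shared mutex.
    pthread_mutexattr_t ma;
    int rc = pthread_mutexattr_init(&ma);
    rc |= pthread_mutexattr_setpshared(&ma, PTHREAD_PROCESS_SHARED);
    rc |= pthread_mutexattr_setrobust(&ma, PTHREAD_MUTEX_ROBUST);
    rc |= pthread_mutex_init(&s.mutex, &ma);
    rc |= pthread_mutexattr_destroy(&ma);

    // Process-shared condition variable (there is no robust attribute for it).
    pthread_condattr_t ca;
    rc |= pthread_condattr_init(&ca);
    rc |= pthread_condattr_setpshared(&ca, PTHREAD_PROCESS_SHARED);
    rc |= pthread_cond_init(&s.cv, &ca);
    rc |= pthread_condattr_destroy(&ca);
    assert(rc == 0 && "pthread initialization failed");

    s.init_done.store(true, std::memory_order_release);
    return true;
}
```

Limiting creation to a single process, as suggested in the last comment, avoids the need for the flag altogether; the flag only matters when either process may arrive first.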

x00 · Accepted Answer · 2020-01-09T19:28:49.647

My guess is that your code locks up because you never destroy the shared memory. From https://theboostcpplibraries.com/boost.interprocess-shared-memory:

"If remove() is never called, the shared memory continues to exist even if the program terminates. Whether or not the shared memory is automatically deleted depends on the underlying operating system. Windows and many Unix operating systems, including Linux, automatically delete shared memory once the system is restarted."

Thus Process A tries to acquire the condvar's internal lock in the pthread_cond_wait call, but that lock was left held by a previous run. And because you have no exit logic, you most definitely killed the processes, so the lock was never released. The same goes for Process B.

The fact that the mutex you've created is a robust mutex is irrelevant, because it is not the lock you are blocked on. ...But I'm actually not sure what you are waiting on; I don't know the properties of the condvar's internal futex. Further investigation is needed, but judging by the observed behavior it is not robust.

And by the way, you obtain the shared mutex in Process B but never use it. But maybe, just maybe, you should: see "Calling pthread_cond_signal without locking mutex". And one more thing: pthread_cond_timedwait can return EOWNERDEAD, and you must check for that error in your wait_for_cv().
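The two points above (signalling while holding the mutex, and checking for EOWNERDEAD) might look something like the sketch below. The helper names `lock_robust`, `wait_robust`, and `notify_robust` are hypothetical and are not the functions from the question's repository. The `notified` flag is also an assumption added here so that the wait has a predicate to check. Note that this only covers the robust mutex; it does not make the condition variable itself recoverable after a process is killed.

```cpp
#include <cassert>
#include <cerrno>
#include <ctime>
#include <pthread.h>

// Locks a robust mutex. If the previous owner died while holding it, the
// lock is still acquired and the mutex is marked consistent again.
// Returns 0 on success, otherwise an errno-style error code.
int lock_robust(pthread_mutex_t* m)
{
    int rc = pthread_mutex_lock(m);
    if (rc == EOWNERDEAD)
    {
        // Any shared state guarded by the mutex should be repaired here.
        rc = pthread_mutex_consistent(m);
    }
    return rc;
}

enum class WaitResult { Notified, Timeout, Error };

// Waits until *notified becomes true or `seconds` elapse.
// The caller must already hold `m`; it is held again on return.
WaitResult wait_robust(pthread_cond_t* cv, pthread_mutex_t* m,
                       bool* notified, int seconds)
{
    timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline); // default condvar clock
    deadline.tv_sec += seconds;

    while (!*notified)
    {
        const int rc = pthread_cond_timedwait(cv, m, &deadline);
        if (rc == EOWNERDEAD)
        {
            // The mutex was re-acquired from a dead owner during the wait.
            if (pthread_mutex_consistent(m) != 0)
                return WaitResult::Error;
            continue;
        }
        if (rc == ETIMEDOUT)
            return *notified ? WaitResult::Notified : WaitResult::Timeout;
        if (rc != 0)
            return WaitResult::Error;
    }
    return WaitResult::Notified;
}

// Sets the predicate and signals while holding the mutex.
int notify_robust(pthread_cond_t* cv, pthread_mutex_t* m, bool* notified)
{
    int rc = lock_robust(m);
    if (rc != 0)
        return rc;
    *notified = true;
    pthread_cond_signal(cv);
    return pthread_mutex_unlock(m);
}
```

In a setup like the one in the question, the receiver would reset the flag to false under the mutex after each notification.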

edited Jan 09 '20 at 19:28

answered Jan 09 '20 at 06:14

x00


Wait, so am I misunderstanding the use/purpose of a "robust mutex"? I thought the purpose of such a concept was that one could recover when a process goes down. If I remove the shared memory while any process is running, I'm then pulling the rug from under a process' feet. – ZeroDefect Jan 09 '20 at 13:44
You might be able to employ a signal handler to release shared mutex on crash/exit. Maybe 'robust' mutex already does this? – Andrew Atrens Jan 09 '20 at 16:45
Thanks @x00 - it looks like you're right. Torvald Riegel says, "POSIX (and glibc) do not specify (or provide) robust variants of condition variables..." https://sourceware.org/bugzilla/show_bug.cgi?id=21422 – ZeroDefect Jan 09 '20 at 22:16
@AndrewAtrens, that is an idea, but I might see if I can implement my own robust CV using semaphores as building blocks. – ZeroDefect Jan 09 '20 at 22:21
@ZeroDefect, thanx for the link. – x00 Jan 09 '20 at 22:27
There is this to look at too - https://www.boost.org/doc/libs/1_70_0/doc/html/boost/interprocess/named_mutex.html and this http://www.opengroup.org/onlinepubs/9699919799/functions/pthread_mutexattr_setrobust.html – Andrew Atrens Jan 10 '20 at 16:31
@AndrewAtrens there is something interesting in these links... At least the phrase "This mutex can't be placed in shared memory" got my attention. But I don't really understand what were you trying to say? That phrase for example was about named_mutex, which is not used here. And nothing got my attention in pthread_mutexattr_getrobust docs. – x00 Jan 12 '20 at 05:05
well, my first thought was .. perhaps named mutex is what you want to use instead .. it seems to be about synchronizing processes and might already handle the corner cases that you are working through.. even if it’s not quite what you are looking for a peruse of the source code may be enlightening.. my second thought, without looking too closely was that a robust attribute exists, has some ‘robust’ effect and seems to be settable. Again maybe not exactly what you want, but again a peruse of the implementation may be helpful. – Andrew Atrens Jan 12 '20 at 06:13
Looking a bit more closely, it seems like the second approach (setting the mutex's robust attribute and updating your code to manage the-owner-has-died return code) could be made to work, (there is sample code here: http://man7.org/linux/man-pages/man3/pthread_mutexattr_setrobust.3.html ) but the first approach (using named_mutex provided by boost), might be easier. I guess it depends on whether this is more of a learning exercise for you, or if you are pushing on a deadline to get something working. :) Anyways, all the best. --Andrew – Andrew Atrens Jan 13 '20 at 16:29
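As a rough starting point for the semaphore idea mentioned in the comments above, a process-shared POSIX semaphore can carry the notification on its own. This is a minimal sketch with hypothetical helper names (`init_notifier`, `notify`, `wait_for_notification`); it assumes the `sem_t` lives in memory mapped by both processes and is initialized by exactly one of them. It is not a robust condition variable, and its semantics differ: a semaphore counts notifications, so a post made while nobody is waiting is not lost.

```cpp
#include <cassert>
#include <cerrno>
#include <ctime>
#include <semaphore.h>

// Initializes a process-shared semaphore with a count of zero.
// Must be called once, by a single process.
bool init_notifier(sem_t* sem)
{
    return sem_init(sem, /*pshared=*/1, /*value=*/0) == 0;
}

// Sender side: increments the count, waking one waiter if there is one.
bool notify(sem_t* sem)
{
    return sem_post(sem) == 0;
}

// Receiver side: waits up to `seconds` for a notification.
// Returns true if one was consumed, false on timeout or error.
bool wait_for_notification(sem_t* sem, int seconds)
{
    timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += seconds;

    int rc;
    do
    {
        rc = sem_timedwait(sem, &deadline);
    } while (rc == -1 && errno == EINTR); // retry if interrupted by a signal

    return rc == 0;
}
```

Process B would call `notify` every two seconds and Process A would loop on `wait_for_notification`, with no mutex involved in the hand-off.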
