boost: how to monitor status of mutex and force release on deadlock [2]

Question

I am trying to use boost's shared_lock and unique_lock to implement a basic reader-writer lock on a resource. However, some of the threads accessing the resource have the potential to simply crash. I want to create another process that, given a mutex, monitors it and keeps track of which processes have locked the resource and how long each process has held the lock. The monitoring process will also force a process to release its lock if it has held the lock for more than a given period of time.

Although the boost locks are all scoped locks and unlock automatically when they go out of scope, that doesn't solve my problem if the server crashes and the process is killed by SIGSEGV. The killed process will not run any of its destructors and thus will not release any of the resources it holds.

One potential solution is to somehow put a timer on the lock so that a process is forced to release the lock after holding it for a given period. Even though this goes against the concept of locking, it works in our case: we can guarantee that if any process holds the lock for more than, let's say, 5 minutes, then it is pretty safe to say that the process has either been killed or is in a deadlock.

Any suggestions on how to approach this problem are greatly appreciated!

My previous thread was closed due to "possible duplicate", but the stated duplicate question does not answer my question.

boost: how to monitor status of mutex and force release on deadlock

Re: "However, some of the threads accessing the resource have the potential to simply crash." - What's wrong with designing the threads so they don't crash? — In silico, Nov 12 '12 at 20:14
Instead of launching threads, how about launching processes. I am only familiar with this in a windows environment, but you could use a named mutex in this case and you would be able to detect the abandoned mutex state. — pstrjds, Nov 12 '12 at 20:16
It is difficult to understand how what you suggest, even if somehow implemented, would lead to an increase in overall reliability. If a thread/process 'crashes', the validity of any shared data is suspect, even if you could bodge some way of regaining access to it. — Martin James, Nov 12 '12 at 22:13
what are the resources which are held locked when the process sigsegv, and that you wish to unlock? — didierc, Nov 12 '12 at 23:38
Some threads `have the potential to simply crash`. You should concentrate on the origin of your problem: make those threads work properly, particularly within the locked section. When a thread crashes within a "locked" section, what would be the state of the locked resource? Even if you manage to unlock the resource somehow from "outside", the state of the resource remains **unknown**, and therefore the approach has little chance of being a big improvement because it may just force the next crash due to the **unknown** state of the resource. — Arno, Nov 13 '12 at 08:03
The state of a resource isn't necessarily unknown in all cases. Multiple non-atomic operations can be guarded by beginning and end atomic operations so that they collectively form a transaction. Then it will be possible for incomplete transactions to be detected when a lock is overridden. — Josh Heitzman, Nov 13 '12 at 15:46
Design your threads to simply not crash or at least limit the crash to appear only during unexpected events. If you expect a thread to crash there's something fundamentally wrong in your design. — Gianluca Ghettini, Nov 13 '12 at 15:51
@JoshHeitzman: So in fact you say that you just need to keep track of the point in the code up to which everything went fine before a crash happens and the lock gets stuck. And then you can recover the resource depending on the information gathered by the `track keeping`. This sounds like writing the code to do its job at least twice. Lots of overhead to be prepared for a crash? Lots more lines to prepare for a crash recovery? Those additions may make the code even more `crash likely`. I'm still convinced that the other way around is the better way: as little code as possible. — Arno, Nov 14 '12 at 10:06
> What's wrong with designing the threads so they don't crash? If the `mutex` is used in inter-process how can you guarantee the process won't crash? There could be a power outage. — pooya13, Apr 01 '21 at 05:18

Josh Heitzman · Answer 1 · 2012-11-13T15:40:30.813

Putting aside whether this is a good idea or not, you could roll your own mutex implementation that utilizes shared memory to store a timestamp, a process identifier, and a thread identifier.

When a thread wants to take the lock, it needs to find an empty slot in the shared memory and use an atomic compare-and-swap operation, such as InterlockedCompareExchange on Windows, to set the process ID only if the current value is the empty value. If the set doesn't occur, it needs to start over. After getting the process ID set, the thread needs to repeat the procedure for the thread identifier, and then do the same thing with the timestamp (it can't just write it; it still needs to be done atomically).

The thread then needs to check all of the other filled slots to determine whether it has the lowest timestamp. If not, it needs to make a note of the slot that has the lowest timestamp and poll it until that slot is emptied, has a higher timestamp, or has timed out. Rinse and repeat until the thread has the slot with the oldest timestamp, at which point the thread has acquired the lock.

If another slot has timed out, the thread should trigger the timeout handler (which may kill the other process or simply raise an exception in the thread holding the lock) and then use atomic test-and-set operations to clear that slot.

When the thread with the lock unlocks it then uses atomic test and set operations to clear its slot.

Update: ties among the lowest timestamps would also need to be dealt with to avoid a possible deadlock, and the handling of that would need to avoid creating a race condition.
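The slot bookkeeping described above can be sketched roughly as follows. This is only an illustration under several assumptions, not a complete lock: the table is a plain in-process object standing in for a shared memory segment, the owner ID and the timestamp are packed into one 64-bit word so that a single compare-and-swap updates both (a simplification of the separate fields described in the answer), and the names `SlotTable`, `claim`, `release` and `evict_if_stale` are hypothetical. The step of waiting until your own slot holds the oldest timestamp, and the tie-breaking mentioned in the update, are left out.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

constexpr int kSlots = 8;            // assumed fixed number of slots
constexpr std::uint64_t kEmpty = 0;  // a word of 0 marks an empty slot

// Stand-in for the shared memory region; a real cross-process version
// would place this array in a shared memory segment.
struct SlotTable {
    std::atomic<std::uint64_t> slots[kSlots] = {};
};

// Pack a non-zero owner ID (high 32 bits) and a timestamp in seconds (low 32 bits).
inline std::uint64_t pack(std::uint32_t owner, std::uint32_t stamp_s) {
    return (static_cast<std::uint64_t>(owner) << 32) | stamp_s;
}
inline std::uint32_t owner_of(std::uint64_t word) { return static_cast<std::uint32_t>(word >> 32); }
inline std::uint32_t stamp_of(std::uint64_t word) { return static_cast<std::uint32_t>(word); }

// Claim the first empty slot for `owner`. Returns the slot index, or -1 if all are taken.
inline int claim(SlotTable& t, std::uint32_t owner, std::uint32_t now_s) {
    assert(owner != 0);  // 0 is reserved so that a packed word is never kEmpty
    for (int i = 0; i < kSlots; ++i) {
        std::uint64_t expected = kEmpty;
        if (t.slots[i].compare_exchange_strong(expected, pack(owner, now_s))) {
            return i;
        }
    }
    return -1;
}

// Clear slot `i`, but only if it is still held by `owner`.
inline bool release(SlotTable& t, int i, std::uint32_t owner) {
    std::uint64_t word = t.slots[i].load();
    if (word == kEmpty || owner_of(word) != owner) return false;
    return t.slots[i].compare_exchange_strong(word, kEmpty);
}

// Clear slot `i` if its holder has exceeded `limit_s` seconds.
// The compare-and-swap uses the word that was inspected, so a slot that was
// released and re-claimed in the meantime is not evicted by mistake.
inline bool evict_if_stale(SlotTable& t, int i, std::uint32_t now_s, std::uint32_t limit_s) {
    std::uint64_t word = t.slots[i].load();
    if (word == kEmpty) return false;
    if (static_cast<std::uint32_t>(now_s - stamp_of(word)) <= limit_s) return false;
    return t.slots[i].compare_exchange_strong(word, kEmpty);
}
```

A waiter would call `evict_if_stale` on the slot it is polling and run whatever timeout handler applies before treating the lock as free. Whether the protected data is still usable after an eviction is a separate question, as the comments on the question point out.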

Is this a `lock around the lock`? This may fail too and then? — Arno, Nov 13 '12 at 07:56
@Arno - yes. The OP's plan, and any solution to it, sounds like bolting an error-prone band-aid onto an app that is already stuffed and in an unknown state. The whole thing fills me with 'NO! NO! NO!'. — Martin James, Nov 13 '12 at 12:20
@Arno - no it is not. This is a custom mutex implementation that stands on its own without needing to utilize another lock in its implementation. It is designed to be robust against the thread holding it crashing or hanging to allow other threads to recover it. — Josh Heitzman, Nov 13 '12 at 15:38
@JoshHeitzman: `timeout handler which may kill the other process`. We don't care about this damn process which has crashed and has not released the lock. Just forget about it. Wasn't a nice process anyway, did crash, why care, we just kill it, its own fault. But after a little while we ask ourself why did we actually start that process and we may remember that the process had some purpose, and now? We are left with a pretty complicated method which allows us to kill a process when it has crashed? Is it right to kill a process when it has crashed and what do we expect to happen? — Arno, Nov 14 '12 at 10:16

score 0 · Answer 2 · edited Nov 28 '12 at 17:32

@Arno: I disagree that the software needs to be so robust that it should not crash in the first place. Fault-tolerant systems (think along the lines of five nines of availability) need to have checks in place to recover in the face of sudden termination of critical processes. Something along the lines of pthread_mutexattr_*robust.

Saving the owner pid, the last used timestamp for the mutex should help in recovery.
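A minimal sketch of the robust-mutex approach with POSIX threads might look like the following. It assumes a platform that implements `pthread_mutexattr_setrobust` and `pthread_mutex_consistent` (POSIX.1-2008, for example Linux with glibc). The helper names `init_robust_mutex`, `lock_robust` and `LockResult` are hypothetical, and the repair of the protected data after `EOWNERDEAD` is left as a comment because it depends entirely on the application.

```cpp
#include <cassert>
#include <cerrno>
#include <pthread.h>

// Initialise `m` as a robust mutex. For use across processes, `m` must live
// in shared memory and `process_shared` must be true.
// Returns 0 on success or an error number from the pthread calls.
inline int init_robust_mutex(pthread_mutex_t* m, bool process_shared) {
    assert(m != nullptr);
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0) return rc;
    rc = pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    if (rc == 0 && process_shared) {
        rc = pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    }
    if (rc == 0) rc = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

enum class LockResult { Acquired, RecoveredFromDeadOwner, Failed };

// Lock `m`. If the previous owner died while holding it, the caller still
// gets the lock and is told so, giving it a chance to repair the data.
inline LockResult lock_robust(pthread_mutex_t* m) {
    int rc = pthread_mutex_lock(m);
    if (rc == 0) return LockResult::Acquired;
    if (rc == EOWNERDEAD) {
        // The lock is held by us now, but the protected data may be
        // half-updated. Application-specific validation or repair goes here.
        if (pthread_mutex_consistent(m) == 0) {
            return LockResult::RecoveredFromDeadOwner;
        }
        pthread_mutex_unlock(m);  // could not mark it consistent; give up
    }
    return LockResult::Failed;
}
```

The owner PID and timestamp mentioned above would be stored next to the mutex in the same shared region and updated while the lock is held, so that whoever receives `RecoveredFromDeadOwner` can log which process died and decide how to recover.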
