How do I recover a semaphore when the process that decremented it to zero crashes?

Question

I have multiple apps compiled with g++, running in Ubuntu. I'm using named semaphores to co-ordinate between different processes.

All works fine except in the following situation: If one of the processes calls sem_wait() or sem_timedwait() to decrement the semaphore and then crashes or is killed -9 before it gets a chance to call sem_post(), then from that moment on, the named semaphore is "unusable".

By "unusable", what I mean is the semaphore count is now zero, and the process that should have incremented it back to 1 has died or been killed.

I cannot find a sem_*() API that might tell me the process that last decremented it has crashed.

Am I missing an API somewhere?

Here is how I open the named semaphore:

sem_t *sem = sem_open( "/testing",
    O_CREAT     |   // create the semaphore if it does not already exist
    O_CLOEXEC   ,   // close on execute
    S_IRWXU     |   // permissions:  user
    S_IRWXG     |   // permissions:  group
    S_IRWXO     ,   // permissions:  other
    1           );  // initial value of the semaphore

Here is how I decrement it:

struct timespec timeout = { 0, 0 };
clock_gettime( CLOCK_REALTIME, &timeout );
timeout.tv_sec += 5;

if ( sem_timedwait( sem, &timeout ) )
{
    throw "timeout while waiting for semaphore";
}

Stéphane · Accepted Answer · 2022-08-07T16:41:19.570

54

Turns out there isn't a way to reliably recover the semaphore. Sure, anyone can sem_post() to the named semaphore to get the count to increase past zero again, but how can you tell when such a recovery is needed? The API provided is too limited and doesn't indicate in any way when this has happened.

Beware of the ipc tools also available -- the common tools ipcmk, ipcrm, and ipcs are only for the outdated SysV semaphores. They specifically do not work with the new POSIX semaphores.

But it looks like there are other things that can be used to lock things, which the operating system does automatically release when an application dies in a way that cannot be caught in a signal handler. Two examples: a listening socket bound to a particular port, or a lock on a specific file.

I decided the lock on a file is the solution I needed. So instead of a sem_wait() and sem_post() call, I'm using:

lockf( fd, F_LOCK, 0 )

and

lockf( fd, F_ULOCK, 0 )

When the application exits in any way, the file is automatically closed which also releases the file lock. Other client apps waiting for the "semaphore" are then free to proceed as expected.

Thanks for the help, guys.

UPDATE:

12 years later, thought I should point out that POSIX mutexes do have a "robust" attribute. That way, if the owner of the mutex gets killed or exits, the next user to lock the mutex still acquires it but gets the return value EOWNERDEAD, allowing the mutex to be recovered. This will make it similar to the file and socket locking solution. Look up pthread_mutexattr_setrobust() and pthread_mutex_consistent() for details. Thanks, Reinier Torenbeek, for this hint.

edited Aug 07 '22 at 16:41

answered Jan 13 '10 at 17:04

Stéphane

+1, I ended up doing the same thing, semaphores are useless in such scenarios – Anurag Uniyal Aug 10 '11 at 14:45
Someone e-mailed me to ask for more details. I did write a small blog post almost 3 years ago when I ran into this problem. More details on how I solved it with file lock is available here: http://charette.no-ip.com:81/programming/2010-01-13_PosixSemaphores/index.html – Stéphane Dec 01 '12 at 21:46
Can the same thing be achieved by simply opening and closing a file? I found this on the man page for open(): "When opening a file, a lock with flock(2) semantics can be obtained by setting O_SHLOCK for a shared lock, or O_EXLOCK for an exclusive lock." – Raffi Khatchadourian Mar 08 '13 at 22:48
Hi, Any hint on replacing sem_timedwait? Does 'select' block/unblock on locks? – ronszon Mar 11 '13 at 14:04
@AnuragUniyal Agreed. Funny enough, I had a look at the sqlite code base and they are using POSIX semaphores! – Raffi Khatchadourian Mar 21 '13 at 19:04
Does anyone have a suggestion on how to do this when I want my semaphore to have a higher initial value? I could have a list of files, and iterate through them finding one to lock, but that would rapidly get expensive... – Chris Jefferson Feb 13 '17 at 10:59
@Stéphane Maybe also consider using mutex by sharing memory?? I just found this post here: [Is it possible to use mutex in multiprocessing case on Linux/UNIX ?](http://stackoverflow.com/questions/9389730/is-it-possible-to-use-mutex-in-multiprocessing-case-on-linux-unix) – yaobin May 17 '17 at 14:42
One thing to keep in mind with flock is that queueing is not fair at all. With 3 or more tasks, the number of turns waited can be anything ... mutexes / semaphores don't make guarantees either, but in practice you get 99% FIFO order. – lemonsqueeze Apr 22 '18 at 10:22
Argh. 2022 and I am still stumbling across this problem. The fact that POSIX semaphores don't offer a way to handle crashes reliably is... bad. The fact that the "outdated SysV semaphores" DO offer this is even worse. And macOS doesn't even have a "sem_timedwait", which makes the oh so modern POSIX semaphores completely useless. What a sad state of affairs. – Ingo Blackman Mar 25 '22 at 17:29

score 6 · Answer 2 · edited May 16 '23 at 13:14

Use a lock file instead of a semaphore, much like @Stéphane's solution but without the lockf() calls. You can simply open the file using an exclusive lock (note that O_EXLOCK is a BSD/macOS extension and is not defined on Linux):

//call to open() will block until it can obtain an exclusive lock on the file.
errno = 0;
int fd = open("/tmp/.lockfile", 
    O_CREAT | //create the file if it's not present.
    O_WRONLY | //only need write access for the internal locking semantics.
    O_EXLOCK, //use an exclusive lock when opening the file.
    S_IRUSR | S_IWUSR); //permissions on the file, 600 here.

if (fd == -1) {
    perror("open() failed");
    exit(EXIT_FAILURE);
}

printf("Entered critical section.\n");
//Do "critical" stuff here.
//...

//exit the critical section
errno = 0;
if (close(fd) == -1) {
    perror("close() failed");
    exit(EXIT_FAILURE);
}

printf("Exited critical section.\n");

edited May 16 '23 at 13:14

mcabreb

answered Mar 08 '13 at 23:34

Raffi Khatchadourian

Good code, with 1 modification: You should create the lock file before contention begins, and, of course, keep it linked. Otherwise, in my tests in Mac OS X 10.10 DP5, open() can succeed for two peer processes contending to initially create the file, if within a few milliseconds. Issue occurs with either Stéphane's or Raffi's code. I then stress-tested. Result: Raffi's code worked perfectly, Stéphane's code not quite. I did not study why. If interested see https://github.com/jerrykrinock/ClassesObjC/blob/master/SSYSemaphore.h, and .m. – Jerry Krinock Aug 15 '14 at 20:33
@JerryKrinock But, doesn't `open()` create the lock file if it's not present (when given the O_CREAT) flag? – Raffi Khatchadourian Dec 12 '14 at 21:06
I'm going to shoot from the hip, because I don't have the 30 minutes that it would take to properly refresh my understanding of this available right now. I think the answer is that, yes, open() with O_CREAT will create the file if needed, but if two processes execute open() within a few milliseconds of each other, results are unpredictable. Hence my suggestion to create the lock file before it matters; well, I'll add, unless it is OK for the first contention to be a throwaway. – Jerry Krinock Dec 13 '14 at 01:36
Compile error: "error: ‘O_EXLOCK’ undeclared (first use in this function)", Ubuntu 16.04 LTS – CodyChan Sep 04 '18 at 02:00
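On Linux, where O_EXLOCK is not defined, a similar effect can be sketched by calling open() and then flock() as two separate steps. The helper name open_exclusive() is hypothetical and the lock file path is whatever the caller chooses. Unlike O_EXLOCK, the open and the lock are not a single operation here, but the lock is still dropped automatically when the last descriptor for that open file goes away, including when the process is killed.

```c
#include <fcntl.h>
#include <sys/file.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: open the lock file (creating it if needed) and
 * block until an exclusive flock() lock is held on it.
 * Returns the descriptor, or -1 on error. */
int open_exclusive(const char *path)
{
    int fd = open(path, O_CREAT | O_WRONLY, S_IRUSR | S_IWUSR);
    if (fd == -1)
        return -1;

    /* Blocks until no other open file description holds the lock. */
    if (flock(fd, LOCK_EX) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Leaving the critical section is then just close(fd), as in the answer's code.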

score 5 · Answer 3 · answered Jan 13 '10 at 14:26

This is a typical problem when managing semaphores. Some programs use a single process to manage the initialization/deletion of the semaphore. Usually this process does just this and nothing else. Your other applications can wait until the semaphore is available. I've seen this done with the SysV type API, but not with POSIX. As 'Duck' mentioned, you can also use the SEM_UNDO flag in your semop() call.

But, with the information that you've provided, I would suggest that you do not use semaphores, especially if your process is in danger of being killed or crashing. Try to use something that the OS will clean up automagically for you.

score 2 · Answer 4 · answered Jan 13 '10 at 02:51

You'll need to double check but I believe sem_post can be called from a signal handler. If you are able to catch some of the situations that are bringing down the process this might help.

Unlike a mutex any process or thread (with permissions) can post to the semaphore. You can write a simple utility to reset it. Presumably you know when your system has deadlocked. You can bring it down and run the utility program.
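A reset utility along those lines could look roughly like the sketch below. The function name named_sem_reset() is invented for illustration. It assumes the system has been brought down as described, because the check and the post are two separate steps and would race with live users of the semaphore. It also relies on sem_getvalue(), which is not implemented on every platform.

```c
#include <fcntl.h>
#include <semaphore.h>

/* Hypothetical helper: open an existing named semaphore and, if its count
 * is zero, post it once. Intended to be run while no other process is
 * using the semaphore, since check-then-post is not atomic.
 * Returns 1 if it posted, 0 if the count was already positive, -1 on error. */
int named_sem_reset(const char *name)
{
    sem_t *sem = sem_open(name, 0);   /* no O_CREAT: must already exist */
    if (sem == SEM_FAILED)
        return -1;

    int value = 0;
    int result = 0;
    if (sem_getvalue(sem, &value) == -1)
        result = -1;
    else if (value <= 0)
        result = (sem_post(sem) == 0) ? 1 : -1;

    sem_close(sem);
    return result;
}
```

With the question's semaphore this would be called as named_sem_reset("/testing").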

Also the semaphore is usually listed under /dev/shm and you can remove it.

SysV semaphores are more accommodating for this scenario. You can specify SEM_UNDO, in which case the system will back out changes to the semaphore made by a process if it dies. They also have the ability to tell you the last process id to alter the semaphore.
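To make the SEM_UNDO idea concrete, here is a minimal sketch using the SysV calls. The helper names and the union sketch_semun type are made up for this example (Linux expects the caller to define the semctl() argument union). Note that sysv_sem_create() always sets the value to 1, so it should only be run by whichever process is responsible for initialisation.

```c
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/types.h>

/* Argument type for semctl(SETVAL); on Linux the caller defines it.
 * The name is hypothetical to avoid clashing with platforms that
 * already declare "union semun". */
union sketch_semun {
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};

/* Hypothetical helper: create a set with one semaphore, initial value 1.
 * Returns the semaphore set id, or -1 on error. */
int sysv_sem_create(key_t key)
{
    int id = semget(key, 1, IPC_CREAT | 0600);
    if (id == -1)
        return -1;

    union sketch_semun arg;
    arg.val = 1;
    if (semctl(id, 0, SETVAL, arg) == -1)
        return -1;
    return id;
}

/* Apply delta to semaphore 0 with SEM_UNDO, so the kernel reverses the
 * change if this process exits without undoing it itself. */
static int sysv_sem_op(int id, short delta)
{
    struct sembuf op;
    op.sem_num = 0;
    op.sem_op = delta;
    op.sem_flg = SEM_UNDO;
    return semop(id, &op, 1);
}

int sysv_sem_wait(int id) { return sysv_sem_op(id, -1); }
int sysv_sem_post(int id) { return sysv_sem_op(id, 1); }
```

If a process calls sysv_sem_wait() and is then killed with -9, the kernel applies the recorded adjustment and the count goes back up without any help from the dead process.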

answered Jan 13 '10 at 02:51

Duck

Some signals like kill -9 bypass signal handlers, which is the situation I've run into. I do have a signal handler for the ones I can catch, and in a destructor for a scope-based object I do call sem_post() as the stack unwinds. But those few lingering uncatchable signals are what I was hoping to solve. – Stéphane Jan 13 '10 at 03:05
I think a fair question is to ask who are the users and why are they killing the app that way? You can try the SysV route or even file locks, which should revert when the process dies. – Duck Jan 13 '10 at 03:24
Actually, that is what I decided to do last night. Since files that have been open() and lockf() are automatically released when applications are killed -9, this method of "communication" actually works more reliably than semaphores considering what I need to coordinate. – Stéphane Jan 13 '10 at 16:45

score 1 · Answer 5 · answered Jan 13 '10 at 01:06

You should be able to find it from the shell using lsof. Then possibly you can delete it?

Update

Ah yes... man -k semaphore to the rescue.

It seems you can use ipcrm to get rid of a semaphore. Seems you aren't the first with this problem.

answered Jan 13 '10 at 01:06

Carl Smotricz

Yes, I know about ipcrm, but it doesn't help. If I knew the semaphore had been lost, I could just as easily sem_post() to "get it back". The problem seems to be there is no event triggered to indicate that the application that last decremented it has been killed. – Stéphane Jan 13 '10 at 01:16
In addition, just noticed on the man page that ipcrm only works on the old SysV semaphores, not POSIX semaphores. Same with ipcs. – Stéphane Jan 13 '10 at 01:44

score 1 · Answer 6 · answered Jan 13 '10 at 01:18

If the process was KILLed then there won't be any direct way to determine that it has gone away.

You could operate some kind of periodic integrity check across all the semaphores you have - use semctl (cmd=GETPID) to find the PID for the last process that touched each semaphore in the state you describe, then check whether that process is still around. If not, perform clean up.
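One way such an integrity check might look for a single SysV semaphore is sketched below. The function name sysv_sem_is_orphaned() is hypothetical. It treats "value is zero and the last process to operate on it no longer exists" as the orphaned state, and it cannot tell a dead owner from a recycled PID, so take it as a heuristic only.

```c
#include <errno.h>
#include <signal.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/types.h>

/* Hypothetical helper: returns 1 if the semaphore is at zero and the last
 * process that operated on it no longer exists, 0 if it looks fine,
 * and -1 on error. PID reuse can make a dead owner look alive. */
int sysv_sem_is_orphaned(int semid, int semnum)
{
    int value = semctl(semid, semnum, GETVAL);
    if (value == -1)
        return -1;
    if (value > 0)
        return 0;               /* not held, nothing to recover */

    int pid = semctl(semid, semnum, GETPID);
    if (pid == -1)
        return -1;
    if (pid == 0)
        return 0;               /* no process has operated on it yet */

    /* Signal 0 performs the existence check without sending anything. */
    if (kill((pid_t)pid, 0) == 0)
        return 0;               /* process is still around */
    return (errno == ESRCH) ? 1 : 0;   /* EPERM means it exists */
}
```

When this returns 1, the clean-up step would be a semop() that posts the semaphore on behalf of the dead process.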

answered Jan 13 '10 at 01:18

martin clayton

Something along these lines is what I was looking for, but of course for POSIX semaphores. From what I can tell, the semctl() style of calls is specific to the old SysV semaphores. – Stéphane Jan 13 '10 at 01:29

AshkanVZ · Answer 7 · 2019-08-05T05:20:57.247

If you use a named semaphore, then you can use an algorithm like the one used in lsof or fuser.

Take these into consideration:

1. Each named POSIX semaphore creates a file in a tmpfs file system, usually under the path:

/dev/shm/

2. Each process on Linux has a map_files directory under the path:

/proc/[PID]/map_files/

These map files show which parts of a process's memory are mapped to which files.

So using these steps, you can find whether the named semaphore is still opened by another process or not:

1- (Optional) Find the exact path of the named semaphore (in case it's not under /dev/shm)

First, open the named semaphore in the new process and assign the result to a pointer.
Find the address stored in the pointer (usually by casting the pointer to an integer type), convert it to a hexadecimal number (e.g. 0xffff1234), and then use this path:

/proc/self/map_files/ffff1234-*

There should be only one file that fulfills this criterion.
Get the symbolic link target of that file. It is the full path of the named semaphore.

2- Iterate over all processes to find a map file whose symbolic link target matches the full path of the named semaphore. If there is one, then the semaphore is really in use; if there is none, then you can safely unlink the named semaphore and reopen it for your own use.

UPDATE

In step 2, when iterating over all processes, instead of iterating over all files in the map_files folder, it is better to read the file /proc/[PID]/maps and search it for the full path of the named semaphore file (e.g. /dev/shm/sem_xyz). With this approach, even if another program has unlinked the named semaphore while other processes are still using it, it can still be found, but with a "(deleted)" flag appended to its file path.
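The /proc/[PID]/maps scan could be sketched roughly as follows. The function name count_processes_mapping() is hypothetical and the path to search for is supplied by the caller. It uses a plain substring match, so a path that is a prefix of another will also match, and processes whose maps file cannot be read (other users, or processes that exit mid-scan) are silently skipped, so the count can be too low.

```c
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: count the processes whose /proc/<pid>/maps contains
 * the given path as a substring. Unreadable processes are skipped.
 * Returns the count, or -1 if /proc cannot be opened. */
int count_processes_mapping(const char *path)
{
    DIR *proc = opendir("/proc");
    if (proc == NULL)
        return -1;

    int count = 0;
    struct dirent *entry;
    while ((entry = readdir(proc)) != NULL) {
        /* Process directories are the ones whose names start with a digit. */
        if (!isdigit((unsigned char)entry->d_name[0]))
            continue;

        char maps_path[288];
        snprintf(maps_path, sizeof maps_path, "/proc/%s/maps", entry->d_name);

        FILE *maps = fopen(maps_path, "r");
        if (maps == NULL)
            continue;           /* process exited or access denied */

        char line[4096];
        while (fgets(line, sizeof line, maps) != NULL) {
            if (strstr(line, path) != NULL) {
                count++;        /* count each process at most once */
                break;
            }
        }
        fclose(maps);
    }

    closedir(proc);
    return count;
}
```

A result of 0 for the semaphore's file path suggests, under the caveats above, that no visible process still has it mapped. An unlinked semaphore still matches because the "(deleted)" suffix comes after the path.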

score -1 · Answer 8 · edited Jan 02 '12 at 10:38

Simply do a sem_unlink() immediately after the sem_open(). Linux will remove it after all processes have closed the resource, which includes internal closes.

edited Jan 02 '12 at 10:38

sra

answered Mar 16 '11 at 19:48

gsb

Wouldn't that cause the named semaphore to be deleted? I didn't want to delete it, I wanted to post it when one of the apps crashed or was killed. (Multiple apps in use here, all working with the same set of named semaphores to coordinate some internal work.) – Stéphane Mar 17 '11 at 04:12
This won't work: if another process is blocked on the semaphore, and the process that has it locked crashes, then the blocked process will keep the semaphore open and so the semaphore will never be destroyed. – David Given Sep 17 '12 at 16:50
