305

I have a Python script that checks a queue and performs an action on each item:

# checkqueue.py
while True:
  check_queue()
  do_something()

How do I write a bash script that checks whether it's running and, if not, starts it? Roughly the following pseudocode (or maybe it should do something like ps | grep?):

# keepalivescript.sh
if processidfile exists:
  if processid is running:
     exit, all ok

run checkqueue.py
write processid to processidfile

I'll call that from a crontab:

# crontab
*/5 * * * * /path/to/keepalivescript.sh
Benyamin Jafari
Tom
  • 4
    Just to add this for 2017: use supervisord. crontab is not meant to do this kind of task, and a bash script is terrible at emitting the real error. http://stackoverflow.com/questions/9301494/how-to-restart-only-certain-processes-using-supervisorctl – mootmoot Mar 07 '17 at 19:10
  • 1
    How about using inittab and respawn instead of other non-system solutions? See https://superuser.com/a/507835/116705 – Lars Nordin Nov 27 '19 at 16:00

10 Answers

759

Avoid PID-files, crons, or anything else that tries to evaluate processes that aren't their children.

There is a very good reason why in UNIX, you can ONLY wait on your children. Any method (ps parsing, pgrep, storing a PID, ...) that tries to work around that is flawed and has gaping holes in it. Just say no.

Instead you need the process that monitors your process to be the process' parent. What does this mean? It means only the process that starts your process can reliably wait for it to end. In bash, this is absolutely trivial.

until myserver; do
    echo "Server 'myserver' crashed with exit code $?.  Respawning.." >&2
    sleep 1
done

The above piece of bash code runs myserver in an until loop. The first line starts myserver and waits for it to end. When it ends, until checks its exit status. If the exit status is 0, it means it ended gracefully (which means you asked it to shut down somehow, and it did so successfully). In that case we don't want to restart it (we just asked it to shut down!). If the exit status is not 0, until will run the loop body, which emits an error message on STDERR and restarts the loop (back to line 1) after 1 second.

Why do we wait a second? Because if something's wrong with the startup sequence of myserver and it crashes immediately, you'll have a very intensive loop of constant restarting and crashing on your hands. The sleep 1 takes away the strain from that.
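
As a minimal sketch, the same loop can be wrapped in a reusable function. The function name respawn and the RESPAWN_DELAY variable are hypothetical names chosen for this illustration, not part of the answer above; the code assumes a POSIX shell.

```shell
#!/bin/sh
# respawn CMD [ARGS...]: hypothetical helper, not a standard command.
# Runs CMD as a child of this shell and restarts it until it exits with 0.
# RESPAWN_DELAY (assumed name) is the pause between restarts; default 1 second.
respawn() {
    until "$@"; do
        # $? still holds the exit status of the command that just failed.
        echo "'$1' exited with code $?. Respawning.." >&2
        sleep "${RESPAWN_DELAY:-1}"
    done
}
```

Calling respawn myserver then behaves like the loop above: the shell is the parent of myserver, so it learns about the exit directly instead of guessing from a PID.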

Now all you need to do is start this bash script (asynchronously, probably), and it will monitor myserver and restart it as necessary. If you want to start the monitor on boot (making the server "survive" reboots), you can schedule it in your user's cron(1) with an @reboot rule. Open your cron rules with crontab:

crontab -e

Then add a rule to start your monitor script:

@reboot /usr/local/bin/myservermonitor

Alternatively; look at inittab(5) and /etc/inittab. You can add a line in there to have myserver start at a certain init level and be respawned automatically.


Edit.

Let me add some information on why not to use PID files. While they are very popular; they are also very flawed and there's no reason why you wouldn't just do it the correct way.

Consider this:

  1. PID recycling (killing the wrong process):

    • /etc/init.d/foo start: start foo, write foo's PID to /var/run/foo.pid
    • A while later: foo dies somehow.
    • A while later: any random process that starts (call it bar) takes a random PID, imagine it taking foo's old PID.
    • You notice foo's gone: /etc/init.d/foo restart reads /var/run/foo.pid, checks to see if it's still alive, finds bar, thinks it's foo, kills it, starts a new foo.
  2. PID files go stale. You need over-complicated (or should I say, non-trivial) logic to check whether the PID file is stale, and any such logic is again vulnerable to 1.

  3. What if you don't even have write access or are in a read-only environment?

  4. It's pointless overcomplication; see how simple my example above is. No need to complicate that, at all.

See also: Are PID-files still flawed when doing it 'right'?

By the way; even worse than PID files is parsing ps! Don't ever do this.

  1. ps is very unportable. While you find it on almost every UNIX system; its arguments vary greatly if you want non-standard output. And standard output is ONLY for human consumption, not for scripted parsing!
  2. Parsing ps leads to a LOT of false positives. Take the ps aux | grep PID example, and now imagine someone starting a process with a number somewhere as argument that happens to be the same as the PID you started your daemon with! Imagine two people starting an X session and you grepping for X to kill yours. It's just all kinds of bad.
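
To make the false-positive problem concrete, here is a small sketch that greps a made-up process listing for the PID 1234. The PIDs and command lines are invented for this illustration and are not real ps output; count_matches is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical process listing: PID followed by command line.
# Every PID and command here is made up for the illustration.
listing='1234 myserver
5678 vim notes-1234.txt
9012 sleep 1234'

# Naive check in the style of "ps aux | grep PID":
# count the lines that contain the PID anywhere.
count_matches() {
    printf '%s\n' "$listing" | grep -c "$1"
}

count_matches 1234
```

The count is 3, although only the first line is the daemon; the other two merely carry the same digits in their arguments.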

If you don't want to manage the process yourself; there are some perfectly good systems out there that will act as monitor for your processes. Look into runit, for example.

Community
lhunath
  • You might add some code to send a message or stop the loop if it restarts too many times in a short period of time. – Chas. Owens Mar 30 '09 at 13:40
  • +1 most correct answer. But you are somewhat too pragmatic about pid files... SysV init scripts are based heavily on pid files, mostly because the start and stop states may be in different pgids. – Juliano Mar 30 '09 at 23:02
  • 2
    @Chas. Owens: I don't think that's necessary. It would just complicate the implementation for no good reason. Simplicity is always more important; and if it restarts often, the sleep will keep it from having any bad impact on your system resources. There is already a message anyway. – lhunath Mar 31 '09 at 06:22
  • 1
    @Juliano: I know PID files are used everywhere. It doesn't mean they're not just as flawed as they were before. Start foo, put its PID in foo.pid. Foo dies. Something else gets started somewhere, takes a random PID which happens to be the one foo *had*. Stopping foo will kill the wrong process! – lhunath Mar 31 '09 at 06:27
  • Only root has access to `/etc/inittab` - how would a mere user ensure that some process always gets restarted in a manner that would handle both a process crash and a system restart? – hippietrail Feb 23 '11 at 08:41
  • http://stackoverflow.com/questions/822797/about-the-pid-of-the-process/822812#822812 – Laurent Debricon Apr 01 '11 at 11:29
  • Sounds clear and easy until you need to manage some process with a timeout without implementing that logic in the child process. There is no convenient, easy-to-use built-in method to do it. – ДМИТРИЙ МАЛИКОВ Dec 14 '11 at 13:18
  • I know this is naive... Running a script like this and then having a 2nd server ping server 1 to see if the service is up is the best peace of mind I can get, I guess. There really isn't any 2nd layer for making sure the until script is running. I mean, why would there be? The script is not doing anything. – Tegra Detra Feb 26 '13 at 09:09
  • trap '~/.bin/panic' EXIT; # is this just crazy talk or does it make it safer? – Tegra Detra Feb 26 '13 at 09:46
  • @hippietrail cron has `@reboot` time specification – andreabedini Jul 17 '13 at 05:40
  • How resource-intensive is such a loop and will it make a difference to use sleep greater than 1? – orschiro Nov 28 '13 at 17:30
  • 3
    @orschiro There is no resource consumption when the program behaves. If it exits immediately on launch, continuously, the resource consumption with a sleep 1 is still utterly negligible. – lhunath Nov 29 '13 at 18:19
  • 8
    Can't believe I'm *just* seeing this answer. Thanks so much! – getWeberForStackExchange Dec 29 '13 at 05:29
  • Unfortunately, my process does not die and return an error upon failure. I still need to reset it automatically. – Tomáš Zato Jan 15 '14 at 23:04
  • 5
    @TomášZato you can do the above loop without testing the process' exit code `while true; do myprocess; done` but note that there is now no way to stop the process. – lhunath Jan 19 '14 at 01:57
  • The problem was that the process wouldn't ever exit... I fixed that in the code however and now I'm using your answer. – Tomáš Zato Jan 19 '14 at 13:29
  • I was just writing a process monitor for my autossh tunnels, and searched for the best practice to check process alive based on the pid. I had to scrap most of the code I had already written, I hate you ;) Now it's so simple and efficient, thanks to you! – Floyd Feb 26 '14 at 07:04
  • If bash that is executing provided script is closed then the process that was launched is still being executed. And that's a problem for me. – Sergey P. aka azure Mar 11 '14 at 11:31
  • 2
    @SergeyP.akaazure The only way to force the parent to kill the child on exit in bash is to turn the child into a job and signal it: `trap 'kill $(jobs -p)' EXIT; until myserver & wait; do sleep 1; done` – lhunath Mar 12 '14 at 13:47
  • Let me know if this workaround combined with a PID file is still flawed http://stackoverflow.com/questions/25906020/are-pid-files-still-flawed-when-doing-it-right – Karussell Sep 18 '14 at 06:53
  • What if I want to do it for multiple processes? – Rony Tesler Apr 06 '22 at 14:41
52

Have a look at monit (http://mmonit.com/monit/). It handles start, stop and restart of your script and can do health checks plus restarts if necessary.

Or do a simple script:

while true
do
/your/script
sleep 1
done
Eric
Bernd
27

In-line:

while true; do <your-bash-snippet> && break; done

This restarts <your-bash-snippet> continuously whenever it fails; && break stops the loop once <your-bash-snippet> exits gracefully (return code 0).

To restart <your-bash-snippet> in all cases:

while true; do <your-bash-snippet>; done

e.g. #1

while true; do openconnect x.x.x.x:xxxx && break; done

e.g. #2

while true; do docker logs -f container-name; sleep 2; done
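
If endless restarts are a concern, the same while/break idea can be capped. The following is a minimal sketch assuming a POSIX shell; the function name retry_up_to, its argument order, and the RETRY_DELAY variable are hypothetical and chosen only for this illustration.

```shell
#!/bin/sh
# retry_up_to MAX CMD [ARGS...]: hypothetical helper, not a standard command.
# Runs CMD until it succeeds, giving up after MAX attempts.
# Returns 0 on success, 1 if every attempt failed.
# Note: max and attempt are plain (global) shell variables.
retry_up_to() {
    max=$1
    shift
    attempt=1
    while true; do
        "$@" && break                            # success: leave the loop
        [ "$attempt" -ge "$max" ] && return 1    # out of attempts: give up
        attempt=$((attempt + 1))
        sleep "${RETRY_DELAY:-1}"                # assumed name; default 1 second
    done
    return 0
}
```

For instance, retry_up_to 5 openconnect x.x.x.x:xxxx would try the connection at most five times. The break leaves the loop only when the command's exit status is 0, which is the same rule the one-liner above relies on.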
Tom
Benyamin Jafari
  • This is my favorite answer: in-line works great, with no extra software dependency. I want this in command form; let's call it jafari. Here is what I used it for: `while true; do ffmpeg -f x11grab -framerate 30 -video_size 1920x1080 -i :10.0 -f mpegts srt://:6666?mode=listener && break; done`. There should be: `jafari ffmpeg -f x11grab -framerate 30 -video_size 1920x1080 -i :10.0 -f mpegts srt://:6666?mode=listener` – Shodan Feb 08 '22 at 08:07
  • 1
    The use of `break` does need a little explanation, but that answer is great! – Tom Aug 06 '22 at 08:14
11

The easiest way is to use flock on a file. In the Python script you'd do:

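import fcntl, os, sys
lf = open('/tmp/script.lock', 'a')  # 'a' avoids truncating a running instance's PID
try:
    fcntl.flock(lf, fcntl.LOCK_EX | fcntl.LOCK_NB)  # raises IOError if already locked
except IOError:
    sys.exit('other instance already running')
lf.truncate(0)  # we hold the lock now, so replace any stale PID
lf.write('%d\n' % os.getpid())
lf.flush()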

In shell you can actually test if it's running:

if [ "$(flock -xn /tmp/script.lock -c 'echo 1')" ]; then
   echo "it's not running"
   # restart it here
else
   printf "it's already running with PID "
   cat /tmp/script.lock
fi

But of course you don't have to test, because if it's already running and you restart it, the new instance will exit with 'other instance already running'.

When a process dies, all its file descriptors are closed and all locks are automatically removed.
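
On the shell side, the flock(1) utility can also wrap the command directly, so the lock is held for exactly as long as the command runs. This is a minimal sketch, assuming the util-linux flock command is installed; the function name run_single is hypothetical.

```shell
#!/bin/sh
# run_single LOCKFILE CMD [ARGS...]: hypothetical wrapper around flock(1).
# Runs CMD while holding an exclusive lock on LOCKFILE.
# With -n, flock does not wait: if another process already holds the lock,
# it fails immediately (exit status 1 by default) and CMD is not started.
run_single() {
    lockfile=$1
    shift
    flock -n "$lockfile" "$@"
}
```

A crontab entry could then run something like flock -n /tmp/script.lock /path/to/checkqueue.py every few minutes. While the script is still running, the new invocation exits at once; after it has died, the lock is free and the script starts again.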

Teddy Markov
vartec
  • that could conceivably simplify it a bit by removing the bash script. what happens if the python script crashes? is the file unlocked? – Tom Mar 30 '09 at 11:46
  • 1
    File lock is released as soon as the application stops, either by killing, naturally or crashing. – Christian Witts Mar 30 '09 at 11:54
  • @Tom ...to be a little more precise -- the lock is no longer active as soon as the file handle it's on closes. If the Python script never closes the file handle by intent, and makes sure it doesn't get closed automatically via the file object being garbage-collected, then it closing probably means the script exited / was killed. This works even for reboots and such. – Charles Duffy Jul 18 '13 at 12:13
  • 1
    There are much better ways to use `flock`... in fact, the man page explicitly demonstrates how! `exec {lock_fd}>/tmp/script.lock; flock -x "$lock_fd"` is the bash equivalent to your Python, and leaves the lock held (so if you then exec a process, the lock will stay held until that process exits). – Charles Duffy Oct 03 '14 at 23:35
  • I downvoted you because your code is wrong. Using `flock` is the correct way, but your scripts are wrong. The only command you need to set in crontab is: `flock -n /tmp/script.lock -c '/path/to/my/script.py'` – Rutrus Aug 26 '18 at 07:53
6

You should use monit, a standard unix tool that can monitor different things on the system and react accordingly.

From the docs: http://mmonit.com/monit/documentation/monit.html#pid_testing

check process checkqueue.py with pidfile /var/run/checkqueue.pid
       if changed pid then exec "checkqueue_restart.sh"

You can also configure monit to email you when it does do a restart.

clofresh
5
if ! test -f "$PIDFILE" || ! psgrep "$(cat "$PIDFILE")"; then
    restart_process &   # must run in the background so that $! is its PID
    # Write PIDFILE
    echo $! >"$PIDFILE"
fi
soulmerge
  • cool, that's fleshing out some of my pseudo code pretty well. two qns: 1) how do I generate PIDFILE? 2) what's psgrep? it's not on ubuntu server. – Tom Mar 30 '09 at 11:43
  • 1
    psgrep is just a small app that does the same as `ps ax|grep ...`. You can just install it or write a function for that: `psgrep() { ps ax | grep -v grep | grep -q "$1"; }` – soulmerge Mar 30 '09 at 11:46
  • Just noticed that I hadn't answered your first question. – soulmerge Mar 30 '09 at 12:12
  • 7
    On really busy server it's possible that PID will get recycled before you check. – vartec Mar 30 '09 at 12:20
5
watch "yourcommand"

It will restart the process if/when it stops (after a 2s delay).

watch -n 0.1 "yourcommand"

To restart it after 0.1s instead of the default 2 seconds.

watch -e "yourcommand"

To stop restarts if the program exits with an error.

Advantages:

  • built-in command
  • one line
  • easy to use and remember.

Drawbacks:

  • Only displays the result of the command on the screen once it's finished
Tom
  • This doesn't seem accurate, "watch - execute a program periodically", meaning it will execute every xx seconds, not if/when the process stops. – smartins Nov 25 '21 at 10:34
  • 2
    @smartins the delay is an interval, per the doc. So with `-n 5` it will run the command again 5 seconds after the last one stopped. You can test it with `watch -n 5 "sleep 5"` and see that it's updated every 10 seconds. – Tom Nov 25 '21 at 13:09
4

I'm not sure how portable it is across operating systems, but you might check if your system contains the 'run-one' command, i.e. "man run-one". Specifically, this set of commands includes 'run-one-constantly', which seems to be exactly what is needed.

From man page:

run-one-constantly COMMAND [ARGS]

Note: obviously this could be called from within your script, but also it removes the need for having a script at all.

  • Does this offer any advantage over the accepted answer? – tripleee Oct 26 '18 at 04:44
  • 1
    Yes, I think it is preferable to use a built-in command than to write a shell script that does the same thing that will have to be maintained as a part of system codebase. Even if the functionality is required as part of a shell script the above command could also be used so it is relevant to a shell scripting question. – Daniel Bradley Oct 27 '18 at 05:23
  • 1
    This is not "built in"; if it's installed by default on some distro, your answer should probably specify the distro (and ideally include a pointer for where to download it if yours isn't one of them). – tripleee Oct 27 '18 at 07:03
  • 1
    Looks like it's an Ubuntu utility; but it's optional even on Ubuntu. https://manpages.ubuntu.com/manpages/bionic/man1/run-one.1.html – tripleee Oct 27 '18 at 07:06
  • 1
    Worth noting: the run-one utilities do exactly what their name says - you can only run one instance of any command that is run with run-one-nnnnn. Other answers here are more executable agnostic - they don't care about the content of the command at all. – David Kohen Feb 27 '20 at 08:41
1

I've used the following script with great success on numerous servers:

pid=$(jps -v | grep "$INSTALLATION" | awk '{print $1}')
echo "$INSTALLATION found at PID $pid"
while [ -e "/proc/$pid" ]; do sleep 0.1; done

notes:

  • It's looking for a Java process, so I can use jps, which is much more consistent across distributions than ps
  • $INSTALLATION contains enough of the process path that it's totally unambiguous
  • Use sleep while waiting for the process to die, avoid hogging resources :)

This script is actually used to shut down a running instance of tomcat, which I want to shut down (and wait for) at the command line, so launching it as a child process simply isn't an option for me.

Kevin Wright
  • 1
    `grep | awk` is still an [antipattern](http://www.iki.fi/era/unix/award.html#grep) - you want `awk "/$INSTALLATION/ { print \$1 }"` to conflate the useless `grep` into the Awk script, which can find lines by regular expression itself very well, thank you very much. – tripleee Nov 12 '15 at 05:01
1

I use this for my npm process:

#!/bin/bash
for (( ; ; ))
do
date +"%T"
echo Start Process
cd /toFolder
sudo process
date +"%T"
echo Crash
sleep 1
done
BitDEVil2K16