
If I do this in a bash script:

sleep 10 &
sleep_pid=$!
some_command &
cmd_pid=$!
wait -n

if kill -0 $sleep_pid 2> /dev/null; then
    # all ok
    kill $sleep_pid
else
    # some_command hung
    ...code to log diagnostics and then kill -9 $cmd_pid...
fi

where some_command is something that should be quick but can hang due to rare errors.

Is there a risk that some_command finishes and gets reaped before "wait -n" starts, leaving only the sleep to wait for? Or does the '&' after a command guarantee that the shell won't call waitpid() on it until the next line of input has been handled?

It works in interactive shells. If you do:

sleep 10 &
sleep 0 &
wait -n

then the "wait -n" returns right away, even if you wait a couple of seconds before running it. But I'm not sure whether this can be trusted in non-interactive shells.

EDIT: Clarifying need for diagnostics + some grammar.

imz -- Ivan Zakharyaschev
  • It's *more* trustworthy in non-interactive shells -- you don't have your process-table entries getting reaped to give the user interactive feedback on jobs that completed. I wouldn't particularly trust this code in an interactive shell, but it should be quite solid in a noninteractive one. – Charles Duffy Jun 21 '18 at 21:59
  • @CharlesDuffy So non-interactive shells don't do waitpid()/wait() unless explicitly asked to via the wait builtin? That means I should stop worrying about this and start looking for process leaks in all my other long running scripts instead. :) – Henrik Johansson Jun 21 '18 at 22:30
  • And a solution similar to yours is suggested in an answer there: https://stackoverflow.com/a/10028986/94687 I find this kind of solution elegant and clever; I couldn't come up with something similar myself. – imz -- Ivan Zakharyaschev Feb 09 '22 at 18:00

3 Answers

2

I believe you may be able to use the timeout command to do this.
http://man7.org/linux/man-pages/man1/timeout.1.html

timeout 10s command_to_run

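Check the exit status of timeout to see whether it timed out: GNU timeout exits with status 124 when the time limit is hit (unless --preserve-status is given), and otherwise returns the exit status of the command itself.

timeout 2s sleep 10
status=$?

if [[ $status -eq 124 ]]; then
  echo "it timed out"
elif [[ $status -eq 0 ]]; then
  echo "It was successful"
else
  echo "it failed with exit status $status"
fi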
ThrasherHT
  • timeout works for most cases, but it doesn’t work if the thing you want to run is a function in your shell script, or if you want to get a stack trace before killing. (Maybe some versions of timeout allow the latter?) – Henrik Johansson Jun 21 '18 at 20:57
0

By using the $! variable, we avoid relying on interactive job control features. Try this:

...long executing command... &
pid_long=$!

sleep 3 &
pid_sleep=$!

wait -n
kill -KILL $pid_long

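The risk here is PID recycling: if wait -n reaps the long-running command rather than the sleep, its PID becomes free for reuse, and the kill could hit an unrelated process. That is unlikely to happen within 3 seconds, though.

In the case when the command finishes earlier than the sleep (and its PID has not been recycled for a new process), kill produces an error message; we could redirect that to /dev/null with 2>/dev/null.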

We should probably also kill the sleep in case it is the one that is lingering.
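
Putting those two remarks together, a minimal sketch might look like the following. It assumes bash 4.3 or later (for wait -n); the function name run_with_deadline is hypothetical and only there to make the snippet reusable. It kills speculatively, so it returns no meaningful status and cannot tell you which job finished first.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the answer above):
# run "$@" for at most $1 seconds, then clean up both background jobs.
run_with_deadline() {
    local limit=$1 pid_long pid_sleep
    shift

    "$@" &
    pid_long=$!

    sleep "$limit" &
    pid_sleep=$!

    # Returns as soon as either background job terminates (bash >= 4.3).
    wait -n

    # Speculative kills: one of the two is already gone, so silence the error.
    kill -KILL "$pid_long" 2>/dev/null
    kill "$pid_sleep" 2>/dev/null

    # Collect whatever was just killed so that nothing lingers.
    wait "$pid_long" "$pid_sleep" 2>/dev/null
}
```

For example, `run_with_deadline 3 some_command` would give some_command roughly three seconds before sending it SIGKILL.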

Kaz
  • PID recycling isn't going to happen if the old entry is still in the process table, and if `waitpid()` or `wait()` hasn't been called (which is automatic only in interactive shells), it'll still be there as a zombie. – Charles Duffy Jun 21 '18 at 21:56
  • In my effort to keep the question short and to the point, I left out some detail that matters in this case (added in edit now). Speculative killing won't let me detect the problem and collect diagnostic info before cleaning up (which I didn't mention in the original question). – Henrik Johansson Jun 21 '18 at 21:58
  • @CharlesDuffy But `wait` **has** been called. When "wait -n" happens to reap the interesting command rather than `sleep`, then `$pid_long` is no longer a valid PID. `kill` will produce an error about a nonexistent PID. I reproduced this case in testing. – Kaz Jun 21 '18 at 22:31
  • @CharlesDuffy Non-interactive script. – Kaz Jun 22 '18 at 02:26
  • Oh -- I misread what you were saying. Yes, you're right -- if `wait -n` reaps the interesting command, it's no longer running, so it can't be killed, and yes, the PID could potentially be repurposed in that case (since it no longer has a zombie). – Charles Duffy Jun 22 '18 at 03:27
0

As @CharlesDuffy pointed out in comments, the answer is no, there is no race (provided it is run in a non-interactive shell).

Also there is no need (in non-interactive shells) to make sure the wait comes directly after the command, as non-interactive shells don't do automatic reaping of children.

But I guess one should wrap this in a sub-shell, so "wait -n" won't return early due to some previously started unrelated background job.
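
A rough sketch of that sub-shell wrapping, assuming bash 4.3 or later for wait -n. The function name run_with_watchdog is hypothetical, the echo line is only a placeholder for the real diagnostics, and the 124 return value is borrowed from timeout(1) purely as a convention.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run "$@" under a watchdog of $1 seconds.
# Returns 0 if the command finished in time, 124 if it had to be killed.
run_with_watchdog() {
    local limit=$1
    shift
    (
        # The timer and the command are both started inside this sub-shell,
        # so these are the jobs "wait -n" is meant to react to.
        sleep "$limit" &
        sleep_pid=$!
        "$@" &
        cmd_pid=$!

        wait -n    # returns when either job terminates

        if kill -0 "$sleep_pid" 2>/dev/null; then
            # Timer still running: the command finished first.
            kill "$sleep_pid"
            wait
            exit 0
        else
            # Timer expired first: the command is presumed hung.
            echo "watchdog: PID $cmd_pid exceeded ${limit}s" >&2  # placeholder for diagnostics
            kill -KILL "$cmd_pid" 2>/dev/null
            wait
            exit 124
        fi
    )
}
```

It would be called as `run_with_watchdog 10 some_command`, with the caller checking `$?` for 124 to detect a hang.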