
I have a Bash script (Bash 3.2, Mac OS X 10.8) that invokes multiple Python scripts in parallel in order to better utilize multiple cores. Each Python script takes a really long time to complete.

The problem is that if I hit Ctrl+C in the middle of the Bash script, the Python scripts do not actually get killed. How can I write the Bash script so that killing it will also kill all its background children?

Here's my original "reduced test case". Unfortunately I seem to have reduced it so much that it no longer demonstrates the problem; my mistake.

set -e

cat >work.py <<EOF
import sys, time
for i in range(10):
    time.sleep(1)
    print "Tick from", sys.argv[1]
EOF

function process {
    python ./work.py $1 &
}

process one
process two
wait

Here's a complete test case, still highly reduced, but hopefully this one will demonstrate the problem. It reproduces on my machine... but then, two days ago I thought the old test case reproduced on my machine, and today it definitely doesn't.

#!/bin/bash -e
set -x

cat >work.sh <<EOF
for i in 0 1 2 3 4 5 6 7 8 9; do
    sleep 1; echo "still going"
done
EOF
chmod +x work.sh

function kill_all_jobs { jobs -p | xargs kill; }
trap kill_all_jobs SIGINT

function process {
    ./work.sh $1
}

process one &
wait $!
echo "All done!"

This code continues to print `still going` even after Ctrl+C. But if I move the `&` from outside `process` to inside (i.e., `./work.sh $1 &`), then Ctrl+C works as expected. I don't understand this at all!

In my real script, `process` contains more than one command, and the commands are long-running and must run in sequence, so I don't know how to "move the `&` inside `process`" in that case. I'm sure it's possible, but it must be non-trivial.

$ bash --version
GNU bash, version 3.2.48(1)-release (x86_64-apple-darwin12)
Copyright (C) 2007 Free Software Foundation, Inc.

EDIT: Many thanks to @AlanCurry for teaching me some Bash stuff. Unfortunately I still don't understand exactly what's going on in my examples, but it's practically a moot point, as Alan also helpfully pointed out that for my real-world parallelization problem, Bash is the wrong tool and I ought to be using a simple makefile with `make -j3`! `make` runs things in parallel where possible, and also understands Ctrl+C perfectly; problem solved (even though question unanswered).

Quuxplusone
  • `trap` should work fine with shell functions. I took your script, added your `killstuff` definition and the `trap killstuff SIGINT` command at the top, ran it with bash, and it worked fine. I'm on Linux though, not MacOS, so maybe that's why yours behaves differently. Try running with `set -x` at the top, it will generate some debugging output and you can post that here. – Alan Curry Jul 27 '12 at 23:09
  • @AlanCurry: You're right, my old test case doesn't actually reproduce the issue. :( Yesterday I thought it did, but I must have mixed up a couple different versions of it. As my new test case shows, even a single character can make a big difference in the script's behavior. So, want to take another crack at it? – Quuxplusone Jul 29 '12 at 08:56
  • I don't have a good solution yet, but I can explain what's happening. When you run a shell function in the background, it runs in a subshell (a separate process). So you have 3 processes running: the main shell running the script, the subshell running the function, and the python process. Your kill_all_jobs function sends SIGTERM to the subshell, killing it but not the grandchild python process. And all of it is necessary only because python's default SIGINT handler is refusing to die when the original ^C was pressed. (python is in the foreground process group so it *does* get a SIGINT) – Alan Curry Jul 29 '12 at 18:40
  • But my new test case doesn't use Python at all, so that can't be the whole story...? It definitely has something to do with inability-to-kill-grandchild-processes, though. – Quuxplusone Jul 29 '12 at 19:16
  • That test case has a different problem. There are 4 processes: shell running main script, subshell for the function in the background, shell running work.sh, and sleep which is a separate process. The shell running the work.sh script gets the SIGINT and decides that you intended to kill the sleep but not the work.sh script. Do we really need to solve that one too? – Alan Curry Jul 29 '12 at 19:27
  • I believe that one is closer to my actual problem, yeah. I originally blamed Python and/or my ignorance of `trap`, but I'm gradually finding that the things I thought were problems weren't, and vice versa. I just opened http://chat.stackoverflow.com/rooms/14599/bash-scripting if you're available to chat there. (I may or may not be there in half an hour, but I'll remember to check in during work hours on Monday.) – Quuxplusone Jul 29 '12 at 20:44
  • Further research has revealed that the main problem is that SIGINT is ignored for backgrounded processes in general. Letting them keep going after the main script has exited is an intentional feature and it's hard to get around. You can't even put `trap - 2` in the work.sh because the trap builtin refuses to unignore signals that were ignored when the script started. – Alan Curry Jul 30 '12 at 00:25
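
The last comment's point, that a command started in the background from a script begins life with SIGINT ignored, can be checked with a small sketch. It assumes a non-interactive shell with job control off (the normal case for a script); the scratch file and the `bg_result` variable are only there for illustration.

```shell
#!/bin/bash
# Sketch: a command started with '&' from a non-interactive script
# inherits SIGINT as "ignored", so Ctrl+C cannot stop it.

out=$(mktemp)   # illustrative scratch file for the child's output

# The child sends SIGINT to itself. If SIGINT is ignored, it carries on
# and prints a message; otherwise it would die before reaching the echo.
bash -c 'kill -INT $$; echo "survived SIGINT"' >"$out" &
wait $!

bg_result=$(cat "$out")
rm -f "$out"
echo "background child: ${bg_result:-killed by SIGINT}"
```

Running the same `bash -c` line without the `&` should end the child before the `echo`, which is the difference between the two placements of `&` in the second test case.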

2 Answers


Your trap looks good to me:

$ bash --version
GNU bash, version 3.2.48(1)-release (x86_64-apple-darwin11)
Copyright (C) 2007 Free Software Foundation, Inc.

$ cat ./thang 
#! /bin/bash
set -e

cat >work.py <<EOF
import sys, time
for i in range(10):
  time.sleep(1)
  print "Tick from", sys.argv[1]
EOF

function process {
  python ./work.py $1 &
}

function killstuff {
  jobs -p | xargs kill
}

trap killstuff SIGINT

process one
process two
wait

$ ./thang 
Tick from one
Tick from two
Tick from one
Tick from two
^C$ ps aux | grep python | grep -v grep
$
phs
  • I screwed up the test case somehow. The problem must not be with the `trap`, but something to do with which processes count as my children, my grandchildren, or someone else's children entirely. Take a look at the new test case? BTW, thanks for the `jobs -p | xargs kill` idiom; that's a lot better than my original for loop. – Quuxplusone Jul 29 '12 at 08:58

I got it! All you have to do is get rid of that Python SIGINT handler.

cat >work.py <<'EOF'
import sys, time, signal
signal.signal(signal.SIGINT, signal.SIG_DFL)
for i in range(10):
    time.sleep(1)
    print "Tick from", sys.argv[1]
EOF
chmod +x work.py

function process {
    python ./work.py $1
}

process one &
wait $!
echo "All done!"
Alan Curry
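
For the case described in the question, where `process` has to run several long commands in sequence, one possible approach (a sketch, not something proposed in this thread) is to give the backgrounded function its own SIGTERM trap that kills whichever step is currently running. SIGTERM is used because that is what `jobs -p | xargs kill` sends, and because background jobs ignore SIGINT. The helper `run_step`, the variable names, and the `sleep 10` stand-ins are all hypothetical.

```shell
#!/bin/bash
# Sketch: make a backgrounded function kill its own current child when
# the parent script sends it SIGTERM.

# Hypothetical helper: run one step as a child and remember its PID,
# so the TERM trap below knows what to kill.
run_step() {
    "$@" &
    step_pid=$!
    wait "$step_pid"
}

process() {
    # On SIGTERM, kill the step that is running now, then leave the subshell.
    trap 'kill "$step_pid" 2>/dev/null; exit 143' TERM
    run_step sleep 10   # stand-in for the first long command
    run_step sleep 10   # stand-in for the second long command
}

process &
worker=$!

sleep 1             # give the worker time to start its first step
kill "$worker"      # simulates what a `jobs -p | xargs kill` INT trap does

worker_status=0
wait "$worker" || worker_status=$?
# 143 = 128 + 15 (SIGTERM), the value the trap exits with
echo "worker exit status: $worker_status"
```

Because each step is waited on with the `wait` builtin, the subshell can run its trap as soon as the signal arrives instead of only after the current step finishes.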