5

General question: What could cause a script that works fine in itself, to hang if the script or shell (bash) command that calls it calls it into a variable?

In other words, how could there be a script that works when called like this... /path/to/script arg arg ...fail and hang when called like this... VAR=$(/path/to/script arg arg);?


(major edits after noticing that a software glitch caused much initial testing to give incorrect results)


My specific case: I've got a script that works fine (to start, stop or restart the java application Apache Solr, adapted from here). Code is below, its command is sbin/service solr [action], for example sbin/service solr start.

When called from a script or directly from a console (bash in my case), like sbin/service solr start, it works fine and completes quickly. If, however, it's called into a variable, like VAR=$(sbin/service solr start);, it works, but hangs with a futext / clock_gettime loop (trace below). It also hangs if it's called not into a variable, but into strace.

Curiously, other scripts called in the same way with the same syntax, e.g. sbin/service httpd start, work just fine when called into a variable. So it's clearly possible for there to be something about a script that makes it hang when the output is stored as a variable, but operate perfectly well when this isn't the case.


Here's the results of testing what calls hang and what don't:

HANGS------------------------------------------------

  • VAR=$(/sbin/service solr start);
  • VAR=$(source /sbin/service solr start);
  • VAR=$(nohup /sbin/service solr start &);

(so it doesn't matter which process it's called from) Also, editting the script file to start the service using source causes the service to not work.

DOESN'T HANG-------------------------------------

  • VAR=$(/sbin/service solr start >> /dev/null);

Outputting to /dev/null allows us to request the output without causing it to hang. However, it's not much use, as no actual output is received.

  • /sbin/service solr start

Contrary to what I first thought. This outputs a simple update message which, ideally, we'd capture in a variable and log - but attempting to do so causes it to hang.

  • VAR=$(/sbin/service httpd restart);

The syntax that hangs works just fine on other service scripts, and the output of the script is passed to the variable without issue.


Here's the full code of that script: (comments removed, naturally the $SOLR_DIR path is a real one in the real script)

SOLR_DIR="[path/to/application]"
JAVA_OPTIONS="-Xms64m -Xmx64m -DSTOP.PORT=8079 -DSTOP.KEY=mustard -jar start.jar"
LOG_FILE="/var/log/solr.log"
JAVA="/usr/bin/java"

case $1 in
    start)
        echo "Starting Solr"
        cd $SOLR_DIR
        $JAVA $JAVA_OPTIONS 2> $LOG_FILE &
        ;;
    stop)
        echo "Stopping Solr"
        cd $SOLR_DIR
        $JAVA $JAVA_OPTIONS --stop
        ;;
    restart)
        $0 stop
        sleep 1
        $0 start
        ;;
    *)
        echo "Usage: $0 {start|stop|restart}" >&2
        exit 1
        ;;
esac

No errors or anything unusual in var/log/solr.log (log file named in script). Centos Linux server if that's relevant.


In response to an earlier version of the question, @cdarke suggested I run strace -f -o strace.out /path/to/script on a script calling this script, and look in the (massive!) output file strace.out. It's nearly 3mbs, here's some observations:

  1. Starts with lots of activity that looks like the script functioning as intended.

  2. Then, the last 15% or so of the log file is this, repeated with different integers but seemingly the same hex codes:

...

25687 futex(0x688d454, FUTEX_WAIT_PRIVATE, 1, {0, 49980000}) = -1 ETIMEDOUT (Connection timed out)
25687 futex(0x688d428, FUTEX_WAKE_PRIVATE, 1) = 0
25687 clock_gettime(CLOCK_MONOTONIC, {39074112, 932735888}) = 0
25687 clock_gettime(CLOCK_REALTIME, {1355007234, 333458000}) = 0

These PIDs come up with nothing when put through ps -p even if I do so while the script is still running, while the output file is still getting bigger and while these lines of code are still being written. I'm not quite sure how that's possible.

Here's the final bit out the output before it gets in the never-ending futex/clock_gettime loop, after the last part that is clearly the script executing correctly (solr/solr.xml is a Solr config file, which needs to be read to start the Solr process):

25874 stat("solr/solr.xml", {st_mode=S_IFREG|0777, st_size=1320, ...}) = 0
25874 write(2, "Dec 8, 2012 5:12:05 PM org.apach"..., 106) = 106
25874 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 89
25874 fcntl(89, F_GETFL)                = 0x2 (flags O_RDWR)
25874 fcntl(89, F_SETFL, O_RDWR|O_NONBLOCK) = 0
25874 setsockopt(89, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
25874 bind(89, {sa_family=AF_INET, sin_port=htons(8983), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
25874 listen(89, 50)                    = 0
25874 setsockopt(89, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
25874 lseek(12, 57747, SEEK_SET)        = 57747
25874 read(12, "PK\3\4\n\0\0\0\10\0\221Vi>F\347\254\364\325\4\0\0002\t\0\0002\0\0\0", 30) = 30
25874 lseek(12, 57827, SEEK_SET)        = 57827
25874 read(12, "\225V\377oSU\24\377\334\273\256\257_\36l\216m\254\262\351\224\241]\273\255\200\314/\5\246c\200"..., 1237) = 1237
25874 futex(0x2aaab0173054, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x2aaab0173050, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1} <unfinished ...>
25894 <... futex resumed> )             = 0
25894 futex(0x2aaab0173028, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
25874 <... futex resumed> )             = 1
25874 futex(0x2aaab0173028, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
25894 <... futex resumed> )             = 0
25894 futex(0x2aaab0173028, FUTEX_WAKE_PRIVATE, 1) = 0
25894 clock_gettime(CLOCK_REALTIME, {1355008325, 376033000}) = 0
25894 futex(0x2aaab0173054, FUTEX_WAIT_PRIVATE, 3, {0, 983000} <unfinished ...>
25874 <... futex resumed> )             = 1
25874 futex(0x2aaab0173054, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x2aaab0173050, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1} <unfinished ...>
25894 <... futex resumed> )             = 0
25894 futex(0x2aaab0173028, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
25874 <... futex resumed> )             = 1
25874 futex(0x2aaab0173028, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
25894 <... futex resumed> )             = 0
25894 futex(0x2aaab0173028, FUTEX_WAKE_PRIVATE, 1) = 0
25894 poll([{fd=89, events=POLLIN|POLLERR}], 1, -1 <unfinished ...>
25874 <... futex resumed> )             = 1
25874 write(2, "2012-12-08 17:12:05.376:INFO::St"..., 66) = 66
25874 write(2, "\n", 1)                 = 1
25874 mmap(0x41348000, 12288, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x41348000
25874 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
25874 sched_getaffinity(25874, 32,  { ffff, 0, 0, 0 }) = 32
25874 sched_getaffinity(25874, 32,  { ffff, 0, 0, 0 }) = 32
25874 gettid()                          = 25874
25874 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
25874 rt_sigprocmask(SIG_UNBLOCK, [HUP ILL BUS FPE SEGV USR2 TERM], NULL, 8) = 0
25874 rt_sigprocmask(SIG_BLOCK, [QUIT], NULL, 8) = 0
25874 mmap(0x41348000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x41348000
25874 mprotect(0x41348000, 12288, PROT_NONE) = 0
25874 futex(0x10632d54, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished ...>
25882 <... futex resumed> )             = -1 ETIMEDOUT (Connection timed out)
25882 futex(0x106cc428, FUTEX_WAKE_PRIVATE, 1) = 0
25882 clock_gettime(CLOCK_MONOTONIC, {39075204, 21489888}) = 0
25882 clock_gettime(CLOCK_REALTIME, {1355008325, 422198000}) = 0
25882 futex(0x106cc454, FUTEX_WAIT_PRIVATE, 1, {0, 49984000}) = -1 ETIMEDOUT (Connection timed out)
25882 futex(0x106cc428, FUTEX_WAKE_PRIVATE, 1) = 0
25882 clock_gettime(CLOCK_MONOTONIC, {39075204, 72479888}) = 0
25882 clock_gettime(CLOCK_REALTIME, {1355008325, 473185000}) = 0
25882 futex(0x106cc454, FUTEX_WAIT_PRIVATE, 1, {0, 49987000}) = -1 ETIMEDOUT (Connection timed out)
25882 futex(0x106cc428, FUTEX_WAKE_PRIVATE, 1) = 0

So the last line before the death spiral is a read() on channel 12. Then it just loops futex and clock_gettime until it's manually killed.


This last point may be irrelevant, but if, similar to in this question, I run the script that calls this script using nohup and divert output to /dev/null, I get the following near the start ( about 100kb into the output file): a huge number of these:

25664 close(67) = -1 EBADF (Bad file descriptor)

They go from 67, increasing +1 each time, to:

25664 close(1023) = -1 EBADF (Bad file descriptor)

Then they're followed by

25664 open("/dev/null", O_RDWR) = 3

Again, the PIDs are empty, so far as I can see. Not sure if this is relevant - I guess this opens the possibility that using nohup with output to /dev/null really is a general fix for problems like this but I'm doing it wrong somehow, causing these errors.

Community
  • 1
  • 1
user56reinstatemonica8
  • 32,576
  • 21
  • 101
  • 125
  • Which shell? For example, ksh93 will use `exec` to execute the final command in a script, so you won't necessarily see two processes. We need to know if the 'hang' is waiting on something, like a signal. Try running `strace -f -o strace.out ScriptB`. Then look at strace.out (it will be large). That should trace the kernel calls and show if it is waiting on anything. – cdarke Dec 08 '12 at 22:01
  • @cdarke Thanks for your post. Re. which shell: `ps -p $$ -o comm,args` gives me `bash` with args `-bash`. Re. `strace` - you were right, it's nearly 3mbs! I'm going to edit my observations from that file into the question. – user56reinstatemonica8 Dec 08 '12 at 22:59
  • @cdarke I've editted in all the output that looks to me to be relevant, along with the very limited info I can get by looking up the PIDs of those processes. Many thanks for the help, please do ask if there's anything specific I should look for in the output. – user56reinstatemonica8 Dec 08 '12 at 23:26
  • 1. not sure why you used `nohup $( /sbin/service .. ) > /dev/null`. That can only complicate things (I think). try just a plain `nohup /sbin/service ... > /dev/null` (but you probably have tried this, given the time you've put into this). 2. From your headline, I thought a "a blocking terminal read", meaning just the plain shell `read` command waiting for input, but I don't see it in your code posted, but it's in your trace... Just an idea for further exploration. 3. As an alternate, can you take out the nohups and run this all from a crontab entry? Good luck – shellter Dec 09 '12 at 00:55
  • Turns out a software glitch caused much of my initial testing to be wrong. There's still an unexplained (to me) issue with the script, but it's not quite as crazy as it seemed before. Sorry for the misinformation - I've re-scoped the question, seems much less dauntingly counter-intuitive now and has less serious symptoms. – user56reinstatemonica8 Dec 09 '12 at 01:53

1 Answers1

2

I'm pretty sure the problem is that the shell is capturing output from the /sbin/service script and the solr service it starts, and hence will wait for the service to exit (or at least close its stdout) before continuing. Here's a simple demo:

$ bg_service() { while true; do sleep 10; done; }
$ start_bg_service() { echo "starting"; bg_service& echo "running"; }
$ start_bg_service 
starting
[1] 8656
running
$ var=$(start_bg_service)
[It hangs at this point... until I open another shell and kill the background process]
Gordon Davisson
  • 118,432
  • 16
  • 123
  • 151
  • That sounds likely. Any ideas on how I can start the solr service in a way that doesn't capture output of the nested script and hang? I'd like what calls the startup script to capture output from the startup script itself, and output from the process launched by the startup script to go to a log file and be ignored by the parent that called it – user56reinstatemonica8 Dec 09 '12 at 12:43
  • 1
    The startup script currently redirects stderr of the server to a log; just modify it to redirect stdout there as well: `$JAVA $JAVA_OPTIONS &> $LOG_FILE &` – Gordon Davisson Dec 09 '12 at 18:17
  • Cool. Based on your comment I went with `$JAVA $JAVA_OPTIONS 1> /dev/null 2> $LOG_FILE &` to discard the stdout and log the stderr, which more closely mirrors the previous set up (log errors but don't log standard output). Same principle, works perfectly! Thanks :) – user56reinstatemonica8 Dec 09 '12 at 21:28
  • 1
    If you don't need stdin I would recommend adding – Gilbert Mar 30 '13 at 02:02