I have a central server where a cron job periodically starts a script that checks remote servers. The checks run serially: first one server, then the next, and so on.

This script (on the central server) starts another script (let's call it update.sh) on the remote machine, and that script does something like this:

processID=`pgrep "processName"` 
kill $processID
startProcess.sh

The process is killed, and startProcess.sh then restarts it like this:

pidof "processName"

if [ ! $? -eq 0 ]; then
    nohup "processName" "processArgs" >> "processLog" &
    pidof "processName"
    if [ ! $? -eq 0 ]; then
        echo "Error: failed to start process"
...

update.sh, startProcess.sh and the actual binary of the process are all on an NFS share mounted from the central server.

What sometimes happens is that the process I try to start in startProcess.sh is not started and I get the error. The strange part is that it is random: sometimes the process starts on a given machine, and another time it doesn't start on that same machine. I'm checking about 300 servers and the errors are always random.

Another thing: the remote servers are in 3 different geographic locations (2 in America and 1 in Europe), and the central server is in Europe. From what I have discovered so far, the servers in America have many more errors than those in Europe.

At first I thought the error had something to do with kill, so I added a sleep between the kill and startProcess.sh, but that didn't make any difference.

It also seems that the process from startProcess.sh is not started at all, or something happens to it right as it starts, because there is no output in the log file and there should be.

So I'm asking for help here.

Has anybody had this kind of problem, or does anyone know what might be wrong?

Thanks for any help

Jan
    I suspect that you may do better on [Server Fault](http://serverfault.com/) than on Stack Overflow. Your symptoms sound like the trans-Atlantic connections are probably slower, and NFS operations are more likely to time out. If the software is automounted, it could be that the relevant directories are not available when the commands fail, but are available when the commands succeed; I've seen problems like that in a previous life. There's also the "if you have enough machines, something is always failing" syndrome. Working with thousands of machines instead of hundreds hammers that home. – Jonathan Leffler Oct 20 '14 at 22:29

1 Answer

(Sorry, but my original answer was fairly wrong... Here is the correction)

Using $? to get the exit status of the background process in startProcess.sh gives the wrong result. The bash man page states:

Special Parameters
?      Expands to the status of the most recently executed foreground
       pipeline.

As you mentioned in your comment, the proper way to get a background process's exit status is the wait builtin. To do this without blocking the script, it has to handle the SIGCHLD signal.
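As a minimal illustration of the difference between `$?` and `wait`, here is a blocking sketch that does not use a SIGCHLD trap. The subshell that exits with status 3 is a hypothetical stand-in for the real process, and the variable names are chosen only for this example:

```bash
#!/bin/bash
# Hypothetical stand-in for the real process: a subshell that fails with status 3.
( sleep 1; exit 3 ) &
launch_status=$?   # status of launching the job, not of the job itself
pid=$!             # PID of the most recently started background job

wait "$pid"        # blocks until that job exits and returns its exit status
exit_status=$?

echo "launch status: $launch_status, exit status of $pid: $exit_status"
```

Right after the `&`, `$?` only says that the job was launched; the real exit status becomes available once `wait` returns.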

I made a small test environment for this to show how it can work:

Here is a script loop.sh to run as a background process:

#!/bin/bash
[ "$1" == -x ] && exit 1;
cnt=${1:-500}
while ((++c<=cnt)); do echo "SLEEPING [$$]: $c/$cnt"; sleep 5; done

If the argument is -x, it exits with status 1 to simulate an error. If the argument is a number num, it runs for num*5 seconds, printing SLEEPING [<PID>]: <counter>/<max_counter> to stdout.

The second is the launcher script. It starts 3 loop.sh scripts in the background and prints their exit status:

#!/bin/bash

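# Called on SIGCHLD: report and forget every child that has exited
handle_chld() {
    local i
    for i in "${!pids[@]}"; do
        # A missing /proc entry means this child is gone
        if [ ! -d "/proc/${pids[i]}" ]; then
            wait "${pids[i]}"
            echo "Stopped ${pids[i]}; exit code: $?"
            unset 'pids[i]'
        fi
    done
}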

set -o monitor
trap "handle_chld" CHLD

# Start background processes
./loop.sh 3 &
pids+=($!)
./loop.sh 2 &
pids+=($!)
./loop.sh -x &
pids+=($!)

# Wait until all background processes are stopped
while [ ${#pids[@]} -gt 0 ]; do echo "WAITING FOR: ${pids[@]}"; sleep 2; done
echo STOPPED

The handle_chld function handles the SIGCHLD signals. Setting the monitor option enables a non-interactive script to receive SIGCHLD. Then the trap is set for the SIGCHLD signal.

Then the background processes are started, and all of their PIDs are stored in the pids array. When SIGCHLD is received, the handler checks the /proc/ directories to find which child process has stopped (the missing one); this could also be checked with the kill -0 <PID> builtin. After wait, the exit status of the background process is available in the famous $? pseudo-variable.
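The kill -0 alternative mentioned above could look roughly like this. It is only a sketch: `is_running` is a hypothetical helper name, and `sleep 30` stands in for a real background process. `kill -0` sends no signal; it only reports whether the PID can be signalled.

```bash
#!/bin/bash
# is_running is a hypothetical helper: succeeds if the PID can still be signalled.
is_running() {
    kill -0 "$1" 2>/dev/null
}

sleep 30 &                  # placeholder for a real background process
pid=$!

is_running "$pid" && echo "$pid is running"

kill "$pid"                 # stop the placeholder process
wait "$pid" 2>/dev/null     # reap it so the PID is really gone

is_running "$pid" || echo "$pid is gone"
```

Unlike the /proc check, this does not depend on a /proc filesystem being available.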

The main script waits for all PIDs to stop (otherwise it could not get the exit status of its children) and then it stops itself.

An example output:

WAITING FOR: 13102 13103 13104
SLEEPING [13103]: 1/2
SLEEPING [13102]: 1/3
Stopped 13104; exit code: 1
WAITING FOR: 13102 13103
WAITING FOR: 13102 13103
SLEEPING [13103]: 2/2
SLEEPING [13102]: 2/3
WAITING FOR: 13102 13103
WAITING FOR: 13102 13103
SLEEPING [13102]: 3/3
Stopped 13103; exit code: 0
WAITING FOR: 13102
WAITING FOR: 13102
WAITING FOR: 13102
Stopped 13102; exit code: 0
STOPPED

It can be seen that the exit codes are reported correctly.

I hope this can help a bit!

TrueY
  • Thanks for your help, I did what you wrote. I also added a wait for that process ('wait $PID'): if the return code of 'ps -p $PID' is not zero, something happened to that process. When I get the return code I will post here again – Jan Oct 21 '14 at 16:10
  • @Jan: Could You solve the problem? What is the returned error code? – TrueY Oct 24 '14 at 09:30
  • @Jan: Could You solve the problem? What is the returned error code? – TrueY Nov 14 '14 at 12:37
  • Yes. The problem was with the pidof command. In my question I wrote that I was checking the return code $? of the nohup, which in the real script I wasn't; instead I was checking the $? of the pidof right after the nohup. To sum up, everything was working correctly, except that rather than checking whether the PID from $! exists, I was doing pidof "name-of-the-command" to see if it was correctly started, which led to a race condition, and that is why the errors were random. I will edit my question so that it reflects my actual code. Anyway, your answer really helped me, many thanks. – Jan Nov 15 '14 at 23:44
  • @Jan: Thanks! I also modified a little bit my code to be a little bit smarter. – TrueY Nov 16 '14 at 20:59
  • nice one. Thanks a lot – Alejandro Teixeira Muñoz Sep 06 '16 at 07:59
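For reference, the fix described in the comments (checking the PID from `$!` instead of calling pidof by name) might be sketched as follows. This is not Jan's actual script: `start_and_verify`, the one-second grace period, and the log-file argument are all assumptions made for illustration, and the sketch assumes a long-running process, so one that exits within the grace period is reported as a failure.

```bash
#!/bin/bash
# start_and_verify is a hypothetical helper, not code from the question.
# Usage: start_and_verify <logfile> <command> [args...]
start_and_verify() {
    local log=$1; shift
    nohup "$@" >> "$log" 2>&1 &
    local pid=$!                  # PID of exactly the process just started

    sleep 1                       # assumed grace period before checking
    if kill -0 "$pid" 2>/dev/null; then
        echo "started, PID $pid"
        return 0
    fi

    # The process is already gone: collect its exit status for the error message
    wait "$pid"
    local status=$?
    echo "Error: failed to start process (exit status $status)" >&2
    return 1
}
```

It would be called as, for example, `start_and_verify "processLog" "processName" "processArgs"`. Because the check uses the PID of the process that was just launched, it cannot be confused by a lookup by name that runs before the new process is visible or that matches an old instance that is still shutting down.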