
I have a bash script (that I'm converting to Perl) that runs in an infinite loop (while true; do) to poll devices on our network and log their responses to a text file. With each iteration of the while loop, each device's text file is appended with its latest information.
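For context, a rough Perl sketch of that kind of poll-and-append loop might look like the following (the device names, the ping poll, and the log path are placeholders, not the actual script):

#!/usr/bin/perl
use strict;
use warnings;

# Placeholder device list and log directory -- not the real configuration.
my @devices = qw( device1 device2 );
my $log_dir = "/var/log/poller";

while (1) {
    for my $device (@devices) {
        # Placeholder poll: swap in the real query (SNMP, ssh, etc.).
        my $response = `ping -c 1 -W 2 $device 2>&1`;

        open( my $log, ">>", "$log_dir/$device.log" )
            or die "Cannot append to log for $device: $!";
        print {$log} scalar(localtime) . "\n" . $response;
        close($log);
    }
    sleep 10;
}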

I would like this script to always be running: if it hangs, crashes, or stops writing to the appropriate text files, it should be restarted.

Following the advice posted in this StackOverflow question, I could write the following bash script:

until myserver; do
  echo "Server 'myserver' crashed with exit code $?.  Respawning.." >&2
  sleep 1
done

where myserver is the polling program. This would cover the script unexpectedly crashing or hanging, presuming a non-zero exit code is issued in those cases. However, if the script doesn't fail or exit outright, but fails in a way that simply stops it from writing out to the text files, I'd like to restart the script in that case as well. This is where a watchdog-like script would come in. I could use Python's watchdog and write a script that uses its Observer API to monitor the text files being generated, like in this example. I would trigger on stagnant text files to make the Python script exit non-zero, and then augment the above bash script as follows:

until [myserver -o pythonMon]; do
  echo "Server 'myserver' crashed with exit code $?.  Respawning.." >&2
  sleep 1
done

where pythonMon is the Python script monitoring whether or not the text files are updating properly. Unfortunately, this approach requires three scripts (the main polling script and two monitoring scripts); it's a bit of a kludge. I'm looking to optimize/simplify this approach. Any recommendations? Ideally, I'd have a single script (or at least a single monitoring script) to keep the polling script running, rather than two. Would there be a way to add file monitoring directly into the bash or Perl code? This is running on 64-bit CentOS 6.5.
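As one possible answer to that last question: rather than a separate Python watchdog, a simple mtime comparison could live in (or alongside) the Perl code itself. The following is a minimal, untested sketch, assuming the logs live under /var/log/poller and that "stale" means no update for 60 seconds:

#!/usr/bin/perl
use strict;
use warnings;

# Assumed log location and staleness threshold -- adjust to the real layout.
my @log_files   = glob("/var/log/poller/*.log");
my $stale_after = 60;    # seconds

for my $file (@log_files) {
    my @stat = stat $file or die "Cannot stat $file: $!";
    if ( time() - $stat[9] > $stale_after ) {
        warn "$file has not been updated in over $stale_after seconds\n";
        exit 1;    # non-zero exit, so the respawn loop restarts the poller
    }
}
exit 0;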

secJ
  • I presume there's a pause of some sort between the polls. How about taking the loop out of your program, so that it just interrogates the hardware once, records the result, and exits. Then you can run it as a `cron` job which will run at the same frequency, and the question of restarting doesn't arise. All that happens if a poll fails is that an entry is missing from the data sequence. – Borodin Jan 25 '15 at 18:07
  • @Borodin there is a pause, but the polling occurs every 10 seconds, so a cron job would not work, as it only allows the script to be executed once a minute (vice the needed 10 seconds). – secJ Jan 25 '15 at 18:54
  • Then your Perl program should `fork` a child Perl process every ten seconds and let that do the poll. The parent process can `kill` and harvest the most recent child before spawning a new one, although your requirement to restart *“if [the process is] no longer writing to the appropriate text files”* is a little worrying. What sort of error do you imagine here, and how could it be tested? I would write an answer with some sample code, but I am using a tablet at present and can't test anything. – Borodin Jan 25 '15 at 19:10
  • @Borodin thank you. I'm not overly familiar with forking a sub-process in perl, but will look it up. As for the error of not appropriately writing files, I'm not sure of the root cause just yet, but have seen cases where the PID still exists for the polling bash file, yet the output text files are no longer being updated. That led me to monitoring the writing of those text files and restarting if they are stagnant. I'm making the assumption that it could be a hiccup in the network or a polled device that fails and for whatever reason halts the polling script. Does that make sense? – secJ Jan 25 '15 at 19:21
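
For illustration, a minimal, untested sketch of the fork-and-harvest pattern described in the comments above (poll_devices is a stand-in for the real polling code):

#!/usr/bin/perl
use strict;
use warnings;

sub poll_devices {
    # Stand-in for the real poll-and-log code.
    print "$$ polling\n";
}

my $child;
while (1) {
    if ( defined $child ) {
        # Kill and reap the previous poll if it is still running.
        kill 'TERM', $child;
        waitpid $child, 0;
    }
    $child = fork;
    die "fork failed: $!" unless defined $child;
    if ( $child == 0 ) {    # child: do one poll, then exit
        poll_devices();
        exit 0;
    }
    sleep 10;               # parent: wait out the polling interval
}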

1 Answer


I'm doing something rather similar for monitoring a bunch of devices. Depends a bit on polling frequency though - I'm spawning via cron, at 3m intervals.

Bear in mind 10s samples are potentially quite intensive, and may not always be necessary - it does depend a bit on what you're aiming for.

Anyway, the tool for the job is Parallel::ForkManager.

#!/usr/bin/perl

use strict;
use warnings;

use Parallel::ForkManager;

my @targets = qw( server1 server2 );

# Map of test name => code ref. Note the parentheses: assigning a { }
# hash reference here instead leaves %test_list with a single bogus key,
# and dispatching the tests then fails with "Can't use an undefined
# value as a subroutine reference".
my %test_list = ( 'fetch_cpu' => \&fetch_cpu_stats, );

sub fetch_cpu_stats {
    my ($target) = @_;
    ## do something to $target;
    open( my $ssh_results, "-|", "ssh -n $target uptime" )
        or die $!;
    while (<$ssh_results>) {
        print;
    }
    close($ssh_results);
}

# Run at most 10 tests concurrently.
my $manager = Parallel::ForkManager->new(10);

while (1) {
    foreach my $test ( keys %test_list ) {
        foreach my $target (@targets) {
            $manager->start and next;    # parent continues; child runs the test
            print "$$ starting $test\n";
            &{$test_list{$test}}($target);
            $manager->finish;
        }
    }
    $manager->wait_all_children;
    sleep 10;
}

This'll spawn up to 10 concurrent 'tests' and re-run them every 10s. It's probably worth adding some sort of 'lock' (using flock) so that cron can easily check whether your 'daemon' script is still running.

That would be something like:

use Fcntl qw( :flock );

open( my $self, "<", $0 ) or die $!;
flock( $self, LOCK_EX | LOCK_NB ) or die "$0 already running";

You can then fire it from cron every so often, and it'll restart itself if it has died for some reason.

But anyway - you can have multiple subroutines (e.g. your test scripts) all spawned autonomously (and for bonus points - they'll run in parallel).
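
For instance, the log-staleness check from the question could become just another entry in %test_list. A hypothetical sketch of what the extra subroutine and the extended %test_list declaration might look like (check_logs, the log path, and the 60-second threshold are all assumptions):

# Hypothetical extra test: warn if a target's log file has gone stale.
sub check_logs {
    my ($target) = @_;
    my $log_file = "/var/log/poller/$target.log";    # assumed path
    my $mtime    = ( stat $log_file )[9];
    if ( !defined $mtime or time() - $mtime > 60 ) {
        warn "$$: $log_file looks stale\n";
    }
}

my %test_list = (
    'fetch_cpu'  => \&fetch_cpu_stats,
    'check_logs' => \&check_logs,
);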

Sobrique
  • Thank you. I will give this a shot. This may do exactly what I need. As @Borodin mentioned above, this doesn't entirely account for the "write" failure I described, but as I think your comment alludes to, I may be able to run another subroutine that tests that the output files are being appended properly. I will work the problem more and let you know if this solves my issue. Thank you again. – secJ Jan 25 '15 at 21:17
  • I'm having a bit of trouble with the dereferencing call to `sub fetch_cpu_stats` happening in line 30 `&{$test_list{$test}}($target);`. I receive the following error: "Can't use an undefined value as a subroutine reference". I'm not sure where the issue is here. The syntax looks correct to me and `$test` seems to get the appropriate coderef. However, if I call `&fetch_cpu_stats` directly instead of going through the `%test_list` hash dereference it seems to work ok. Any thoughts about this? Again, I'm very rusty and may be missing something obvious here. – secJ Jan 27 '15 at 01:35
  • That should work, but as an alternative - `$test_list{$test} -> ($target);` should do similar. – Sobrique Jan 27 '15 at 18:50
  • Yeah, I tried that too, but get the same result. Oh well, that's in the weeds and I'll work through it. Appreciate your help with the overall problem. I'll explore some options to verify the log files are being written periodically as well, maybe a subroutine that's paired with the cronjob to verify this script is running. Thanks. – secJ Jan 27 '15 at 22:08
  • It may be there's a syntax error in the code somewhere, so it might be worth making another post. – Sobrique Jan 28 '15 at 18:12