I have been using Net::SSH::Perl to automate running some scripts on my remote boxes. However, some of these scripts take a really long time to complete (an hour or two), and sometimes I stop getting data from them without actually losing the connection.

Here's the code I'm using:

sub run_regression_tests {
    for(my $i = 0; $i < @servers; $i++){
        my $inner = $users[$i];
        foreach(@$inner){
            my $user = $_;
            my $server = $servers[$i];

            my $outFile;
            open($outFile, ">" . $outputDir . $user . "@" . $server . ".log.txt");
            print $outFile "Opening connection to $user at $server on " . localtime() . "\n\n";
            close($outFile);

            my $pid = $pm->start and next;

                print "Connecting to $user@" . "$server...\n";

                my $hasWentToDownloadYet = 0;
                my $ssh = Net::SSH::Perl->new($server, %sshParams);
                $ssh->login($user, $password);              

                $ssh->register_handler("stdout", sub {
                    my($channel, $buffer) = @_;             
                    my $outFile;
                    open($outFile, ">>", $outputDir . $user . "@" . $server . ".log.txt");                  
                    print $outFile $buffer->bytes;              
                    close($outFile);                

                    my @lines = split("\n", $buffer->bytes);
                    foreach(@lines){
                        if($_ =~ m/REGRESSION TEST IS COMPLETE/){
                            $ssh->_disconnect();

                            if(!$hasWentToDownloadYet){
                                $hasWentToDownloadYet = 1;
                                print "Caught exit signal.\n";
                                print("Regression tests for ${user}\@${server} finished.\n");
                                download_regression_results($user, $server);
                                $pm->finish;
                            }
                        }
                    }

                });
                $ssh->register_handler("stderr", sub {
                    my($channel, $buffer) = @_;             
                    my $outFile;
                    open($outFile, ">>", $outputDir . $user . "@" . $server . ".log.txt");

                    print $outFile $buffer->bytes;              

                    close($outFile);                
                });
                if($debug){
                    $ssh->cmd('tail -fn 40 /GDS/gds/gdstest/t-gds-master/bin/comp.reg');
                }else{
                    my ($stdout, $stderr, $exit) = $ssh->cmd('. ./.profile && cleanall && my.comp.reg');
                    if(!$exit){
                        print "SSH connection failed for ${user}\@${server} finished.\n";
                    }
                }
                #$ssh->cmd('. ./.profile');

                if(!$hasWentToDownloadYet){
                    $hasWentToDownloadYet = 1;
                    print("Regression tests for ${user}\@${server} finished.\n");
                    download_regression_results($user, $server);
                }

            $pm->finish;        
        }
    }
    sleep(1);
    print "\n\n\nAll tests started. Tests typically take 1 hour to complete.\n";
    print "If they take significantly less time, there could be an error.\n";
    print "\n\nNo output will be printed until all commands have executed and finished.\n";
    print "If you wish to watch the progress tail -f one of the logs this script produces.\n Example:\n\t" . 'tail -f ./gds1@tdgds10.log.txt' . "\n";
    $pm->wait_all_children;
    print "\n\nAll Tests are Finished. \n";
}

And here is my %sshParams:

my %sshParams = (
    protocol => '2',
    port => '22',
    options => [
        "TCPKeepAlive yes",
        "ConnectTimeout 10",
        "BatchMode yes"
    ]
);

Occasionally, one of the long-running commands simply stops firing the stdout and stderr handlers and never exits. The SSH connection doesn't die (as far as I'm aware), because $ssh->cmd is still blocking.

Any idea how to correct this behaviour?

RobEarl
Malfist
  • Do you have shell access to the server running this command? If you do, you can check whether the ssh command is present via `ps auxgmww | grep ssh`. You can at least test your assumption that the ssh process is still working. Assuming that is working and good, you can run ps to get the process ID of your program, and then run `strace -fp $PID` (substitute the PID of your program for $PID). See if that sheds any light on what it might be stuck on. – Neil Neely Aug 22 '11 at 21:03
  • Log into the remote server too and see what processes are running, and perform an strace on those as well to see if that sheds any light on where it is stuck. Any chance that one of the regression tests you are running might want something on STDIN in some circumstances? – Neil Neely Aug 22 '11 at 21:05
  • Is `ConenctTimeout` a typo in your question, or in your actual settings? – TLP Aug 22 '11 at 21:16
  • Looks like Net::SSH::Perl doesn't fork an ssh process, my mistake. Should still be beneficial to do an strace on your master process to see what it's doing when it's "stuck". – Neil Neely Aug 22 '11 at 21:28
  • This reminded me of a cvs+ssh problem I read about recently: [CVS SSH](http://www.airs.com/blog/archives/521). Not sure it's related, though. – daniel kullmann Aug 22 '11 at 21:38
  • @TLP, it's a typo, I've fixed it – Malfist Aug 23 '11 at 13:39

2 Answers

In your %sshParams hash, you may need to add "TCPKeepAlive yes" to your options:

$sshParams{'options'} = ["BatchMode yes", "TCPKeepAlive yes"];

Those options might or might not be right for you, but TCPKeepAlive is something I would recommend setting for any long-running SSH connection. If there is any kind of stateful firewall in the path, it could drop the connection state when no traffic has passed over the connection for a long time.

Neil Neely
  • Both options are set to this. I appended my %sshParams to the question. – Malfist Aug 22 '11 at 20:21
  • Bit of a long shot, but an alternative to the TCPKeepAlive approach is: `["ServerAliveCountMax 3", "ServerAliveInterval 300"]`. I'm not confident this is the real problem though. – Neil Neely Aug 22 '11 at 21:27

It probably fails because of the way you search the output for the REGRESSION TEST IS COMPLETE marker. The marker may be split across two different SSH packets, in which case your callback will never find it.
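If you do keep the matching on the client side, here is a minimal sketch of carrying a partial line over between callbacks so that a marker split across two chunks is still matched. `make_marker_watcher` is a hypothetical helper name, not part of Net::SSH::Perl; the assumption is that the stdout handler would feed it `$buffer->bytes` on each call.

```perl
use strict;
use warnings;

# Hypothetical helper (not a Net::SSH::Perl API): returns a closure that
# accepts raw output chunks and reports whether the marker has appeared
# on any complete line seen so far.
sub make_marker_watcher {
    my ($marker) = @_;
    my $pending = '';   # unterminated tail carried over between chunks
    my $seen    = 0;

    return sub {
        my ($chunk) = @_;
        $pending .= $chunk;
        # Consume complete lines only; a trailing partial line stays
        # in $pending until the rest of it arrives.
        while ($pending =~ s/\A([^\n]*)\n//) {
            $seen = 1 if index($1, $marker) >= 0;
        }
        return $seen;
    };
}

my $watch = make_marker_watcher('REGRESSION TEST IS COMPLETE');
$watch->("running test 41\nREGRESSION TE");   # marker split across chunks
my $done = $watch->("ST IS COMPLETE\n");
print $done ? "marker found\n" : "marker not found\n";
```

Note that this sketch only matches once the line containing the marker has been terminated by a newline, so a marker on a final unterminated line would go unnoticed.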

Better, use a remote command that exits by itself once the run is done, such as this one-liner:

perl -pe 'BEGIN {$p = open STDIN, "my.comp.reg |" or die $!}; kill TERM => -$p if /REGRESSION TEST IS COMPLETE/'

Otherwise, you close the remote connection but do not stop the remote process, which stays alive.

Besides that, you should try using Net::OpenSSH or Net::OpenSSH::Parallel instead of Net::SSH::Perl:

use Net::OpenSSH::Parallel;

my $pssh = Net::OpenSSH::Parallel->new;

# Register one host entry per user/server pair
# (@servers, @users and $password are the variables from the question).
for my $i (0..$#servers) {
    my $server = $servers[$i];
    for my $user (@{$users[$i]}) {
        $pssh->add_host("$user\@$server", password => $password);
    }
}

# Queue the command for every host; output goes to a per-host log file.
if ($debug) {
    $pssh->all(cmd => { stdout_file => "$outputDir%USER%\@%HOST%.log.txt",
                        stderr_to_stdout => 1 },
               'tail -fn 40 /GDS/gds/gdstest/t-gds-master/bin/comp.reg');
}
else {
    $pssh->all(cmd => { stdout_file => "$outputDir%USER%\@%HOST%.log.txt",
                        stderr_to_stdout => 1 },
               '. ./.profile && cleanall && my.comp.reg');
}

# Queue the download of the results for every host.
$pssh->all(scp_get => $remote_regression_results_path, "regression_results/%USER%\@%HOST%/");

# Run the queued actions on all hosts.
$pssh->run;
salva
  • Looking for the REGRESSION TEST IS COMPLETE marker isn't causing this. If I tail the output, it stops reporting output in the middle of a test, not at the end. – Malfist Aug 23 '11 at 13:40
  • I didn't go with Net::OpenSSH and Net::OpenSSH::Parallel because they don't support Windows, not even under Cygwin, which is where this script will usually be running. – Malfist Aug 23 '11 at 13:44
  • @Malfist: [Net::SSH2](http://search.cpan.org/perldoc?Net::SSH2) is another alternative. Nowadays nobody maintains Net::SSH::Perl and it has lots of bugs. – salva Aug 23 '11 at 14:04