I am running a Monte Carlo simulation on multiple processors, but it hangs a lot. So I put together this Perl code to kill the iteration that hangs and go on to the next one. But I get some errors that I have not figured out yet. I think it sleeps too long and deletes the out.mt0 file before it looks for it. This is the code:

my $pid = fork();
die "Could not fork\n" if not defined $pid;

if ($pid == 0) {
    print "In child\n";   
    system("hspice -i mont_read.sp -o out -mt 4"); wait;
    sleep(.8); wait;
    exit(0);
}

print "In parent \n";

$i = 0;    
$mont_number = $j - 1;

out: while (1) {
    $res = waitpid($pid, WNOHANG);    
    if ($res == -1) {
        print "Successful Exit Process Detected\n";
        system("mv out.mt0 mont_read.mt0"); wait;
        sleep(1); wait;
        system("perl monte_stat.pl > rel_out.txt"); wait ;
        system("cat stat_result.txt rel_out.txt > stat_result.tmp"); wait; 
        system("mv stat_result.tmp stat_result.txt"); wait;
        print "\nSim #$mont_number complete\n"; wait;
        last out;    
    }

    if ($res != -1) {    
        if ($i >= $timeout) {
            $hang_count = $hang_count+1;
            system("killall hspice"); wait;
            sleep(1);
            print("time_out complete\n"); wait;
            print "\nSim #$mont_number complete\n"; wait;
            last out; 
        }

        if ($i < $timeout) {
            sleep $slept; wait;
        }
        $i = $i+1;
    }
}

This is the error:

Illegal division by zero at monte_stat.pl line 73,  line 2.
mv: cannot stat `out.mt0': No such file or directory
Illegal division by zero at monte_stat.pl line 73,  line 1.
mv: cannot stat `out.mt0': No such file or directory
Illegal division by zero at monte_stat.pl line 73,  line 1.
mv: cannot stat `out.mt0': No such file or directory
Illegal division by zero at monte_stat.pl line 73.
mv: cannot stat `out.mt0': No such file or directory
Illegal division by zero at monte_stat.pl line 73.
mv: cannot stat `out.mt0': No such file or directory
mv: cannot stat `out.mt0': No such file or directory
mv: cannot stat `out.mt0': No such file or directory
Illegal division by zero at monte_stat.pl line 73,  line 3.
mv: cannot stat `out.mt0': No such file or directory
Illegal division by zero at monte_stat.pl line 73,  line 1.
mv: cannot stat `out.mt0': No such file or directory

Could anyone give me an idea of where to look to debug this? Thanks.

zdim
Aliyar Attaran

1 Answer

Judging by the errors (out.mt0 is missing when mv runs), it appears that your hspice is crashing. But there are other issues with the code as well.

First, here is a working example, kept as close as possible to your code.

use warnings;
use strict;
use feature 'say';
use POSIX qw(:sys_wait_h);   # provides WNOHANG
$| = 1;                      # unbuffered output, so the dots show up at once

# All in seconds: when to give up, how long the job runs, polling interval
my ($timeout, $duration, $sleep_time) = (5, 10, 1);

my $pid = fork // die "Can't fork: $!";

if ($pid == 0)  
{
    # Child: replace this process with the job (here a stand-in shell command)
    exec "echo JOB STARTS; sleep $duration; echo JOB DONE";
    die "exec shouldn't return: $!";
}    
say "Started $pid";
sleep 1;

my $tot_sec = 0;    
while (1) 
{
    # Non-blocking check: returns 0 while the child is still running
    my $ret = waitpid $pid, WNOHANG;

    if    ($ret > 0) { say "Child $ret exited with: $?";  last; }
    elsif ($ret < 0) { say "\nNo such process ($ret)";    last; }
    else             { print " . " }

    sleep $sleep_time;

    if (($tot_sec += $sleep_time) > $timeout) {
        say "\nTimeout. Send 15 (SIGTERM) signal to the process.";
        kill 15, $pid;
        last;
    }   
}

With $duration (of the job) set to 3, shorter than $timeout, we get

Started 16848
JOB STARTS
 .  .  . JOB DONE
Child 16848 exited with: 0

while with $duration set to 10 we get

Started 16550
JOB STARTS
 .  .  .  .  .
Timeout. Send 15 (SIGTERM) signal to the process.

and the job is killed (wait another 5 seconds to confirm: JOB DONE should not show up).

Comments on the code in the question

  • If you fork only to run a job, there is no reason to use system. Just exec that program.

  • There is no need for wait after system, and it is wrong: system already waits for the command to complete.

  • The wait doesn't belong after print and sleep either, and it is wrong there too. In the parent, wait blocks until the child exits, which defeats the timeout.

  • There is no need to shell out to killall in order to kill a process. Perl's built-in kill does that, given the PID.

  • If you end up using system, the program will run in a new process with another PID. Then more is needed to find that PID and kill it. See Proc::ProcessTable and this post, for example.

  • The code above still needs checks of whether the process was indeed killed.
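As for checking that the process was indeed killed, here is a minimal sketch of one way to do it. The helper name stop_child and the grace period are made up for illustration, and sleep 30 merely stands in for the real job: send SIGTERM, poll for a few seconds, and only then fall back to SIGKILL.

```perl
use strict;
use warnings;
use POSIX qw(:sys_wait_h);

# Hypothetical helper: ask the child to stop, then make sure it is gone.
# Returns 'TERM' if SIGTERM was enough, 'KILL' if SIGKILL had to be sent.
sub stop_child {
    my ($pid, $grace) = @_;    # $grace: seconds to wait after SIGTERM

    kill 'TERM', $pid;
    for (1 .. $grace) {
        # Non-zero: the child exited and was reaped (or no longer exists)
        return 'TERM' if waitpid($pid, WNOHANG) != 0;
        sleep 1;
    }

    # Still running: SIGKILL cannot be caught or ignored
    kill 'KILL', $pid;
    waitpid $pid, 0;           # blocking wait, reaps the child
    return 'KILL';
}

# Demo with a stand-in job (assumption: not the real simulator)
my $pid = fork // die "Can't fork: $!";
if ($pid == 0) {
    exec 'sleep', '30';
    die "exec failed: $!";
}
my $how = stop_child($pid, 3);
print "Child $pid stopped with SIG$how\n";
```

In the timeout branch of the polling loop, a call like stop_child($pid, 3) could take the place of the bare kill 15, $pid.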

Replace the echo ... command with your own command line and add checks for it as needed.
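One such check, given the mv errors in the question, is to look at the exit status and at the output file before any post-processing. This is only a sketch: the helper job_succeeded and the demo file name are invented here, and a small shell command stands in for the simulator.

```perl
use strict;
use warnings;

# Hypothetical helper: decide whether the post-processing steps should run.
# $status is the value of $? after waitpid; $file is the expected output file.
sub job_succeeded {
    my ($status, $file) = @_;
    return 0 if $status == -1;        # the wait itself failed
    return 0 if $status & 127;        # child was terminated by a signal
    return 0 if $status >> 8;         # child returned a non-zero exit code
    return (-s $file) ? 1 : 0;        # output exists and is not empty
}

# Stand-in for the simulator: a shell command that writes the output file
my $out = "out_demo.$$";              # made-up file name for this demo
my $pid = fork // die "Can't fork: $!";
if ($pid == 0) {
    exec 'sh', '-c', "echo data > $out";
    die "exec failed: $!";
}
waitpid $pid, 0;
my $ok = job_succeeded($?, $out);
print $ok ? "Output ready, run post-processing\n"
          : "Job failed, skip this iteration\n";
unlink $out;                          # clean up the demo file
```

With a check like this, the mv and monte_stat.pl steps would run only when the job really produced its output.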

Another option is to simply sleep for a $timeout period and then check whether the job is done (child exited). However, with your approach you can do other things while polling.
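The sleep-then-check variant might look roughly like this. The values are illustrative, and sleep 30 again stands in for the real job.

```perl
use strict;
use warnings;
use POSIX qw(:sys_wait_h);

my $timeout = 2;    # seconds; illustrative value

my $pid = fork // die "Can't fork: $!";
if ($pid == 0) {
    exec 'sleep', '30';               # stand-in for the real job
    die "exec failed: $!";
}

sleep $timeout;                       # no polling, just wait out the timeout

# A single non-blocking check: 0 means the child is still running
my $ret = waitpid $pid, WNOHANG;
my $timed_out = ($ret == 0);
if ($timed_out) {
    kill 'TERM', $pid;
    waitpid $pid, 0;                  # reap the killed child
}
print $timed_out ? "Timed out, killed $pid\n" : "Child finished in time\n";
```

The drawback is that the parent always sleeps for the full $timeout, even when the job finishes early.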

Another option is to use alarm.
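A rough sketch of the alarm approach, following the usual eval and $SIG{ALRM} idiom; the timeout value and the sleep 30 stand-in job are assumptions for the demo.

```perl
use strict;
use warnings;

my $timeout = 2;    # seconds; illustrative value

my $pid = fork // die "Can't fork: $!";
if ($pid == 0) {
    exec 'sleep', '30';               # stand-in for the real job
    die "exec failed: $!";
}

my $timed_out = 0;
eval {
    # The handler dies, which throws us out of the blocking waitpid below
    local $SIG{ALRM} = sub { die "timeout\n" };
    alarm $timeout;
    waitpid $pid, 0;                  # blocking wait for the child
    alarm 0;                          # finished in time: cancel the alarm
    1;
} or do {
    die $@ unless $@ eq "timeout\n";  # rethrow anything unexpected
    $timed_out = 1;
    kill 'TERM', $pid;
    waitpid $pid, 0;                  # reap the killed child
};
print $timed_out ? "Timed out, killed $pid\n" : "Child exited with: $?\n";
```

Here the parent blocks in waitpid and returns as soon as the child exits, so there is no polling interval to tune.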

zdim