
Is it possible to implement some kind of timeout (time limit) for a fork using Parallel::ForkManager?

A basic Parallel::ForkManager script looks like this:

use Parallel::ForkManager;
my $pm = Parallel::ForkManager->new( 10 );
for ( 1 .. 1000 ) {
    $pm->start and next;
    # some job for fork
    $pm->finish;
}
$pm->wait_all_children();

I would like to limit the time for "# some job for fork". For example, if it's not finished in 90 seconds, then the fork should be killed/terminated. I thought about using this, but I have to say that I don't know how to use it with Parallel::ForkManager.

EDIT

Thanks hobbs and ikegami. Both your suggestions worked..... but only in this basic example, not in my actual script :(. (screenshot) These forks will stay there forever and, to be honest, I don't know why. I have been using this script for a couple of months and didn't change anything (although many things depend on outside variables). Every fork has to download a page from a website, parse it and save the results to a file. It should not take more than 30 seconds per fork. The timeout is set to 180 seconds. Those hanging forks are totally random, so it's very hard to trace the problem. That's why I came up with a temporary, simple solution: timeout & kill.

What could possibly disable (interrupt) your timeout methods in my code? I don't have any other alarm() anywhere in my code.

EDIT 2

One of the forks was hanging for 1h38m and then returned "timeout PID", which is what I put in die() for alarm(). So the timeout works... but it is about 1h36.5m late ;). Do you have any ideas?

gib
  • Re: Edit 2, are you using LWP::UA? If so, see here: http://stackoverflow.com/questions/73308 – pilcrow Jun 10 '12 at 22:21
  • what does "which is what I type in die() for alarm()" mean? – ikegami Jun 10 '12 at 22:31
  • @pilcrow I'm using LWP::UA (via WWW::Mechanize). Earlier, when I was tracing this issue, I tested a "timeout" on the WWW::Mech HTTP request. The timeout worked, but the fork hung anyway. – gib Jun 10 '12 at 22:40
  • @ikegami I have `if ($@) { die "timeout $$\n"; }` after `eval{ }` ([from this example](http://stackoverflow.com/questions/2423288/ways-to-do-timeouts-in-perl)). Sorry for my English, it's bad and sometimes it's hard for me to explain some things ;). – gib Jun 10 '12 at 22:43
  • 1
    huh? That has nothing to do with alarm. But that's not important. You just said you use LWP, and it uses `alarm`, wiping out your alarm. – ikegami Jun 10 '12 at 22:44
  • hmm.. [here](http://stackoverflow.com/questions/2423288/ways-to-do-timeouts-in-perl) @knorv wrote: `# handle timeout condition`. And it seems that this code is launched if the time limit is exceeded. In [this example](http://pastebin.com/S0K0Um04) it's printing the PIDs of forks which timed out. – gib Jun 10 '12 at 22:54
  • @ikegami I wrote that *earlier* I tested `alarm()` on the WWW::Mech (LWP::UA) HTTP request. Since it didn't help with my "hanging forks" issue, I removed it right away. – gib Jun 10 '12 at 22:59
  • @gibson, In his code, he has a die in an alarm handler. That would catch that exception (and any other exception, since you don't check which exception you got), but 1) it has nothing to do with alarm, and 2) my code has no such handler, yet you said you're using my code. – ikegami Jun 10 '12 at 22:59
  • @ikegami I'm sorry I wasn't clear enough. I tested both timeout solutions: yours and hobbs'. After playing with [this](http://pastebin.com/S0K0Um04) I implemented it in my script. Anyway, in EDIT 2 (my first post) I meant that after 1h38m the hung fork returned "timeout PID" (from `if ($@) { die "timeout $$\n"; }`). And from [this](http://pastebin.com/S0K0Um04) I thought that `if ($@) { die "timeout $$\n"; }` is executed when the timeout is reached. – gib Jun 10 '12 at 23:15
  • No, it's executed when an exception is caught. – ikegami Jun 11 '12 at 00:22

3 Answers


Update

Sorry to update after the close, but I'd be remiss if I didn't point out that Parallel::ForkManager also supports a run_on_start callback. This can be used to install a "child registration" function that takes care of the time()-stamping of PIDs for you.

E.g.,

$pm->run_on_start(sub { my $pid = shift; $workers{$pid} = time(); });

The upshot is that, in conjunction with run_on_wait as described below, the main loop of a P::FM script doesn't have to do anything special. That is, it can remain a simple $pm->start and next, and the callbacks will take care of everything else.
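
A minimal sketch of how the loop simplifies once both callbacks are registered (this assumes the %workers hash and the dismiss_hung_workers function from the original answer below):

$pm->run_on_start(sub { my $pid = shift; $workers{$pid} = time() });
$pm->run_on_wait(\&dismiss_hung_workers, 1);

for (1 .. 1000) {
    $pm->start and next;   # no manual time()-stamping in the parent
    # ... child work ...
    $pm->finish;
}
$pm->wait_all_children;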

Original Answer

Parallel::ForkManager's run_on_wait handler, and a bit of bookkeeping, can force hanging and ALRM-proof children to terminate.

The callback registered by that function is run periodically while $pm awaits child termination.

use strict; use warnings;
use Parallel::ForkManager;

use constant PATIENCE => 90; # seconds

our %workers;

sub dismiss_hung_workers {
  while (my ($pid, $started_at) = each %workers) {
    next unless time() - $started_at > PATIENCE;
    kill TERM => $pid;
    delete $workers{$pid};
  }
}

...

sub main {
  my $pm = Parallel::ForkManager->new(10);
  $pm->run_on_wait(\&dismiss_hung_workers, 1);  # 1 second between callback invocations
  # Forget children that exit normally, so a finished (and possibly reused)
  # PID never gets signalled later:
  $pm->run_on_finish(sub { my $pid = shift; delete $workers{$pid} });

  for (1 .. 1000) {
    if (my $pid = $pm->start) {
      $workers{$pid} = time();
      next;
    }
    # Here we are child.  Do some work.
    # (Maybe install a $SIG{TERM} handler for graceful shutdown!)
    ...
    $pm->finish;
  }

  $pm->wait_all_children;

}

(As others suggest, it's better to have the children regulate themselves via alarm(), but that appears intermittently unworkable for you. You could also resort to wasteful, gross hacks like having each child itself fork() or exec('bash', '-c', 'sleep 90; kill -TERM $PPID').)
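
For completeness, here is a minimal sketch of the child-side $SIG{TERM} handler hinted at in the code comment above. It is an illustrative assumption, not part of the original answer; the loop items stand in for whatever units of work the child actually does:

# Inside the child (between $pm->start and $pm->finish):
my $shutting_down = 0;
$SIG{TERM} = sub { $shutting_down = 1 };  # set a flag; don't die mid-write

for my $unit (1 .. 10) {        # stand-in for the real units of work
    last if $shutting_down;     # stop cleanly between units
    # ... do one unit of work ...
}
$pm->finish;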

pilcrow
  • Thanks!!! It's working :). Although I'm still curious why the `alarm()` solution didn't work. Maybe the answer to that question could help me trace the "hanging code". – gib Jun 12 '12 at 12:46
  • Thanks, this code above worked for me too, but after the script was done I found ssh processes still hanging on. I don't know for how long they remained (I stepped out; by the time I came back my VPN was down, so I guess the terminal session ended and with it those hanging sessions ended too). – rajeev Oct 16 '12 at 02:19

All you need is one line:

use Parallel::ForkManager;
my $pm = Parallel::ForkManager->new( 10 );
for ( 1 .. 1000 ) {
    $pm->start and next;
    alarm 90;             # <---
    # some job for fork
    $pm->finish;
}
$pm->wait_all_children();

You don't need to set up a signal handler, since you do mean for the process to die.

It even works if you exec in the child. It won't work on Windows, but using fork on Windows is questionable in the first place.
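
If you do want the child to report which PID timed out (like the die "timeout $$\n" mentioned in the question), a handler is still easy to add. A minimal variant of the above, not part of the original answer:

$pm->start and next;
$SIG{ALRM} = sub { die "timeout $$\n" };  # name the timed-out child on STDERR
alarm 90;
# some job for fork
$pm->finish;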

ikegami
  • That worked, but only in this basic example, not in my actual script. Please take a look at my first post after the EDIT. – gib Jun 10 '12 at 21:38
  • I can think of a few possibilities, but they are unlikely unless something else uses `alarm`. It could be in a module, e.g. a database driver. I'll see what I can do about a parent-based solution. P::FM is definitely not written with that in mind. – ikegami Jun 10 '12 at 22:16
  • I'm not using any database drivers. Only WWW::Mech, File::Slurp::Unicode, Digest::MD5, Encode, Data::Dumper, and my own module for operations on files, controlling WWW::Mech and parsing HTML. – gib Jun 10 '12 at 22:46

Just do what the answer you linked to suggests, inside the child process (i.e. between the $pm->start and next and the end of the loop), as sketched below. There's nothing special you need to do to make it interact with Parallel::ForkManager, other than making sure you don't accidentally kill the parent instead :)
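
For reference, a minimal sketch of that linked eval/alarm pattern placed inside the child (the 90-second limit and the work placeholder are assumptions):

use Parallel::ForkManager;
my $pm = Parallel::ForkManager->new( 10 );
for ( 1 .. 1000 ) {
    $pm->start and next;
    eval {
        local $SIG{ALRM} = sub { die "alarm\n" };  # NB: "\n" required
        alarm 90;
        # some job for fork
        alarm 0;    # cancel the alarm if the work finished in time
    };
    if ($@) {
        die $@ unless $@ eq "alarm\n";  # propagate unexpected errors
        warn "timeout in child $$\n";   # the job exceeded 90 seconds
    }
    $pm->finish;
}
$pm->wait_all_children();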

hobbs
  • That worked, but only in this basic example, not in my actual script. Please take a look at my first post after the EDIT. – gib Jun 10 '12 at 21:39