
I found an answer here on using threads for HTTP requests.

Just want to ask: in the accepted answer,

for my $url ('http://www.google.com/', 'http://www.perl.org/') {
   push @threads, async { $ua->get($url) };
}

If I have more than 20K URLs to fetch, is this approach of pushing to the @threads array inside the for loop advisable? Or should I restructure it to handle more than 20K list items? How can I do it so that it doesn't crash my system? Thanks.

dorothy

3 Answers


That is quite a few threads to launch. It's probably below the thread limit for your system, but whether it's workable depends on how much memory and CPU you have available for the job.
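For a sense of scale: each Perl ithread clones the full interpreter, which commonly costs several megabytes apiece (the exact figure varies by build and platform). A quick back-of-envelope, assuming a hypothetical ~8 MB per thread:

```perl
use strict;
use warnings;

# Back-of-envelope memory cost of launching one thread per URL.
# The 8 MB/thread figure is an assumption; real ithread overhead
# varies widely by platform and Perl build.
my $threads       = 20_000;
my $mb_per_thread = 8;
my $total_gb      = $threads * $mb_per_thread / 1024;
printf "~%.0f GB just for thread state\n", $total_gb;
```

Even if the per-thread figure is off by half, that's far more than most machines can spare, which is why a bounded worker pool is the usual answer.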

If you'd rather use a worker pool, Parallel::ForkManager is a popular module for that.

The module's documentation offers this example for a mass-downloader:

use LWP::Simple;
use HTTP::Status qw(RC_OK);  # getstore returns an HTTP status code
use Parallel::ForkManager;

...

@links=(
  ["http://www.foo.bar/rulez.data","rulez_data.txt"],
  ["http://new.host/more_data.doc","more_data.doc"],
  ...
);

...

# Max 30 processes for parallel download
my $pm = Parallel::ForkManager->new(30);

foreach my $linkarray (@links) {
  $pm->start and next; # do the fork

  my ($link,$fn) = @$linkarray;
  warn "Cannot get $fn from $link"
    if getstore($link,$fn) != RC_OK;

  $pm->finish; # do the exit in the child process
}
$pm->wait_all_children;

LWP::UserAgent doesn't provide the getstore function that LWP::Simple does, but its mirror method behaves similarly.
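Here's a minimal sketch of that substitution, keeping the same pool shape as the documentation example. To make it self-contained it "downloads" from file:// URLs in a temp directory rather than the network; the filenames and payloads are made up for the demo, and it assumes Parallel::ForkManager is installed:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use LWP::UserAgent;
use Parallel::ForkManager;

# Self-contained demo: serve the "downloads" from a temp directory via
# file:// URLs so the sketch runs without network access.
my $src = tempdir(CLEANUP => 1);
my $dst = tempdir(CLEANUP => 1);
for my $name (qw(rulez.data more_data.doc)) {
    open my $fh, '>', "$src/$name" or die $!;
    print {$fh} "payload for $name\n";
    close $fh;
}

my @links = map { [ "file://$src/$_", "$dst/$_" ] } qw(rulez.data more_data.doc);

my $ua = LWP::UserAgent->new();
my $pm = Parallel::ForkManager->new(30);   # at most 30 children at once

for my $linkarray (@links) {
    $pm->start and next;                   # fork; parent continues the loop

    my ($link, $fn) = @$linkarray;
    my $res = $ua->mirror($link, $fn);     # like getstore, but skips files
                                           # that are already up to date
    warn "Cannot get $fn from $link: ", $res->status_line, "\n"
        unless $res->is_success || $res->code == 304;   # 304 = not modified

    $pm->finish;                           # exit the child
}
$pm->wait_all_children;
```

For real URLs you'd just build @links from your 20K-entry list; mirror has the nice property that re-running the script only re-fetches files that changed.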

rutter

You can easily do a worker pool with threads too.

use threads;

use Thread::Queue 3.01 qw( );
use LWP::UserAgent qw( );

use constant NUM_WORKERS => 30;

my $ua = LWP::UserAgent->new();   # created before the threads spawn,
                                  # so each thread gets its own clone

sub process {
   my ($url) = @_;
   ... $ua->get($url) ...
}

my $q = Thread::Queue->new();

my @workers;
for (1..NUM_WORKERS) {
   async {
      while (my $job = $q->dequeue()) {
         process($job);
      }
   };
}

$q->enqueue($_) for @urls;
$q->end();

$_->join() for threads->list();

The Parallel::ForkManager solution rutter posted creates 20,000 worker processes over its lifetime (though only 30 at a time). This one only ever creates 30.
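The pool shape above works for any kind of job, not just HTTP fetches. Here's a network-free toy version you can actually run to watch the pattern work: the workers square numbers instead of calling $ua->get, and results come back on a second queue (the second queue is my addition for the demo, not part of the answer above):

```perl
use strict;
use warnings;
use threads;
use Thread::Queue 3.01 qw( );   # 3.01+ needed for end()

use constant NUM_WORKERS => 4;

my $q       = Thread::Queue->new();
my $results = Thread::Queue->new();   # collect outputs from the workers

for (1 .. NUM_WORKERS) {
    async {
        while (defined(my $job = $q->dequeue())) {
            $results->enqueue($job * $job);   # stand-in for $ua->get($url)
        }
    };
}

$q->enqueue($_) for 1 .. 100;   # 100 jobs, but still only 4 threads
$q->end();                      # workers drain the queue, then exit
$_->join() for threads->list();

$results->end();
my $sum = 0;
while (defined(my $r = $results->dequeue_nb())) { $sum += $r; }
print "sum of squares 1..100 = $sum\n";
```

Swap the squaring for your process($url) and bump NUM_WORKERS back up, and you have the structure of the answer above.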

That said, Net::Curl::Multi is much better at this.

ikegami
  • hi, thanks. how does the above code use the sub proc "process"? I don't see it being called. – dorothy Dec 13 '13 at 04:47
  • @dorothy, Fixed. Just a name mismatch – ikegami Dec 13 '13 at 05:18
  • thanks. i tried this method but always get the "Free to wrong pool 10f9730 not 3425e0 at..." error. I guess Perl threads are not really stable in Windows environment ? – dorothy Dec 13 '13 at 06:11

I would recommend letting POE take care of this kind of stuff.

http://poe.perl.org/?POE_Cookbook

specifically

http://poe.perl.org/?POE_Cookbook/Web_Client

abasterfield