1

Here's the guts of the program using Parallel::ForkManager. It seems to stop at 200 processes; sometimes it's around 30, depending on the size of the PostgreSQL query that collects the URLs to send to Mojo::UserAgent. There seems to be a hard limit somewhere. Is there a better way to write this so that I don't run into that limit? The machine it's running on has 16 CPUs and 128 GB of memory, so it can certainly run more than 200 processes that will die after the Mojo::UserAgent timeout, which is generally 2 seconds.

use Parallel::ForkManager;
use Mojo::Base -strict;
use Mojo::UserAgent;
use Mojo::Pg;
use Math::Random::Secure qw(rand irand);
use POSIX qw(strftime);
use Socket;
use GeoIP2::Database::Reader;
use File::Spec::Functions qw(:ALL);
use File::Basename qw(dirname);

use feature 'say';


my $max_kids = 500;
my $timeout  = 2;    # request timeout in seconds
my @url;             # URLs to fetch; filled by do_auth() below
sub do_auth {
...
        push( @url, $authurl );
}


do_auth();

my $pm = Parallel::ForkManager->new($max_kids);

LINKS:
foreach my $linkarray (@url) {
    $pm->start and next LINKS;    # do the fork
    my $ua = Mojo::UserAgent->new( max_redirects => 5, request_timeout => $timeout );
    $ua->get($linkarray);    # fetch the URL, discard the response
    $pm->finish;
}

$pm->wait_all_children;
ikegami
ajmcello
  • Why would you fork 200 processes when you only have 16 CPUs??? – ThisSuitIsBlackNot Dec 20 '16 at 22:49
  • @ThisSuitIsBlackNot, Because most are sleeping waiting for an HTTP response. – ikegami Dec 20 '16 at 23:21
  • @ajmcello, You'd be better off using a client capable of performing multiple requests without creating an entire process to do it (e.g. [Net::Curl::Multi](http://search.cpan.org/perldoc?Net::Curl::Multi)). – ikegami Dec 20 '16 at 23:23
  • @ajmcello, Is that supposed to be an answer to ThisSuitIsBlackNot? Because it doesn't answer the question at all. If anything, very quick responses suggest using a smaller number of workers. – ikegami Dec 20 '16 at 23:26
  • @ajmcello, If you actually want help with this (other than suggestions that you aren't using the right tool as provided above), you'll have to specify what's failing, and for what reason (i.e. with what error). – ikegami Dec 20 '16 at 23:29
  • What's your resource limit on user processes set to? (You don't specify your platform, but maybe `ulimit -Su` will tell you.) – David Schwartz Dec 20 '16 at 23:30
  • They are either sleeping, or exiting because they established a connection and finished or hit the timeout value. Perhaps async or something is a better way to go, but I'm a below-average, novice Perl user. – ajmcello Dec 20 '16 at 23:31
  • @davidschwartz it's run as root. – ajmcello Dec 20 '16 at 23:32
  • While root is permitted to raise its own resource limits, it still has resource limit settings. What are they? – David Schwartz Dec 20 '16 at 23:34
  • @ikegami it works but it's slow. I had an old way of doing this which had a maxproc value that worked when set, regardless of the value. At 500 or 800 processes, the program would complete in 30-60 minutes. When it runs with 20 or 30, even though maxproc or max_kids is set higher, the program takes 24 hours to run. So it seems the more subprocesses I get, the faster it runs. The old program, or the old way, I lost and had to rewrite. – ajmcello Dec 20 '16 at 23:43
  • @DavidSchwartz maxproc is set to 225252; everything else is unlimited or very, very high. – ajmcello Dec 20 '16 at 23:48
  • I'd prefer to continue using Mojolicious. I'm not sure Net::Curl::Multi is going to work, as I set a user agent and a proxy; however, it might if I can set those. I haven't used libcurl at all and am looking at it now. – ajmcello Dec 20 '16 at 23:57
  • It looks like libcurl supports those, so I will try Net::Curl::Multi and see if it has the hard limits I encountered with Parallel::ForkManager (a rough sketch of that approach follows these comments). Thanks @ikegami – ajmcello Dec 21 '16 at 00:06
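To make the Net::Curl::Multi suggestion above concrete: a rough, untested sketch of that style, with a single process and libcurl multiplexing all of the transfers. It assumes the Net::Curl distribution (Net::Curl::Easy / Net::Curl::Multi); the user-agent string and proxy address are placeholders, and a real program would block on $multi->wait or $multi->fdset rather than the short sleep used here.

use strict;
use warnings;
use feature 'say';
use Net::Curl::Easy qw(:constants);
use Net::Curl::Multi;

my @url = @ARGV;    # URLs; in the real program they would come from the Pg query

my $multi = Net::Curl::Multi->new;

for my $u (@url) {
    my $easy = Net::Curl::Easy->new;
    $easy->setopt( CURLOPT_URL,            $u );
    $easy->setopt( CURLOPT_USERAGENT,      'my-crawler/1.0' );        # placeholder UA string
    $easy->setopt( CURLOPT_PROXY,          'http://127.0.0.1:3128' ); # placeholder proxy
    $easy->setopt( CURLOPT_FOLLOWLOCATION, 1 );
    $easy->setopt( CURLOPT_MAXREDIRS,      5 );
    $easy->setopt( CURLOPT_TIMEOUT,        2 );
    # Discard the body, as the original $ua->get(...) effectively did
    $easy->setopt( CURLOPT_WRITEFUNCTION, sub { length $_[1] } );
    $multi->add_handle($easy);
}

# One process; libcurl drives every transfer concurrently
my $active = $multi->perform;
while (1) {
    # Reap transfers that have finished (or timed out)
    while ( my ( $msg, $easy, $result ) = $multi->info_read ) {
        say 'done: ', $easy->getinfo(CURLINFO_EFFECTIVE_URL);
        $multi->remove_handle($easy);
    }
    last unless $active;
    select( undef, undef, undef, 0.05 );    # simple pause; see note above
    $active = $multi->perform;
}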

2 Answers

0

For your example code (fetching a URL) I would never use Parallel::ForkManager. I would use Mojo::IOLoop::Delay or the non-blocking calling style.

use Mojo::UserAgent;
use feature 'say';

my $ua = Mojo::UserAgent->new;

$ua->inactivity_timeout(15);
$ua->connect_timeout(15);
$ua->request_timeout(15);
$ua->max_connections(0);

my @url = ("http://stackoverflow.com/questions/41253272/joining-a-view-and-a-table-in-mysql",
           "http://stackoverflow.com/questions/41252594/develop-my-own-website-builder",
           "http://stackoverflow.com/questions/41251919/chef-mysql-server-configuration",
           "http://stackoverflow.com/questions/41251689/sql-trigger-update-error",
           "http://stackoverflow.com/questions/41251369/entity-framework-how-to-add-complex-objects-to-db",
           "http://stackoverflow.com/questions/41250730/multi-dimensional-array-from-matching-mysql-columns",
           "http://stackoverflow.com/questions/41250528/search-against-property-in-json-object-using-mysql-5-6",
           "http://stackoverflow.com/questions/41249593/laravel-time-difference",
           "http://stackoverflow.com/questions/41249364/variable-not-work-in-where-clause-php-joomla");

foreach my $linkarray (@url) {
    # Run all requests at the same time
    $ua->get($linkarray => sub {
        my ($ua, $tx) = @_;
        say $tx->res->dom->at('title')->text;
    });
}
Mojo::IOLoop->start unless Mojo::IOLoop->is_running;
user3606329
  • If there are tens of thousands of URLs, running out of client ports becomes an issue; how would you modify this code to impose a limit on the number of URLs being fetched at once? – ysth Dec 21 '16 at 00:52
  • What happens if a URL is slow or times out with Mojo::IOLoop::Delay? Will it hang up Mojo::IOLoop::Delay until that URL is finished before going on to the next? Does it process the URLs in parallel or simultaneously, or is it synchronous? – ajmcello Dec 21 '16 at 00:54
  • I would control the execution flow by declaring a global variable, "our $counter = 0;", then increment the counter before calling the non-blocking function and decrement it in the callback once the execution is finished. When $counter reaches the defined maximum, wait before fetching new URLs. It's just an idea, might work or not (a rough sketch of it follows below). – user3606329 Dec 21 '16 at 02:04
  • @ajmcello the requests are concurrent. When a timeout occurs, the callback of the bad URL will fire after the connection is dropped. The other URLs are executed without interruption. – user3606329 Dec 21 '16 at 02:26
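A rough sketch of the counter idea from the comment above, using Mojo::UserAgent's non-blocking interface: keep at most $max_active requests in flight and pull the next URL off the queue inside each callback. The cap of 30, the @ARGV URL source, and the 15-second timeout are placeholders, not values from the question.

use Mojo::Base -strict;
use Mojo::UserAgent;
use Mojo::IOLoop;

my $ua = Mojo::UserAgent->new( max_redirects => 5, request_timeout => 15 );

my @queue      = @ARGV;   # URLs; in the real program they would come from the Pg query
my $max_active = 30;      # concurrency cap, tune to taste
my $active     = 0;

sub fetch_next {
    # Top the pool back up to the cap whenever a slot is free
    while ( $active < $max_active && @queue ) {
        my $url = shift @queue;
        $active++;
        $ua->get($url => sub {
            my ( $ua, $tx ) = @_;
            $active--;
            say $url, ' => ', $tx->res->code // 'no response';
            fetch_next();    # a slot just opened up
        });
    }
    # Nothing in flight and nothing queued: stop the event loop
    Mojo::IOLoop->stop if !$active && !@queue;
}

fetch_next();
Mojo::IOLoop->start unless Mojo::IOLoop->is_running;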
-1

Most likely you are running into an operating system limit on threads or processes. The quick and dirty way to fix this would be to increase the limit, which is usually configurable. That said, rewriting the code not to use so many short-lived threads is a more scalable solution.

Warren Dew
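As an illustration of the configurable limit mentioned in this answer: a small sketch that checks (and raises) the per-user process limit from inside the Perl script, assuming the BSD::Resource module is available and the platform defines RLIMIT_NPROC; the shell equivalent is the ulimit -Su mentioned in the comments.

use strict;
use warnings;
use feature 'say';
use BSD::Resource qw(getrlimit setrlimit RLIMIT_NPROC);

# Current soft/hard caps on the number of processes this user may create
my ( $soft, $hard ) = getrlimit(RLIMIT_NPROC);
say "process limit: soft=$soft hard=$hard";

# Raise the soft limit to the hard limit before forking lots of workers
setrlimit( RLIMIT_NPROC, $hard, $hard )
    or warn "could not raise RLIMIT_NPROC: $!";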