
I have about 10K URLs in an array; at some other time this may be 100K. I need to visit them programmatically, obtain each response, and print it out or do something else with it. To avoid overloading the server that all the URLs belong to, I would like to visit them sequentially. I know the async module can do this. My questions are: is async the only way to do this, and will it scale to a larger number of URLs?

Yash
  • There's no need for the async library to do a simple sequential iteration through an array of requests. You could use it, but it is not necessary and there is no scale issue involved in a sequential iteration one after another. – jfriend00 Sep 05 '16 at 06:10
  • See [How can I throttle stack of API requests](http://stackoverflow.com/questions/35422377/how-can-i-throttle-stack-of-api-requests/35422593#35422593) and [Run 1000 requests so that only 10 runs at a time](http://stackoverflow.com/questions/39141614/run-1000-requests-so-that-only-10-runs-at-a-time/39154813#39154813) and [Make several requests to an API that can only handle 20 request a minute](http://stackoverflow.com/questions/33378923/make-several-requests-to-an-api-that-can-only-handle-20-request-a-minute/33379149#33379149) for implementations of something like you're doing. – jfriend00 Sep 05 '16 at 07:46
  • async is good. If you're running something sequentially, why would you have to worry about scaling? It's just going to take more time. – Christopher Reid Sep 05 '16 at 10:45
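
The plain sequential iteration these comments describe could be sketched as follows, without the async library. This is a minimal sketch, assuming a Node.js version that supports async/await. The names `fetchBody`, `visitSequentially`, and `onResult` are hypothetical helpers made up for illustration, and `fetchBody` does not follow redirects or check status codes.

```javascript
const http = require('http');
const https = require('https');

// Hypothetical helper: fetch one URL and resolve with its body as a string.
// Assumption: no redirect handling and no status-code checks.
function fetchBody(targetUrl) {
    return new Promise(function (resolve, reject) {
        const client = targetUrl.startsWith('https:') ? https : http;
        client.get(targetUrl, function (res) {
            let body = '';
            res.setEncoding('utf8');
            res.on('data', function (chunk) { body += chunk; });
            res.on('end', function () { resolve(body); });
        }).on('error', reject);
    });
}

// Visit the URLs strictly one after another: the next request starts
// only after the previous one has finished or failed.
// fetchOne is any function that takes a URL and returns a promise.
async function visitSequentially(urls, fetchOne, onResult) {
    for (const u of urls) {
        try {
            const body = await fetchOne(u);
            onResult(null, u, body);
        } catch (err) {
            // A failed URL is reported but does not stop the remaining ones
            onResult(err, u);
        }
    }
}

// Usage (performs real network requests, so it is left commented out):
// visitSequentially(['http://example.com/'], fetchBody, function (err, u, body) {
//     if (err) { console.error(u, err.message); return; }
//     console.log(u, body.length);
// });
```

Because only one request is pending at any moment, memory use does not grow with the length of the array; 100K URLs simply take longer than 10K.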

1 Answer


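Use a web crawler module such as crawler (or search for the "crawler" keyword on node-modules.com or npmjs.com). It keeps a queue of requests for you, and its maxConnections option caps how many of them run at the same time.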

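var Crawler = require("crawler");

var c = new Crawler({
    // Upper bound on simultaneous requests; use 1 to visit URLs strictly one at a time
    maxConnections : 10,
    // This will be called for each crawled page.
    // Note: the (error, result, $) signature matches older crawler releases;
    // check the README of the version you install.
    callback : function (error, result, $) {
        // On a failed request, or a response that could not be parsed, $ may be unavailable
        if (error || !$) {
            console.error(error || 'No parsed document for this response');
            return;
        }
        // $ is Cheerio by default,
        // a lean implementation of core jQuery designed specifically for the server
        $('a').each(function(index, a) {
            var toQueueUrl = $(a).attr('href');
            // Skip anchors that have no href attribute
            if (toQueueUrl) {
                c.queue(toQueueUrl);
            }
        });
    }
});

// Queue a list of URLs
c.queue(['http://jamendo.com/','http://tedxparis.com']);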
Jason Livesay