I have about 10K URLs in an array; at some point this may grow to 100K. I need to visit each of them programmatically, obtain the response, and print it out or do something with it. To avoid overwhelming the server that all the URLs belong to, I would like to visit them sequentially. I know the async module can do this. My question is: is async the only way to do this, and will async scale to a larger number of URLs?
- There's no need for the async library to do a simple sequential iteration through an array of requests. You could use it, but it isn't necessary, and there is no scaling issue in a sequential iteration that runs one request after another (see the sketch below). – jfriend00 Sep 05 '16 at 06:10
- See [How can I throttle stack of API requests](http://stackoverflow.com/questions/35422377/how-can-i-throttle-stack-of-api-requests/35422593#35422593), [Run 1000 requests so that only 10 runs at a time](http://stackoverflow.com/questions/39141614/run-1000-requests-so-that-only-10-runs-at-a-time/39154813#39154813), and [Make several requests to an API that can only handle 20 request a minute](http://stackoverflow.com/questions/33378923/make-several-requests-to-an-api-that-can-only-handle-20-request-a-minute/33379149#33379149) for implementations of something like you're doing. – jfriend00 Sep 05 '16 at 07:46
- async is good. If you're running something sequentially, why would you have to worry about scaling? It's just going to take more time. – Christopher Reid Sep 05 '16 at 10:45
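As the first comment notes, a plain sequential pass needs no library at all. Here is a minimal sketch of that approach using only the core http module (an assumption: the URLs are plain http://; swap in the https module for https ones), where `urls` is a stand-in for the asker's array:

    var http = require('http');

    // Visit the URLs strictly one after another: the next request is only
    // issued once the current response has fully ended.
    function visitSequentially(urls, index) {
        index = index || 0;
        if (index >= urls.length) return; // all done
        http.get(urls[index], function (res) {
            var body = '';
            res.on('data', function (chunk) { body += chunk; });
            res.on('end', function () {
                console.log(urls[index], res.statusCode, body.length);
                // ...or do something else with `body` here
                visitSequentially(urls, index + 1);
            });
        }).on('error', function (err) {
            console.error(urls[index], err.message);
            visitSequentially(urls, index + 1); // keep going past failures
        });
    }

    visitSequentially(urls); // urls is the 10K/100K array

Because each recursive call is made from inside an I/O callback, the call stack unwinds between requests, so a 100K-element array only means a longer total running time, not more memory or stack depth.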
1 Answer
Use a web crawler module like crawler (or search for the "crawler" keyword on node-modules.com or npmjs.com). For example:
    var Crawler = require("crawler");
    var url = require('url');

    var c = new Crawler({
        maxConnections: 10,
        // This will be called for each crawled page
        callback: function (error, result, $) {
            // $ is Cheerio by default: a lean implementation of core jQuery
            // designed specifically for the server
            $('a').each(function (index, a) {
                var toQueueUrl = $(a).attr('href');
                c.queue(toQueueUrl);
            });
        }
    });

    // Queue a list of URLs
    c.queue(['http://jamendo.com/', 'http://tedxparis.com']);
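For the asker's fixed list, link-following isn't needed: you can queue the whole array directly, and the maxConnections option controls how many requests are in flight at once. Below is a sketch reusing the same callback signature as above, with maxConnections set to 1 so the visits are effectively sequential; `arrayOfUrls` is a placeholder for the 10K/100K array, and it assumes the module exposes the raw response as `result.body`, as in its documented examples:

    var Crawler = require("crawler");

    var sequential = new Crawler({
        maxConnections: 1, // only one request in flight at any moment
        callback: function (error, result, $) {
            if (error) {
                console.error(error);
                return;
            }
            // result.body holds the raw response; do something with it here
            console.log(result.body.length);
        }
    });

    sequential.queue(arrayOfUrls); // placeholder for the asker's URL array

Raising maxConnections later is a one-line change if the server can tolerate some concurrency, which is why a crawler module can be a reasonable fit even for a simple list.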

– Jason Livesay
- Nothing in the question says anything about scraping or parsing HTML. – Christopher Reid Sep 05 '16 at 10:49
- He says 'or do something with it', so that could very well be scraping, and crawler has built-in configuration for controlling how many requests go out at the same time, etc. – Jason Livesay Sep 05 '16 at 22:10