How to access URLs sequentially

Question

I have about 10K URLs in an array; at some other time this may be 100K. I need to visit them programmatically, obtain each response, and print it out or do something else with it. To avoid overloading the server that hosts all of these URLs, I would like to visit them sequentially. I know the async module can do this. My question is: is async the only way to do this, and will it scale to a larger number of URLs?

There's no need for the async library to do a simple sequential iteration through an array of requests. You could use it, but it is not necessary and there is no scale issue involved in a sequential iteration one after another. — jfriend00, Sep 05 '16 at 06:10
See [How can I throttle stack of API requests](http://stackoverflow.com/questions/35422377/how-can-i-throttle-stack-of-api-requests/35422593#35422593) and [Run 1000 requests so that only 10 runs at a time](http://stackoverflow.com/questions/39141614/run-1000-requests-so-that-only-10-runs-at-a-time/39154813#39154813) and [Make several requests to an API that can only handle 20 request a minute](http://stackoverflow.com/questions/33378923/make-several-requests-to-an-api-that-can-only-handle-20-request-a-minute/33379149#33379149) for implementations of something like you're doing. — jfriend00, Sep 05 '16 at 07:46
async is good. If you're running something sequentially, why would you have to worry about scaling? It's just going to take more time. — Christopher Reid, Sep 05 '16 at 10:45
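
The plain sequential iteration mentioned in the comments above can be sketched without any library. This is only a minimal sketch under stated assumptions: `visitSequentially`, `handle` and `fetchBody` are hypothetical names that do not come from the question, and the default fetcher assumes Node 18 or later, where a global `fetch()` is available.

```javascript
// Minimal sketch: visit URLs strictly one after another, no extra library.
// Assumption: `fetchBody` is any function (url) => Promise<string>.
// The default below assumes Node 18+, which provides a global fetch().
async function defaultFetchBody(url) {
  const response = await fetch(url);
  return response.text();
}

// `handle(error, url, body)` is a hypothetical callback invoked once per URL.
async function visitSequentially(urls, handle, fetchBody = defaultFetchBody) {
  for (const url of urls) {
    let body = null;
    let error = null;
    try {
      // await means the next request starts only after this one has finished
      body = await fetchBody(url);
    } catch (err) {
      // remember the failure and carry on with the remaining URLs
      error = err;
    }
    await handle(error, url, body);
  }
}
```

A call might look like `visitSequentially(urls, (err, url, body) => console.log(url, err ? err.message : body.length))`. Only one request is in flight at any time, so going from 10K to 100K URLs mainly increases the total running time.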

score 0 · Accepted Answer · answered Sep 05 '16 at 07:31

Use a web crawler module such as crawler (or search for the crawler keyword on node-modules.com or npmjs.com).

var Crawler = require("crawler");

var c = new Crawler({
    // Maximum number of simultaneous requests; 1 would visit URLs one at a time
    maxConnections : 10,
    // This will be called for each crawled page
    callback : function (error, result, $) {
        // Skip pages that failed to load or could not be parsed
        if (error || !$) {
            console.error(error);
            return;
        }
        // $ is Cheerio by default,
        // a lean implementation of core jQuery designed specifically for the server
        $('a').each(function(index, a) {
            var toQueueUrl = $(a).attr('href');
            // Only queue anchors that actually have an href
            if (toQueueUrl) {
                c.queue(toQueueUrl);
            }
        });
    }
});

// Queue a list of URLs
c.queue(['http://jamendo.com/','http://tedxparis.com']);

answered Sep 05 '16 at 07:31

Jason Livesay


Nothing in the question says anything about scraping or parsing HTML. – Christopher Reid Sep 05 '16 at 10:49
He says 'or do something with it' so that could very well be scraping, and crawler has built-in configuration for controlling how many requests go out at the same time, etc. – Jason Livesay Sep 05 '16 at 22:10
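
Controlling how many requests go out at the same time, as discussed in the comments above, can also be done in plain JavaScript. The following is a minimal sketch of a small worker pool, not the crawler module's implementation: `mapWithLimit`, `limit` and `task` are hypothetical names, and `task` is assumed to be any function that takes a URL and returns a promise.

```javascript
// Sketch: run at most `limit` tasks at a time; limit = 1 is fully sequential.
// Assumption: `task` is any function (url) => Promise<result>.
async function mapWithLimit(urls, limit, task) {
  const results = new Array(urls.length);
  let next = 0; // index of the next URL to claim

  async function worker() {
    while (next < urls.length) {
      // claiming happens synchronously, so two workers never share an index
      const i = next++;
      try {
        results[i] = { url: urls[i], value: await task(urls[i]) };
      } catch (err) {
        // record the failure instead of aborting the whole run
        results[i] = { url: urls[i], error: err };
      }
    }
  }

  // start up to `limit` workers and wait until all of them run out of URLs
  const workers = [];
  for (let w = 0; w < Math.min(limit, urls.length); w++) {
    workers.push(worker());
  }
  await Promise.all(workers);
  return results;
}
```

The returned array keeps the order of `urls`, with each entry holding either a `value` or an `error`. Raising `limit` trades politeness toward the server for a shorter total running time.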
