
I'm trying to download a list of files generated by an internal processing system via HTTP GET in Node.js. For a single file, or for a few files, it works fine, and there is already an answer for that here on Stack Overflow. However, the problem occurs when you try to download a huge list of files with async requests: the system simply times out and throws an error.

So it's more of a scalability issue. The best approach would be to download the files one by one, or a few at a time, and then move on to the next batch, but I'm not sure how to do that. Here is the code I have so far, which works fine for a few files, but in this case I have ~850 files (a few MBs each) and it does not work:

const https = require("http");
var fs = require('fs');

//list of files
var file_list = [];

file_list.push('http://www.sample.com/file1');
file_list.push('http://www.sample.com/file2');
file_list.push('http://www.sample.com/file3');
// ...
file_list.push('http://www.sample.com/file850');


file_list.forEach(single_file => {
        const file = fs.createWriteStream('files/'+single_file ); //saving under files folder
        https.get(single_file, response => {
          var stream = response.pipe(file);

          stream.on("finish", function() {
            console.log("done");
          });
        });
    });


It runs fine for a few files, creates a lot of empty files in the files folder, and then throws this error:

events.js:288                                                              
      throw er; // Unhandled 'error' event                                 
      ^                                                                    
                                                                           
Error: connect ETIMEDOUT 192.168.76.86:80                                   
    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1137:16)        
Emitted 'error' event on ClientRequest instance at:                        
    at Socket.socketErrorListener (_http_client.js:426:9)                  
    at Socket.emit (events.js:311:20)                                      
    at emitErrorNT (internal/streams/destroy.js:92:8)                      
    at emitErrorAndCloseNT (internal/streams/destroy.js:60:3)              
    at processTicksAndRejections (internal/process/task_queues.js:84:21) { 
  errno: 'ETIMEDOUT',                                                      
  code: 'ETIMEDOUT',                                                       
  syscall: 'connect',                                                      
  address: '192.168.76.86',                                                 
  port: 80                                                                 
}   

It seems to put a huge load on the network; downloading these one by one (or in small batches) would probably work. Please suggest the most scalable solution if possible. Thanks.

Arun Kumar
  • The issue is that you're loading them all at the same time, essentially DDoS-ing the server. You need to limit the threads and use a stack to process them. – Daniel Apr 07 '21 at 17:22
  • @Daniel Thanks for the reply, that's kind of what I thought. So what would be the way to send a few requests in one go, keeping in mind that the server should not get overloaded either, as it also needs to serve other requests at the same time (other than these file requests)? – Arun Kumar Apr 07 '21 at 17:28

3 Answers


The issue is that you're loading them all at the same time, essentially DDoSing the server. You need to limit the threads and use a stack to process them.

Here is a simplified example of what that might look like (untested).

const MAX_THREADS = 3;

const https = require("http");
const fs = require("fs");

const threads = [];

//list of files
const file_list = [];

file_list.push("http://www.sample.com/file1");
file_list.push("http://www.sample.com/file2");
file_list.push("http://www.sample.com/file3");
// ...
file_list.push("http://www.sample.com/file850");

const getFile = (single_file, callback) => {
  const file = fs.createWriteStream("files/" + single_file); //saving under files folder
  https.get(single_file, (response) => {
    const stream = response.pipe(file);

    stream.on("finish", function () {
      console.log("done");
      callback(single_file);
    });
  });
};

const process = () => {
  if (!file_list.length) return;

  const file = file_list.shift(); // take the next URL off the front of the list

  getFile(file, process); // the loop
};

while (threads.length < MAX_THREADS) {
  const thread = "w" + threads.length;
  threads.push(thread);
  process();
}

You don't even need the worker array; a plain for loop to start them would be enough. But you could push an object into the threads pool and use it to keep stats and handle advanced features like retries or throttling.
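For example, a rough (untested) variant of the loop above where each slot in the pool is an object that keeps its own counters; processWith() plays the role of process(), and retry or throttling logic could hang off the same object once getFile() is extended to report errors to its callback:

const workers = [];

const processWith = (worker) => {
  if (!file_list.length) return;

  const next = file_list.shift();
  worker.started++;

  getFile(next, () => {
    worker.finished++;
    processWith(worker); // the loop, same idea as process() above
  });
};

while (workers.length < MAX_THREADS) {
  const worker = { id: "w" + workers.length, started: 0, finished: 0 };
  workers.push(worker);
  processWith(worker);
}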

Daniel

You're sending a zillion requests to the target server all at once. This will massively load the target server and will consume a lot of your local resources as you try to handle all the responses.

The simplest scheme for this is to send one request, then when you get the response, send the next, and so on. That way you only ever have one request in flight at a time.

You can typically improve throughput by managing a small number of requests in flight at the same time (perhaps 3-5).

And, if the target server implements rate limiting, then you may have to slow down the pace of requests you send to it (no more than N per 60 seconds).

There are lots of ways to do this. Here are pointers to some functions that implement a few of them.

mapConcurrent() here and pMap() here: these let you iterate an array, sending requests to a host, while managing things so that you only ever have N requests in flight at the same time, where you decide the value of N.

rateLimitMap() here: Lets you manage how many requests per second are sent.
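As an illustration of the limited-concurrency idea only (this is not the code behind the links above), a minimal, untested sketch might look like the following. downloadOne() and downloadAll() are made-up names, file_list is the array from the question, and encodeURIComponent() is just one way to turn a URL into a safe local filename:

const http = require("http");
const fs = require("fs");

// download a single URL to dest, resolving once the file is fully written
function downloadOne(url, dest) {
  return new Promise((resolve, reject) => {
    const file = fs.createWriteStream(dest);
    http.get(url, (response) => {
      response.pipe(file);
      file.on("finish", () => file.close(resolve));
    }).on("error", reject);
  });
}

// run at most `concurrency` downloads at the same time
async function downloadAll(urls, concurrency = 4) {
  const queue = urls.slice(); // copy so we can shift() from it
  const worker = async () => {
    while (queue.length) {
      const url = queue.shift(); // each worker pulls the next URL
      await downloadOne(url, "files/" + encodeURIComponent(url));
    }
  };
  await Promise.all(Array.from({ length: concurrency }, worker));
}

downloadAll(file_list, 4)
  .then(() => console.log("all files complete"))
  .catch((err) => console.error(err));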

jfriend00

I would personally do something like this:

// currentIndex is the index of the next file to fetch
let currentIndex = 0;
// numWorkers is the maximum number of simultaneous downloads
const numWorkers = 10;
// promises holds each of our workers promises
const promises = [];

// getNextFile will download the next file, and after finishing, will
// then download the next file in the list, until all files have been 
// downloaded
const getNextFile = (resolve) => {
    if (currentIndex >= file_list.length) return resolve();
    const currentFile = file_list[currentIndex];
    // increment index so any other worker will not get the same file.
    currentIndex++;
    const file = fs.createWriteStream('files/' + currentFile ); 
    https.get(currentFile, response => {
        const stream = response.pipe(file);
        stream.on("finish", function() {
            if (currentIndex === file_list.length) {
                resolve();
            } else {
                getNextFile(resolve);
            }
        });
    });
}
for (let i = 0; i < numWorkers; i++) {
    promises.push(new Promise((resolve, reject) => {
        getNextFile(resolve);
    }));         
}

Promise.all(promises).then(() => console.log('All files complete'));
dave