
I need help implementing a file downloader in Node.js.

So I need to download over 25,000 files from a server. I'm using node-fetch, but I don't know exactly how to implement this. I tried using Promise.allSettled(), but I also need a way to limit the number of concurrent requests to the server, otherwise I get rate-limited.

This is my code so far:

const fetch = require('node-fetch')

async function main () {
  const urls = [
    'https://www.example.com/foo.png',
    'https://www.example.com/bar.gif',
    'https://www.example.com/baz.jpg',
    // ... many more (~25k)
  ]

  // how to save each file on the machine with same file name and extension?
  // how to limit the amount of concurrent requests to the server?
  const files = await Promise.allSettled(
    urls.map((url) => fetch(url))
  )
}

main()

So my questions are:

  • How do I limit the number of concurrent requests to the server? Can this be solved using a custom https agent with node-fetch and setting the maxSockets to something like 10?
  • How do I check if the file exists on the server, and if it does, download it to my machine with the same file name and extension?

It would be very helpful if someone could show a small code example of how to implement such functionality.

Thanks in advance.

magic88

1 Answer


To control how many simultaneous requests are in flight at once, you can use any of these three options (a minimal sketch of the underlying pattern follows the list):

  • mapConcurrent() and pMap(): these let you iterate an array, sending requests to a host, while managing things so that you only ever have N requests in flight at the same time, where you decide the value of N.
  • rateLimitMap(): lets you manage how many requests per second are sent.
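
None of those implementations is reproduced here, but they all share the same core idea: start N requests, and each time one settles, start the next. Below is a minimal sketch of that pattern (the name `mapConcurrent` mirrors the first option above, not its actual implementation):

function mapConcurrent(items, limit, fn) {
  return new Promise((resolve, reject) => {
    const results = new Array(items.length);
    let nextIndex = 0;
    let inFlight = 0;
    let failed = false;

    function runNext() {
      if (failed) return;
      if (nextIndex >= items.length) {
        // nothing left to start; resolve once the last worker settles
        if (inFlight === 0) resolve(results);
        return;
      }
      const index = nextIndex++;
      inFlight++;
      Promise.resolve(fn(items[index], index)).then((result) => {
        results[index] = result;
        inFlight--;
        runNext();
      }, (err) => {
        failed = true;
        reject(err);
      });
    }

    if (items.length === 0) return resolve(results);
    // prime the pool with up to `limit` workers
    for (let i = 0; i < Math.min(limit, items.length); i++) runNext();
  });
}

With the question's array, that would be called as `const files = await mapConcurrent(urls, 10, (url) => fetch(url))`.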

Can this be solved using a custom https agent with node-fetch and setting the maxSockets to something like 10?

I'm not aware of any solution using a custom https agent.

How do I check if the file exists on the server, and if it does, download it to my machine with the same file name and extension?

You can't directly access a remote HTTP server's file system. So, all you can do is make an HTTP request for a specific resource (a URL) and examine the HTTP response to see whether it returned data or some sort of HTTP error such as a 404 (the `response.ok` check in the example below is exactly this test).

As for filenames and extensions, that depends entirely on whether you already know what to request (and the server supports that being part of the URL) or whether the server returns that information to you in an HTTP header such as Content-Disposition. If you're requesting a specific filename and extension, you can just create a file with that name and extension and save the HTTP response data to it on your local drive.
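
For URLs like the ones in the question, where the path already ends in `name.ext`, one way to derive the local filename is to take the last path segment. A small sketch under that assumption (the helper name `localNameFor` is made up for illustration):

import path from 'node:path';

// take the last segment of the URL path, e.g.
// 'https://www.example.com/foo.png' -> 'foo.png';
// assumes the path ends in a filename with an extension (a
// Content-Disposition header from the server, if present, should win)
function localNameFor(url) {
  return path.basename(new URL(url).pathname);
}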

As for coding examples, the node-fetch documentation shows an example of downloading data to a file using streams here: https://www.npmjs.com/package/node-fetch#streams.

import {createWriteStream} from 'fs';
import {pipeline} from 'stream';
import {promisify} from 'util';
import fetch from 'node-fetch';

const streamPipeline = promisify(pipeline);

const url = 'https://github.githubassets.com/images/modules/logos_page/Octocat.png';
const response = await fetch(url);

// response.ok is false for HTTP errors such as a 404
if (!response.ok) throw new Error(`unexpected response ${response.statusText}`);

// stream the body straight to disk instead of buffering it in memory
await streamPipeline(response.body, createWriteStream('./octocat.png'));

Personally, I wouldn't use node-fetch, as its design center is to mimic the browser implementation of fetch, which is not as friendly an API design as similar libraries built explicitly for Node.js. I use got(), and there are several other good libraries to choose from; you can pick your favorite.

Here's a code example using the got() library:

import {promisify} from 'node:util';
import stream from 'node:stream';
import fs from 'node:fs';
import got from 'got';

const pipeline = promisify(stream.pipeline);

// stream the response body straight to a local file
await pipeline(
    got.stream('https://sindresorhus.com'),
    fs.createWriteStream('index.html')
);
jfriend00
  • Okay, I see, but how would you do that for multiple concurrent requests at the same time? You only showed an example for one particular URL, but how do I scale that up to ~25k URLs? And how can I `console.log` a message when a download failed or succeeded? – magic88 Dec 29 '21 at 12:29
  • @magic88 - You can just put this code in an `async` function and call it N times in a loop. If you use an array to collect the promises created in the loop, then you can tell when they are all done with `Promise.all()` or `Promise.allSettled()` - see the sketch after these comments. – jfriend00 Dec 29 '21 at 17:40
  • I do recommend node-fetch, or upgrading Node to a version that supports fetch built in; Deno compatibility for future upgrades is the main reason. (Receiving a 500 instead of having an error thrown is moot - the request successfully reached the server, and a 500 is not a transport error... you should expect this when the server you download from has an issue.) – TamusJRoyce Mar 16 '23 at 19:16
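
Putting the answer and the comments together: wrap the single-file download in an `async` function, derive the filename from the URL, and feed the whole array through a concurrency limiter. The sketch below reuses the hypothetical `mapConcurrent()` helper from above (the filename logic is the same as the `localNameFor()` sketch earlier) and assumes a limit of 10; the per-URL `.catch()` produces the success/failure logging asked about in the comments:

import {createWriteStream} from 'node:fs';
import {pipeline} from 'node:stream';
import {promisify} from 'node:util';
import path from 'node:path';
import fetch from 'node-fetch';

const streamPipeline = promisify(pipeline);

// download one URL to a file named after the last segment of its path;
// rejects on network errors and on HTTP errors such as a 404
async function downloadFile(url) {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`${response.status} ${response.statusText} for ${url}`);
  }
  const filename = path.basename(new URL(url).pathname);
  await streamPipeline(response.body, createWriteStream(filename));
  return filename;
}

async function main() {
  const urls = [/* ... many more (~25k) */];

  // at most 10 downloads in flight at once; the catch keeps one bad URL
  // from aborting the rest and logs each result as it settles
  const results = await mapConcurrent(urls, 10, (url) =>
    downloadFile(url)
      .then((file) => { console.log(`saved ${file}`); return file; })
      .catch((err) => { console.error(`failed ${url}: ${err.message}`); return null; })
  );

  const saved = results.filter(Boolean);
  console.log(`${saved.length} of ${urls.length} downloads succeeded`);
}

main();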