
I have a Node.js application that needs to fetch a 6GB zip file from Census.gov and then process its contents. However, when fetching the file using the Node.js https API, the download stops at a different point each time. Sometimes it fails at 2GB, sometimes at 1.8GB, and so on. I am never able to fully download the file from the application, but it downloads fully in the browser. Is there any way to download the full file? I cannot start processing the zip until it is fully downloaded, so my processing code waits for the download to complete before executing.

const file = fs.createWriteStream(fileName);
http.get(url).on("response", function (res) {
    let downloaded = 0;
    res
        .on("data", function (chunk) {
            file.write(chunk);
            downloaded += chunk.length;
            process.stdout.write(`Downloaded ${(downloaded / 1000000).toFixed(2)} MB of ${fileName}\r`);
        })
        .on("end", async function () {
            file.end();
            console.log(`${fileName} downloaded successfully.`);
        });
});
Sam
  • Is there any error printed to console? What kind of processing you want to do on the file? – niceman Aug 13 '22 at 16:18
  • Does this answer your question? [What is the best way to download a big file in NodeJS?](https://stackoverflow.com/questions/44896984/what-is-the-best-way-to-download-a-big-file-in-nodejs) – Christopher Aug 13 '22 at 16:21

1 Answer


You have no flow control on the file.write(chunk). You need to pay attention to the return value from file.write(chunk) and when it returns false, you have to wait for the drain event before writing more. Otherwise, you can overflow the buffer on the writestream, particularly when writing large things to a slow medium like disk.

Without flow control, when you attempt to write large amounts of data faster than the disk can keep up, you will probably blow up your memory usage because the stream has to accumulate more data in its buffer than is desirable.

Since your data is coming from a readable, when you get false back from the file.write(chunk), you will also have to pause the incoming read stream so it doesn't keep spewing data events at you while you're waiting for the drain event on the writestream. When you get the drain event, you can then resume the readstream.

FYI, if you don't need the progress info, you can let pipeline() do all the work (including the flow control) for you. You don't have to write that code yourself. You may even be able to still gather the progress info, by just watching the writestream activity when using pipeline().

Here's one way to implement the flow control yourself, though I'd recommend you use the pipeline() function in the stream module and let it do all this for you if you can:

const file = fs.createWriteStream(fileName);
file.on("error", err => console.log(err));
http.get(url).on("response", function(res) {
    let downloaded = 0;
    res.on("data", function(chunk) {
        let readyForMore = file.write(chunk);
        if (!readyForMore) {
            // pause readstream until drain event comes
            res.pause();
            file.once('drain', () => {
                res.resume();
            });
        }
        downloaded += chunk.length;
        process.stdout.write(`Downloaded ${(downloaded / 1000000).toFixed(2)} MB of ${fileName}\r`);
    }).on("end", function() {
        file.end(); console.log(`${fileName} downloaded successfully.`);
    }).on("error", err => console.log(err));
});

There also appeared to be a timeout issue in the http request. When I added this:

// set client timeout to 24 hours
res.setTimeout(24 * 60 * 60 * 1000);

I was then able to download the whole 7GB ZIP file.

Here's turnkey code that worked for me:

const fs = require('fs');
const https = require('https');
const url =
    "https://www2.census.gov/programs-surveys/acs/summary_file/2020/data/5_year_entire_sf/All_Geographies_Not_Tracts_Block_Groups.zip";
const fileName = "census-data2.zip";

const file = fs.createWriteStream(fileName);
file.on("error", err => {
    console.log(err);
});
const options = {
    headers: {
        "accept-encoding": "gzip, deflate, br",
    }
};
https.get(url, options).on("response", function(res) {
    const startTime = Date.now();

    function elapsed() {
        const delta = Date.now() - startTime;
        // convert to minutes
        const mins = (delta / (1000 * 60));
        return mins;
    }

    let downloaded = 0;
    console.log(res.headers);
    const contentLength = +res.headers["content-length"];
    console.log(`Expecting download length of ${(contentLength / (1024 * 1024)).toFixed(2)} MB`);
    // set timeout to 24 hours
    res.setTimeout(24 * 60 * 60 * 1000);
    res.on("data", function(chunk) {
        let readyForMore = file.write(chunk);
        if (!readyForMore) {
            // pause readstream until drain event comes
            res.pause();
            file.once('drain', () => {
                res.resume();
            });
        }
        downloaded += chunk.length;
        const downloadPortion = downloaded / contentLength;
        const percent = downloadPortion * 100;
        const elapsedMins = elapsed();
        const totalEstimateMins = (1 / downloadPortion) * elapsedMins;
        const remainingMins = totalEstimateMins - elapsedMins;

        process.stdout.write(
            `  ${elapsedMins.toFixed(2)} mins, ${percent.toFixed(1)}% complete, ${Math.ceil(remainingMins)} mins remaining, downloaded ${(downloaded / (1024 * 1024)).toFixed(2)} MB of ${fileName}                                 \r`
        );
    }).on("end", function() {
        file.end();
        console.log(`${fileName} downloaded successfully.`);
    }).on("error", err => {
        console.log(err);
    }).on("timeout", () => {
        console.log("got timeout event");
    });
});
jfriend00
  • Hello jfriend00. Thank you for your answer. I tried the pipeline() as well as your code above and the download still fails. I tried this codebase in Python and same thing happened. Could it be possible that the census.gov website is not allowing to complete this download. This is the url that I am using to download the zip file, https://www2.census.gov/programs-surveys/acs/summary_file/2020/data/5_year_entire_sf/All_Geographies_Not_Tracts_Block_Groups.zip Also I am using: `process.env.NODE_TLS_REJECT_UNAUTHORIZED = "0"` And I am skipping certs rejection. – Sam Aug 14 '22 at 13:13
  • @SurajShrestha - For starters, you have to use `https.get()` instead of `http.get()` when you have an https URL. – jfriend00 Aug 14 '22 at 16:17
  • @SurajShrestha - I ran the code on my system and got a file 1,932,878,060 bytes long. It has the appropriate beginning of the file to be a zip file, but Windows thinks it's invalid. I was surprised how slow it downloaded. Since I have a fast network, it must be slow on the source. It's large, but still went pretty slowly. I see from the browser that the file is supposed to be 6.6GB in size and the browser thinks the download will take 2 hours over my WiFi. – jfriend00 Aug 14 '22 at 17:43
  • @SurajShrestha - The only guess I have is that something somewhere is timing out after awhile and not allowing a download to go that long. I'm running some additional experiments, but those experiments take awhile. Will let you know if I discover anything else. – jfriend00 Aug 14 '22 at 17:44
  • @SurajShrestha - I made two meaningful changes and my download is now still going much longer than before (it still has another 30 minutes to complete). First, I added `res.setTimeout(24 * 60 * 60 * 1000);` to set the client-side timeout to 24 hours so the http client wouldn't timeout. I think that was probably the main culprit. Second, I added a header on the request `"accept-encoding": "gzip, deflate, br"`. I discovered that the browser was sending this header and when sending it, it caused the server to send back the `content-length` which I found useful in my testing. – jfriend00 Aug 14 '22 at 19:44
  • @SurajShrestha - I don't know if the second change to add the header is necessary or not, but it takes several hours to run a test without it to verify whether it's needed or not so I thought I'd share my info without waiting that long. And, having the content-length, allows my progress update to include the %complete and to provide an estimate of the minutes remaining, both of which are useful in monitoring something that takes several hours. – jfriend00 Aug 14 '22 at 19:45
  • @SurajShrestha - I got it to work and downloaded the whole 7GB zip file. I've updated my answer to include the code I ran to get it work. I think the key was `res.setTimeout(24 * 60 * 60 * 1000)`. – jfriend00 Aug 14 '22 at 20:21
  • jfriend00, Thank you so much for your help. Regarding the use of http instead of https. I misspelled it cause I was doing this `import http from "https"`. Your solution worked for me and I am very grateful. Thank you for all your help. – Sam Aug 15 '22 at 02:45
  • Any suggestion for this question https://stackoverflow.com/questions/73436311/running-a-node-js-application-once-every-year – Sam Aug 21 '22 at 16:35