
I am making a Node.js application, and part of my code requests data from 193 different URLs to download the JSON data from each one. Here is one of those URLs: https://www.gemeentegeschiedenis.nl/gemeentenaam/json/Apeldoorn For some of them the downloaded JSON data is fine and complete. However, towards the end, some of the files get corrupted: part of the data becomes nullified, and some contain database errors. I think it has to do with requesting data from so many URLs in a short amount of time (which is why I tried the "setTimeout" function, but that doesn't really work).

function writeToFile(url) {
    // get name to make each new file unique
    var name = url.split("json/")[1];
    var fileStream = fs.createWriteStream(`jsonFiles/${name}.json`);
    var options = {
        url: `${url}`,
        method: 'GET',
        headers: {
            'Accept': 'application/json',
            'Accept-Charset': 'utf-8',
            json: true
        }
    }
    //request the data from the site and download to the file.
    request.get(options).pipe(fileStream);
}
function getMunicipalityGeoJsonData(req, res) {
    //Get all the urls pointing to the JSON data for the province, Gelderland
    getGelderlandJsonUrls((err, jsonUrls) => {
        //for all those urls, write the data to files.
        for (url of jsonUrls) {
            console.log(url);
            writeToFile(url);
        }
    })
}
function getGelderlandJsonUrls(callback) {
    getMunicipalityJsonUrls("Gelderland", (err, data) => {
        jsonUrls = data;
        callback(null, jsonUrls);
    });
}
function getMunicipalityJsonUrls(provinceName, callback) {
    request({ uri: `https://www.gemeentegeschiedenis.nl/provincie/json/${provinceName}` }, (error, response, body) => {
        body = JSON.parse(body);
        // extracting each json URL from all the municipalities in Gelderland
        var jsonUrls = [];
        var numberMun = body.length;
        for (var i = 0; i < numberMun; i++) {
            var url = body[i].uri.naam;
            var urlSplit = url.split("gemeentenaam");
            var jsonUrl = urlSplit[0] + "gemeentenaam/json" + urlSplit[1];
            jsonUrl = jsonUrl.replace("http://", "https://");
            jsonUrls.push(jsonUrl);
        }
        callback(null, jsonUrls);
    });
}

The last JSON data was downloaded into its file as an HTML page containing a database error, from the URL https://www.gemeentegeschiedenis.nl/gemeentenaam/json/Zutphen, which took just under 6 seconds to load according to the Network tab in Chrome. For https://www.gemeentegeschiedenis.nl/gemeentenaam/json/Winssen (which took just over a second to load in Chrome), the 1812 entry has null for its properties when it should have a bunch of coordinates.

I am a noob at Node, but please help me fix this issue, maybe with some sort of check for whether the data is corrupted. Thanks for the help in advance :)

EDIT: I am trying to process up to 200 URLs at a time in the for loop.

  • Welcome to stackoverflow. The rules here require that you post the code for a minimal, reproducible example of your problem in your question. You must include the code itself in the question and format it appropriately. Code should not only be in an external link because external links have a habit of going dead or the content changing over time which renders the question useless as a long term reference for others. Please put the relevant portion of your code in your actual question. You can use the "edit" link below your question to modify it. – jfriend00 Mar 16 '20 at 05:32
  • And, when you do post code, it should always be text, not images. When it's images, people can't copy paste it into answers or into their own test programs. – jfriend00 Mar 16 '20 at 05:32
  • Are you trying to write each response to the same file? Or to separate files? If it's the same file, then you can't run that in parallel, you will have to sequence each of your operations one after the other. – jfriend00 Mar 16 '20 at 05:34
  • You also have NO error handling anywhere in your code. There are probably errors that you are ignoring and/or not seeing. Put appropriate error handling on both the write stream and the request stream. You need to not be blind here - you need to see where errors are occurring. FYI, this should ALWAYS be one of the first things you do when code isn't working like this. Look for places you aren't logging an error and go overboard in logging every possible error. That's how you see what's going on rather than just have your head in the sand and wonder why it isn't working. – jfriend00 Mar 16 '20 at 05:36
  • And, your `setTimeout()` doesn't help anything. All it does is wait 5 seconds and then starts them all at once. It doesn't space them out 5 seconds apart. If you edit your question to add a text version of the code, I will take a stab at rewriting it, but for me to help, you will have to respond before I go to sleep tonight. – jfriend00 Mar 16 '20 at 05:38
  • How many `jsonUrls` are you trying to process? – jfriend00 Mar 16 '20 at 05:40
  • You may benefit from one of the functions such as `mapConcurrent()`, `runN()` or `pMap()` mentioned in this answer: [Batching lots of asynchronous operations](https://stackoverflow.com/questions/59976352/properly-batch-nested-promises-in-node/59976509#59976509). – jfriend00 Mar 16 '20 at 05:42
  • Sorry I haven't looked at this until now. I am about to go to sleep too soon. I will have to edit the question more later on. But I will look at that link. Thanks! – Jordi Kloosterboer Mar 16 '20 at 06:22

1 Answer

First off, add proper error handling to getMunicipalityJsonUrls() and to getGelderlandJsonUrls(). This means:

  1. Check the err parameter everywhere it's present and propagate the error back to the caller.
  2. Capture possible errors from JSON.parse().
  3. Check the HTTP statusCode.

Here's that fixed up code:

function getMunicipalityJsonUrls(provinceName, callback) {
    request({ uri: `https://www.gemeentegeschiedenis.nl/provincie/json/${provinceName}` }, (err, response, body) => {
        if (err) {
            callback(err);
            return;
        }
        if (response.statusCode !== 200) {
            callback(new Error(`http status code ${response.statusCode}`));
            return;
        }
        try {
            // extract each municipality's JSON url and force https
            const jsonUrls = JSON.parse(body).map(municipality => {
                const url = municipality.uri.naam;
                const urlSplit = url.split("gemeentenaam");
                const jsonUrl = urlSplit[0] + "gemeentenaam/json" + urlSplit[1];
                return jsonUrl.replace("http://", "https://");
            });
            callback(null, jsonUrls);
        } catch(e) {
            callback(e);
        }
    });
}

function getGelderlandJsonUrls(callback) {
    getMunicipalityJsonUrls("Gelderland", (err, data) => {
        if (err) {
            callback(err);
        } else {
            callback(null, data);
        }
    });
}

Then, in writeToFile(), add error handling and completion monitoring. I chose to wrap it in a promise rather than a plain callback because I want to use it with some utilities that work with promises.

function writeToFile(url) {
    return new Promise((resolve, reject) => {
        // get name to make each new file unique
        var name = url.split("json/")[1];
        var fileStream = fs.createWriteStream(`jsonFiles/${name}.json`);
        fileStream.on('error', (e) => {
            reject(e);
        });
        var options = {
            url: `${url}`,
            method: 'GET',
            headers: {
            'Accept': 'application/json',
            'Accept-Charset': 'utf-8'
        }
        }
        //request the data from the site and stream it to the file,
        //rejecting on a request error and resolving when the file write finishes
        request.get(options).on('error', (e) => {
            reject(e);
        }).pipe(fileStream).on('finish', () => {
            resolve(url);
        });
    });
}

Now, we need to decide how to loop through all the URLs. If any of the URLs could ever end up writing to the same file (if that's even a remote possibility), then you have to serialize them so that no more than one asynchronous operation is ever trying to write to the same file at the same time, because that would just mess up the file. If that is the case, you can serialize the writes like this:

// option 1 - serialize writing to files
function getMunicipalityGeoJsonData(req, res) {
    //Get all the urls pointing to the JSON data for the province, Gelderland
    //(the callback is async so we can await each file write in sequence)
    getGelderlandJsonUrls(async (err, jsonUrls) => {
        if (err) {
            console.log(err);
            res.sendStatus(500);
        } else {
            try {
                //for all those urls, write the data to files.
                for (const url of jsonUrls) {
                    console.log(url);
                    await writeToFile(url);
                }
                res.send("All done");
            } catch(e) {
                console.log(e);
                res.sendStatus(500);
            }
        }
    });
}

If you are absolutely sure that none of these URLs will ever cause writing to the same file, then you can run N of them at a time, where you determine the lowest value of N that gets you decent performance. Higher values of N consume more peak resources (memory and file handles); lower values of N run fewer things in parallel. If the target hostnames all point to the same server, you usually don't want N to be more than about 5. If the target hosts you are retrieving data from are all different, you can experiment with values of N up to maybe 20.

// option 2 - run N at a time in parallel
function getMunicipalityGeoJsonData(req, res) {
    //Get all the urls pointing to the JSON data for the province, Gelderland
    getGelderlandJsonUrls((err, jsonUrls) => {
        if (err) {
            console.log(err);
            res.sendStatus(500);
        } else {
            //for all those urls, write the data to files.
            const numConcurrent = 5;
            mapConcurrent(jsonUrls, numConcurrent, writeToFile).then(() => {
                res.send("All done");
            }).catch(err => {
                console.log(err);
                res.sendStatus(500);
            });
        }
    })
}

The mapConcurrent() function comes from this answer: Promise.all consumes all my RAM. It expects you to pass it an array of items to iterate over, the maximum number you want in flight at the same time, and a function that will be passed an array item and return a promise that settles when the item is done or has an error:

function mapConcurrent(items, maxConcurrent, fn) {
    let index = 0;
    let inFlightCntr = 0;
    let doneCntr = 0;
    let results = new Array(items.length);
    let stop = false;

    return new Promise(function(resolve, reject) {

        function runNext() {
            let i = index;
            ++inFlightCntr;
            fn(items[index], index++).then(function(val) {
                ++doneCntr;
                --inFlightCntr;
                results[i] = val;
                run();
            }, function(err) {
                // set flag so we don't launch any more requests
                stop = true;
                reject(err);
            });
        }

        function run() {
            // launch as many as we're allowed to
            while (!stop && inFlightCntr < maxConcurrent && index < items.length) {
                runNext();
            }
            // if all are done, then resolve parent promise with results
            if (doneCntr === items.length) {
                resolve(results);
            }
        }

        run();
    });
}

There are comparable functions in Bluebird's Promise.map() and in the Async library.
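
For example, here's a minimal sketch of option 2 using Bluebird's Promise.map() with its concurrency option in place of mapConcurrent() (this assumes you have installed Bluebird with npm install bluebird; it is not part of the original answer):

const Promise = require('bluebird');

function getMunicipalityGeoJsonData(req, res) {
    //Get all the urls pointing to the JSON data for the province, Gelderland
    getGelderlandJsonUrls((err, jsonUrls) => {
        if (err) {
            console.log(err);
            res.sendStatus(500);
        } else {
            // run up to 5 writeToFile() operations at a time
            Promise.map(jsonUrls, writeToFile, { concurrency: 5 }).then(() => {
                res.send("All done");
            }).catch(err => {
                console.log(err);
                res.sendStatus(500);
            });
        }
    });
}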


So, using this code you now have the ability to control how many of your request/writeToFile() operations are in flight at the same time, and you are capturing and logging all possible errors. You can tune how many run at once for the best performance and lowest resource use, and if any errors occur, you will be logging them so you can debug.

This code is currently set to stop processing any further URLs as soon as it gets an error. You can change that by tweaking mapConcurrent() if you want it to continue on to the other URLs after an error. But, I would still make sure you log any errors so you know when they happen and can investigate why.
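
For example, here's one possible way to tweak it (a sketch, not part of the original answer): record the error in the results array for that item and keep going, instead of setting the stop flag and rejecting the whole operation. With this version, the resolved results array contains a mix of values and Error objects that you can inspect afterwards:

// inside mapConcurrent(), replace runNext() with a version that
// records failures instead of stopping all further requests
function runNext() {
    let i = index;
    ++inFlightCntr;
    fn(items[index], index++).then(function(val) {
        results[i] = val;
    }, function(err) {
        // log and remember the error for this item, but keep processing the rest
        console.log(err);
        results[i] = err;
    }).finally(function() {
        ++doneCntr;
        --inFlightCntr;
        run();
    });
}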


One other note: if this were my code, I would convert everything to promises (no plain callbacks), and I'd use the got() library instead of the now-deprecated request() library. I don't write any new code with the request() library.
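
For instance, here's a minimal sketch of writeToFile() rewritten with got() and promises (an illustration, not part of the original answer; it assumes got v11, which still supports require(), and Node 15+ for the promise-based stream.pipeline):

const fs = require('fs');
const got = require('got');
const { pipeline } = require('stream/promises');

async function writeToFile(url) {
    // get name to make each new file unique
    const name = url.split("json/")[1];
    // stream the response body straight into the file; pipeline() rejects
    // if either the request or the file write fails
    await pipeline(
        got.stream(url, { headers: { 'Accept': 'application/json' } }),
        fs.createWriteStream(`jsonFiles/${name}.json`)
    );
    return url;
}

Because it already returns a promise, it drops straight into either of the two looping options above.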

jfriend00
  • "Thanks for the feedback! Votes cast by those with less than 15 reputation are recorded, but do not change the publicly displayed post score." I upvoted but stakoverflow doesn't care :P Anyway. I just want to let you know this worked ( I just changed some minor things to get it to completely work.) I'll also take a look into the got() library. I didn't know request() was depreciated. Thank you so much for taking this time to educate me! I used the option 2 for getMunicipalityGeoJsonData since the file names a derivative from the url names. :D – Jordi Kloosterboer Mar 16 '20 at 17:04
  • @JordiKloosterboer - If this answered your question, even with less than 15 reputation, you can still "accept" the answer. You do this by clicking the checkmark to the left of the answer and this indicates to the community that your question has now been answered. It also earns you some reputation points for following the proper procedure here. You can only accept one answer per question (you would typically pick the "best answer" if there is more than one good answer). – jfriend00 Mar 16 '20 at 17:10