
My goal is to insert VERY large CSVs, so right now I use CSV streaming like so:

            var myCollection = db.collection(myCollectionId);

            var q = async.queue(myCollection.insert.bind(myCollection), 10); // up to 10 concurrent insert calls

            csv()
            .from.path(myFilePath, {columns: true}) 
            .transform(function(data, index, cb){

                    // push each parsed row onto the insert queue
                    q.push(data, function (err, res) {
                        if (err) return cb(err);
                        cb(null, res[0]);
                    });

            })
            .on('end', function () {

                q.drain = function() {
                       // called once the last queued insert has finished
                       // do some stuff
                };

            })
            .on('error', function (err) {
                res.writeHead(500);
                res.end(err.message);
                console.log('on.error() executed');
            });

        });

But when the files get REALLY large (70 MB+), streaming them makes my server very slow and the import takes forever, and when I try to load pages on the website they are lethargic during this process.

Why is it not possible to execute a mongo insert using a cron job like this? I ask because the same insert takes maybe 30 seconds from the mongo command line.

P.S. Don't mind the readFile and lines parts; I am doing that because I want to test for when all the lines have been inserted into the collection after the process starts (I haven't implemented this yet; a sketch of one possible check follows the code below).

var cronJob = require('cron').CronJob;
var spawn = require('child_process').spawn; 
var fs = require('fs');
function MongoImportEdgeFile(dataID, filePath) {

    var scriptPath = "/local/main/db/mongodb-linux-x86_64-2.4.5/bin/mongoimport";
    console.log("script path = " + scriptPath);
    var output = "";

    fs.readFile(filePath, 'utf-8', function (err, data) {

        if (err) {
            console.log(err);
            throw err;
        }

        //console.log('data = '+data);
        var lines = data.split('\n');
        console.log("total lines in file = " + lines.length);

        var job = new cronJob(new Date(), function () {
            // store reference to 'this', which is the cronJob object. needed to stop the job after mongoimport is done executing.
            var context = this;

            // run mongoimport asynchronously; spawn() needs each flag and value as a separate array element
            var script = spawn(scriptPath, [
                '-d', 'mydb',
                '-c', 'Data_ForID_' + dataID,
                '--type', 'csv',
                '--file', filePath
            ]);
            console.log("Executing mongoimport via node-cron: " + scriptPath);

            // mongoimport has finished, so complete the cron job and fire the completion callback
            script.on('close', function () {
                console.log('inside script.on(close, function() for import');
                context.stop();
            });
        }, function () {
            // callback function that executes upon completion
            console.log("Finished executing import");

        }, true);

    });

}
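
For the check mentioned in the P.S., here is one possible sketch. It assumes you still have a node-mongodb-native handle to the collection; the waitForImport name and the poll interval are just placeholders:

    // Poll the collection until it holds at least `expectedCount` documents.
    function waitForImport(collection, expectedCount, done) {
        var timer = setInterval(function () {
            collection.count(function (err, n) {
                if (err) { clearInterval(timer); return done(err); }
                if (n >= expectedCount) {
                    clearInterval(timer);
                    done(null, n);
                }
            });
        }, 5000); // poll every 5 seconds
    }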

SOUser

2 Answers


You shouldn't use individual insert calls. You're forcing mongo to perform internal sync with each call -- I think it's even worse given your parallel approach.

Use bulk insertion: it's as easy as calling insert() with an array.
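
For illustration, a rough sketch of doing that in batches (assuming the node-mongodb-native driver and the async library; the bulkInsert name and the batch size of 1000 are just placeholders to tune for your data):

    var async = require('async');

    // insert docs in fixed-size batches instead of one document per call
    function bulkInsert(collection, docs, batchSize, done) {
        var batches = [];
        for (var i = 0; i < docs.length; i += batchSize) {
            batches.push(docs.slice(i, i + batchSize));
        }
        // one insert() round trip per batch, run sequentially
        async.eachSeries(batches, function (batch, cb) {
            collection.insert(batch, { w: 1 }, cb);
        }, done);
    }

    // e.g. bulkInsert(myCollection, rows, 1000, function (err) { /* ... */ });

Smaller batches also keep each insert call well under the 16 MB BSON limit that the comments below run into.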

salezica
  • When I try to insert as an array of documents I get Error: Document exceeds maximum allowed bson size of 16777216 bytes – SOUser Mar 06 '14 at 16:16
  • And I have already inspected the array; all documents have the form { a: 'hi', b: 'there', c: 'bye' }. When I go from size 100,000 to 1,000,000 I get the error. – SOUser Mar 06 '14 at 16:19
  • Don't insert each document separately, but don't insert **all** of them at once either! Mongo is obviously not designed for a million inserts per second, but also not for 70 MB in a single insert. Write code to perform a series of bulk inserts with a configurable maximum, and toy with the number till you find an adequate value – salezica Mar 06 '14 at 16:22

You could execute mongoimport directly from node by creating a child process. Here's an article on using mongoimport to import a CSV. You can also import JSON.
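
For example, something along these lines (a sketch; the database, collection, and file path are placeholders, and note that spawn() wants each flag and value as a separate array element):

    var spawn = require('child_process').spawn;

    var mongoimport = spawn('mongoimport', [
        '-d', 'mydb',
        '-c', 'myCollection',
        '--type', 'csv',
        '--headerline',             // first CSV line holds the field names
        '--file', '/path/to/file.csv'
    ]);

    mongoimport.stderr.on('data', function (chunk) {
        console.error(chunk.toString());
    });

    mongoimport.on('close', function (code) {
        console.log('mongoimport exited with code ' + code);
    });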

Somehow I missed the part about using mongoimport inside cron. If I understand correctly, it looks like you somehow know which CSVs you would like to import, and you are using cron to check for them.

Have you considered a message queue? This will allow your processor to receive the import job instantaneously instead of on an interval. This will also throttle your processing.

If you need more throughput, you could create additional listener processes that are attached to the same queue. They will compete for the next job. This will allow your solution to scale.
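
As a rough sketch of what a worker could look like, assuming RabbitMQ via the amqplib package (the queue name, message shape, and runImport helper are all just illustrative; runImport could wrap the mongoimport spawn shown above):

    var amqp = require('amqplib/callback_api');

    amqp.connect('amqp://localhost', function (err, conn) {
        if (err) throw err;
        conn.createChannel(function (err, ch) {
            if (err) throw err;
            ch.assertQueue('csv-imports', { durable: true });
            ch.prefetch(1); // one import job per worker at a time
            ch.consume('csv-imports', function (msg) {
                var job = JSON.parse(msg.content.toString());
                runImport(job.filePath, function (err) {
                    if (!err) ch.ack(msg); // only ack once the import succeeded
                });
            }, { noAck: false });
        });
    });

Starting more copies of this process against the same queue gives you the competing listeners described above.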

Josh C.
  • @SOUser I am suggesting you do this in Node.js, not the command line. However, I made an adjustment to my answer. – Josh C. Mar 06 '14 at 15:26