
I have several large collections in a MongoDB database, some around 500 MB in size.

(Collections grow over time, reaching 500 MB and beyond.)
    Collection1:
        109,091 documents
        total size: 154.3 MB
        avg. document size: 1.4 KB
    Collection2:
        102,197 documents
        total size: 15.1 MB
        avg. document size: 155 B
    Collection3:
        319 documents
        total size: 115.8 KB
        avg. document size: 372 B

Collection1 relates to Collection2 on the `acc` field, and Collection1 relates to Collection3 on the `tc` field.
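Since the collections are related on those fields, the join could be pushed to the server with an aggregation `$lookup` instead of issuing one query per document from Node. A minimal sketch, assuming the collections are named `collection1`, `collection2`, and `collection3` (the real names and field shapes may differ):

```javascript
// Aggregation pipeline run against Collection1 that joins to Collection2
// on `acc` and to Collection3 on `tc`, server-side, in one round trip.
// Collection and field names here are assumptions based on the description above.
const pipeline = [
    // keep only documents that have the fields being joined on
    { $match: { tn: { $ne: null }, acc: { $ne: null }, tc: { $ne: null } } },
    // join to collection2 on the acc field
    { $lookup: { from: 'collection2', localField: 'acc', foreignField: 'acc', as: 'account' } },
    // join to collection3 on the tc field
    { $lookup: { from: 'collection3', localField: 'tc', foreignField: 'tc', as: 'product' } },
    // drop documents where either lookup matched nothing
    { $match: { account: { $ne: [] }, product: { $ne: [] } } }
];

// With Mongoose this would be consumed as, e.g.:
//   Collection1.aggregate(pipeline).cursor().exec()
```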

The only index on each collection is the default `_id`, which I don't really use to query the collections.
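With no index on `acc` or `tc`, every `findOne` against Collection2 or Collection3 is a full collection scan. Indexes on the lookup fields are the first thing to add; a sketch of the specs, assuming the collection and field names above (adjust to the real ones):

```javascript
// Index specifications for the fields used in the per-document lookups.
// Names are assumptions taken from the description above.
const indexSpecs = {
    collection2: { acc: 1 },  // Collection2 is queried by acc
    collection3: { tc: 1 }    // Collection3 is queried by tc
};

// With a live connection these would be created once, e.g. from the shell:
//   db.collection2.createIndex({ acc: 1 })
//   db.collection3.createIndex({ tc: 1 })
```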

I am running Node.js on my local machine: Windows 10, 16 GB RAM.

The MongoDB database is on a Linux machine which I don't have access to.

I am trying to iterate through one of the collections and process each record while fetching the corresponding records from the other collections.

processStream: function (callback) {
    var stream = Collection.find().cursor();
    stream.on('data', function (doc) {
        if (doc.tn !== null && (doc.tc !== null || doc.cTc !== null) && doc.acc !== null) {
            // function that does a findOne by the acc field
            module.exports.getAcc(doc.acc, function (res) {
                if (res !== null) {
                    // function that does a findOne by the tc field
                    module.exports.getProd(doc.tc, function (res2) {
                        if (res2 !== null) {
                            // write to another mongodb collection
                        }
                    });
                }
            });
        }
    }).on('error', function (err) {
        console.log(err);
    }).on('close', function () {
        console.log('processing finished');
        callback();
    });
}
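Note that even with the per-document lookups kept, the `'data'` handler above fires as fast as the cursor can read, so an unbounded number of `findOne` calls pile up in flight at once. One general pattern is to cap concurrency over the source. A hypothetical, framework-agnostic sketch (`processDoc` stands in for the `getAcc`/`getProd` work; a Mongoose cursor can be consumed the same way via `for await (const doc of Collection.find().cursor())`):

```javascript
// Process items from any async iterable with at most `limit` handlers
// in flight at a time. Generic sketch, not tied to Mongoose.
async function processWithLimit(source, limit, processDoc) {
    const inFlight = new Set();
    for await (const doc of source) {
        const p = Promise.resolve(processDoc(doc))
            .finally(() => inFlight.delete(p)); // free the slot when done
        inFlight.add(p);
        if (inFlight.size >= limit) {
            await Promise.race(inFlight); // wait for one slot to open
        }
    }
    await Promise.all(inFlight); // drain the remaining handlers
}
```

With the event-stream API instead, the equivalent is calling `stream.pause()` inside the `'data'` handler and `stream.resume()` once the lookups for that document finish.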

Then in the routes file I have the following call

app.get('/api/process/', function(req, res){

    console.time('proc');
    functions.processStream(function(){
        console.timeEnd('proc');
        console.log('...operations ended');
    });
    res.json('...processing started');
});

This takes a long time, and the bigger the collection the longer it takes. Is there another, faster way to loop over a MongoDB collection and process each of its records? I am sure Node/MongoDB/Mongoose can be used for much larger collections (GBs, maybe).

To give a sense of scale: just looping through the records of the collections, without any processing, took 19 minutes.

  • Not really seeing how this deviates from your previous question ( [NodeJS process each record of a very large mongodb collection blocks event loop](https://stackoverflow.com/q/47103450/2313887) ), other than writing some different code to what you did before, and still not what you were advised to do. *"It takes a long time"* does not really tell us much. Why are you doing this? What is the intended purpose? All you are doing is pulling all the data ( and related data ) without filter, so you should really explain your intent. There is also notably no code here showing any output. – Neil Lunn Nov 04 '17 at 22:48
  • @NeilLunn there is no output. I just want to make sure that whatever structure I use going forward will truly scale and that I am not breaking any Node.js principles. The intended purpose is to do reconciliation between several systems. If A->B->C are different systems and produce different collections of records, I want to process them all in order to understand which are in sync and which are not. I would also like to do historical statistics as the data from the different systems changes through time. – JoaoFilipeClementeMartins Nov 04 '17 at 22:51
  • The reason I asked you to clarify both here and on the previous question is "without context" all you are really asking here is "how do I join?", which is essentially already answered and posting again is really just inviting copies of existing answers. Wrapping that in *"It takes too long..."* does not tell us anything. How many documents? What kind of hardware? What indexes are in place if any? Just to name a few. – Neil Lunn Nov 04 '17 at 22:57
  • @NeilLunn I totally understand where you are coming from. Edited the question to include the relevant information. – JoaoFilipeClementeMartins Nov 04 '17 at 23:06
  • *"which I don't really use to query the collections."* And you are asking why this is slow? So how are the collections actually "related"? That's really a crucial question. Nothing here shows how the other data is being "looked up", though there is some indication on your previous question. Add indexes for the fields you are querying on. And simply use [`$lookup`](https://docs.mongodb.com/manual/reference/operator/aggregation/lookup/) like you were advised before. Holding the question again until you can actually show some reason why this is different. I don't yet see how it is or can be. – Neil Lunn Nov 04 '17 at 23:16
  • @NeilLunn adding the indexes now. Also, what do you mean by relate? like in a sql db where you would have foreign keys? – JoaoFilipeClementeMartins Nov 04 '17 at 23:24
  • Do the queries need to be changed to accommodate the indexed fields? – JoaoFilipeClementeMartins Nov 04 '17 at 23:32
  • What I think you should do here is take some time to pour over the answers on the linked question, and probably just "google" for "MongoDB Join" in general. Then of course is the documentation of `$lookup` ( already linked ) and again "google" for material that may show other peoples usage. Then sit down for a day and read all of [Indexes](https://docs.mongodb.com/manual/indexes/) in the documentation. Bottom line is that your approach here does not show a lot of research. This is not a relational database, and you need to understand what you are doing and most importantly "why". – Neil Lunn Nov 04 '17 at 23:38
  • Please don't think I haven't done my research. I just never used node or mongodb before. Trying to figure my way through. I appreciate your patience and guidance. – JoaoFilipeClementeMartins Nov 04 '17 at 23:41

0 Answers