2

I must do some data processing for one of my company's clients. They have a database of about 4.7GB of data. I need to add a field to each of these documents calculated using two properties of the mongo documents and an external reference.

My problem is that I cannot call collection.find() and load the results, because Node.js runs out of memory. What is the best way to iterate through an entire collection that is too large to load with a single call to find?

awimley

1 Answer

5

Yes, there is a way. MongoDB is designed to handle large datasets.

You are probably running out of memory, not because of db.collection.find(), but because you are trying to dump it all at once with something like db.collection.find().toArray().

The correct way to operate over result sets that are bigger than memory is to use cursors. Here's how you'd do it in the mongo shell:

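// values that come from outside the documents (the external reference)
var outsidevars = {
   "z": 5
};

// compute the new field from two document properties and the outside value
var manipulator = function(document, outsidevars) {
    var newfield = document.x + document.y + outsidevars.z;
    document.newField = newfield;
    return document;
};

// find() returns a cursor; nothing is loaded until the cursor is iterated
var cursor = db.collection.find();

while (cursor.hasNext()) {
    // load only one document from the result set into memory
    var thisdoc = cursor.next();
    var newdoc = manipulator(thisdoc, outsidevars);
    // write the modified document back, matched by its _id
    db.collection.update({"_id": thisdoc['_id']}, newdoc);
}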
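
Since the question is about Node.js, the same cursor approach can be sketched there as well. This is only a rough outline, assuming the official `mongodb` driver, whose cursors can be consumed with `for await` and whose collections expose `bulkWrite`. The names `computeField`, `addComputedField`, `outsideVars` and `batchSize` are made up for illustration, and the x, y and z fields are taken from the shell example above.

```javascript
// Sketch only: stream a collection through a cursor and write back in batches.
// `collection` is assumed to be a Collection from the official `mongodb` driver.
// `computeField`, `addComputedField`, and `batchSize` are hypothetical names.

// compute the new value from two document properties and an outside value
function computeField(doc, outsideVars) {
  return doc.x + doc.y + outsideVars.z;
}

async function addComputedField(collection, outsideVars, batchSize = 1000) {
  let ops = [];
  let updated = 0;

  // fetch only the fields needed for the calculation
  const cursor = collection.find({}, { projection: { x: 1, y: 1 } });

  // the cursor pulls documents from the server in batches as the loop
  // advances, so the whole collection is never held in memory at once
  for await (const doc of cursor) {
    ops.push({
      updateOne: {
        filter: { _id: doc._id },
        update: { $set: { newField: computeField(doc, outsideVars) } },
      },
    });

    // flush a batch of updates once enough have accumulated
    if (ops.length >= batchSize) {
      await collection.bulkWrite(ops, { ordered: false });
      updated += ops.length;
      ops = [];
    }
  }

  // flush whatever is left after the cursor is exhausted
  if (ops.length > 0) {
    await collection.bulkWrite(ops, { ordered: false });
    updated += ops.length;
  }
  return updated;
}
```

With a connected client this could be called as `await addComputedField(client.db('mydb').collection('docs'), { z: 5 })`, where the database and collection names are placeholders. Using `$set` writes only the new field instead of replacing the whole document, and batching the writes keeps the number of round trips down. The batch size of 1000 is an arbitrary starting point, not a tuned value.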
code_monk
  • Good answer, however I asked how to do it in node.js, not in the mongo console. The linked question in the comments has a better answer, so I'm flagging this as a duplicate. – awimley Oct 09 '15 at 19:52
  • How does it compare with cursor's forEach method? – WoLfPwNeR Apr 02 '19 at 00:35
  • Currently, the method to get the document is `next()`, not `getNext()`. – Steven Spungin Jan 06 '20 at 13:47
  • Thanks @StevenSpungin. I changed it – code_monk Jan 08 '20 at 18:44
  • How does this work on a live database that has writes coming in while this script is running? Will the new documents be included in the iteration? Also, am I right that this will lock up the database and that I should add a timeout between every update? – Stefan Aug 23 '21 at 14:20