
I've got a MongoDB instance with a collection in it which has around 17 million records.

I wish to alter the document structure of all 17 million documents (to add a new attribute to each document), so that I don't have to programmatically deal with different structures, and so that queries are easier to write.

I've been told, though, that if I run an update script to do that, it will lock the whole database, potentially taking down our website.

What is the easiest way to alter the document without this happening? (I don't mind if the update happens slowly, as long as it eventually happens)

The query I'm attempting to do is:

db.history.update(
  { type : { $exists: false }},
  {
    $set: { type: 'PROGRAM' }
  },
  { multi: true } 
)
Bruce Lowe

2 Answers


You can update the collection in batches (say, half a million documents per batch); this will distribute the load.
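A minimal sketch of one way to batch it (the batch size, the sleep() pause between batches and the progress printout are my own assumptions, not something the answer prescribes): repeatedly grab the _ids of documents that still lack the field and update just that slice.

// Sketch only: batch the multi-update so each pass touches a bounded number
// of documents. Batch size, sleep() and print() are assumptions.
var batchSize = 10000;   // kept smaller than half a million so the $in list stays small
var total = 0;

while (true) {
  // Grab the _ids of the next batch of documents that still lack "type".
  var ids = db.history.find({ type: { $exists: false } }, { _id: 1 })
                      .limit(batchSize)
                      .toArray()
                      .map(function (doc) { return doc._id; });

  if (ids.length === 0) break;   // nothing left to update

  db.history.update(
    { _id: { $in: ids } },
    { $set: { type: 'PROGRAM' } },
    { multi: true }
  );

  total += ids.length;
  print("updated " + total + " documents so far");
  sleep(1000);   // give other readers/writers a turn between batches
}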

I created a collection with 20,000,000 records and ran your query on it. It took ~3 minutes to update on a virtual machine, and I could still read from the database in a separate console.

> for(var i=0;i<20000000;i++){db.testcoll.insert({"somefield":i});}

The locking in MongoDB is quite lightweight, and it is not going to be held for the whole duration of the update. Think of it as 20,000,000 separate updates. You can read more here:

http://docs.mongodb.org/manual/faq/concurrency/

Sergey Eremin
  • Is there an easy way to batch that query? Or would it be a case of a find() query, using a cursor and looping over the results, doing updates one document at a time? – Bruce Lowe Dec 23 '13 at 08:40

You do actually care if your update query is slow: it is tightly linked to the write-lock problem you are already aware of. This isn't a simple read query; you really want this write query to be as fast as possible.

Updating the "find" part is part of the key here. First, since your collection has millions of documents, it's a good idea to keep the field name size as small as possible (ideally one single character : type => t). This helps because of the schemaless nature of mongodb collections.

Second, and more importantly, you need to make your query use a proper index. For that you need to work around the $exists operator, which is not well optimized (there are actually several ways to do it).
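One possible workaround, as a sketch (the answer doesn't prescribe a specific one): a regular index on the field also indexes documents where the field is missing (they are stored under null), so a query on { type: null } can use the index, whereas $exists: false generally cannot. Note that { type: null } also matches documents where the field is explicitly set to null.

// Sketch of one possible workaround (not the only one): rely on the fact
// that a regular index stores missing fields as null entries.
db.history.ensureIndex({ type: 1 });   // createIndex() in newer shell versions

db.history.update(
  { type: null },                      // matches missing OR explicitly-null "type"
  { $set: { type: 'PROGRAM' } },
  { multi: true }
);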

Third, you can work on the field values themselves. Use http://bsonspec.org/#/specification to estimate the size of the value you want to store, and possibly pick a better choice (in your case, you could replace the 'PROGRAM' string with a numeric constant, for example, and save a few bytes per document, multiplied by the number of documents touched by each multi-update). The smaller the data you want to write, the faster the operation will be.
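Putting the first and third points together, a hedged sketch (the single-character field name t and the 1 => 'PROGRAM' mapping are illustrative assumptions, not part of your schema):

// Illustrative only: short field name ("t" instead of "type") and a small
// integer instead of the string 'PROGRAM'. The 1 => 'PROGRAM' mapping would
// have to live in your application code.
db.history.update(
  { t: null },                      // combined with the index workaround above
  { $set: { t: NumberInt(1) } },    // int32 (4 bytes) instead of the 'PROGRAM' string
  { multi: true }
);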

A few links to other questions which may inspire you:

Can MongoDB use an index when checking for existence of a field with $exists operator?

Improve querying fields exist in MongoDB

Calimero