
I have a MongoDB database with 10-15 million entries. For each of them, I have to update a field that initially does not exist. Assuming the application crashed due to an unexpected server shutdown, what is the best way to update the remaining entries?

Should I query with field: {$exists: false} and update those, or should I walk through the entire collection, check each document for the field, and update it if it's missing? My take on this is that since you can't associate an index with the existence of a field, $exists does basically the same thing. Which one would be faster, and why?

Note that the value this field is going to have depends on the other fields of a document, so I can't do a multi: true update.

Solution: As @DhruvPathak and @Sammaye suggest, whilst indexes are associated with data rather than with the fields themselves (so you can't have an index linked purely to the existence of a field), $exists can take advantage of an index on the field, and this greatly increases speed.

Additional: Although it's a bit of a side quest, I now know why the application crashed. The server timed out the cursor because it was being held open for too long (given the size of the collection). This can be avoided by using batch_size as explained here.
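A minimal pymongo sketch of the resumed migration described above; the collection/field names (`mydb.mycoll`, `new_field`) and `compute_new_value()` are hypothetical placeholders, not from the post. The `$exists: false` filter makes the run restartable after a crash, and `batch_size` refreshes the server-side cursor regularly so it is less likely to time out:

```python
def compute_new_value(doc):
    # Placeholder for the real derivation from the document's other fields.
    return doc.get("a", 0) + doc.get("b", 0)

def resume_migration(coll, batch=1000):
    updated = 0
    # Only documents still missing the field are fetched, so a crash
    # and restart simply pick up where the last run left off.
    cursor = coll.find({"new_field": {"$exists": False}}, batch_size=batch)
    for doc in cursor:
        coll.update_one({"_id": doc["_id"]},
                        {"$set": {"new_field": compute_new_value(doc)}})
        updated += 1
    return updated

if __name__ == "__main__":
    from pymongo import MongoClient  # requires a running mongod
    print(resume_migration(MongoClient().mydb.mycoll))
```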

thehousedude
  • [explain()](http://docs.mongodb.org/manual/reference/method/cursor.explain/) might help – soulcheck Aug 28 '14 at 13:26
  • 1
    You could add a "field_exists" field to your documents, default is false. In case you update a document you set the "field_exists" to true. And you can easily build an index over this field. – thomas Aug 28 '14 at 13:30
  • @soulcheck True, explain would help with the $exists query, but in the other case where I update each field manually, explain can't give me data on how efficient my loop is on updating each document. This is a pymongo script. – thehousedude Aug 28 '14 at 13:31
  • I would expect that `$exists` would be faster as the bulk of the work stays within MongoDB, but why not just test out both approaches? – JohnnyHK Aug 28 '14 at 13:36
  • 1
    @tomaszbrue That seems like a great idea, since I can index that so it should be fasterthan exists. I'll try it out. – thehousedude Aug 28 '14 at 13:36
  • @JohnnyHK I could test both approaches on the collection if I knew where I was in the cursor when the server crashed. Since I was updating each document, if I start from 0, the loop will find a lot of documents where the field already exists and just skip them. It wouldn't be a fair fight. Plus this is a pymongo script, and timing a loop in Python might not be as accurate as mongo's explain(). – thehousedude Aug 28 '14 at 13:43
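The "migration flag" idea from the comments can be sketched as follows (names are illustrative, not from the post). Seeding the flag does not depend on a document's other fields, so that part can be a single multi-update; the flag can then be indexed and used to find the documents the migration has not reached yet:

```python
def seed_flag(coll):
    # One server-side pass; documents that already carry the flag are skipped.
    coll.update_many({"migrated": {"$exists": False}},
                     {"$set": {"migrated": False}})
    coll.create_index("migrated")  # a plain ascending index on the flag

def pending_filter():
    # Query filter selecting documents the migration has not processed yet.
    return {"migrated": False}
```

The per-document update loop then sets `migrated: true` alongside the computed field, so progress is tracked by an indexed value rather than by field existence.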

3 Answers


> My take on this is that since you can't associate an index with the existence of a field

As of MongoDB 2.6, it now can:

```
> use f
switched to db f
> db.t.insert({a:1})
WriteResult({ "nInserted" : 1 })
> db.t.ensureIndex({b:1})
{
        "createdCollectionAutomatically" : false,
        "numIndexesBefore" : 1,
        "numIndexesAfter" : 2,
        "ok" : 1
}
> db.t.find({b:{$exists:false}})
{ "_id" : ObjectId("53e88a7dde0848171584d296"), "a" : 1 }
> db.t.find({b:{$exists:false}}).explain()
{
        "cursor" : "BtreeCursor b_1",
        "isMultiKey" : false,
        "n" : 1,
        "nscannedObjects" : 1,
        "nscanned" : 1,
        "nscannedObjectsAllPlans" : 1,
        "nscannedAllPlans" : 1,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 0,
        "indexBounds" : {
                "b" : [
                        [
                                null,
                                null
                        ]
                ]
        },
        "server" : "ubuntu:27017",
        "filterSet" : false
}
```

From what I can find, this seems to be something that was added as of v2.0: https://stackoverflow.com/a/7503114/383478. Unfortunately, the doc link in that answer has since died.

So querying is clearly faster with this in mind, unless you are on a version prior to v2.0.
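The same experiment as the shell session above, sketched in pymongo (database/collection names are illustrative). The key point shown by the explain() output is that a plain, non-sparse index on `b` stores missing values as null, so the `$exists: false` query can use it:

```python
def missing_filter(field):
    # Filter matching documents where `field` is absent.
    return {field: {"$exists": False}}

def exists_false_plan(coll, field="b"):
    coll.create_index(field)  # default (non-sparse) ascending index
    return coll.find(missing_filter(field)).explain()

if __name__ == "__main__":
    from pymongo import MongoClient  # requires a running mongod
    coll = MongoClient().f.t
    coll.insert_one({"a": 1})
    print(exists_false_plan(coll))  # legacy servers report a BtreeCursor here
```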

Sammaye
  • Thanks, I didn't know. Although that seems to be an ascending index. From other questions I've found, people wanted to associate an index with the actual existence of the field and not its value. It's good though that it's able to use this one too. – thehousedude Aug 28 '14 at 14:22
  • 1
    @thehousedude this would, of course, be better for key value stores, mongodb has been evaluating http://rocksdb.org/ from what I sw on their JIRA recently. That's one reason they are making a pluggable storage engine so they can give you these things like you can get in other techs like MySQL etc – Sammaye Aug 28 '14 at 14:29

An explain() appended to your query will reveal that $exists uses a cursor of type Basic, which means it will scan all the documents in the collection. Hence, you may as well walk through the entire collection, check each document for the field, and update it where it is missing: without an index on the field, the performance of $exists is similar to scanning the collection in its entirety.
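A hedged helper (not from the answer) for reading explain() output: it detects whether a query ran as a full collection scan, covering both the legacy format ("cursor": "BasicCursor") referenced in this answer and the newer winningPlan/COLLSCAN shape used by MongoDB 3.0+:

```python
def is_collection_scan(plan):
    # Legacy explain(): a BasicCursor means no index was used.
    if plan.get("cursor", "").startswith("BasicCursor"):
        return True
    # Modern explain(): a COLLSCAN stage in the winning plan.
    stage = plan.get("queryPlanner", {}).get("winningPlan", {}).get("stage")
    return stage == "COLLSCAN"
```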

vmr

Yes, you are right: in both cases MongoDB would be iterating over the whole set. Another good way to accomplish this would be to create an index on this field, use the $exists operator to select the targeted documents, and then drop the index afterwards. Make sure the index created is not sparse, as a sparse index does not include documents that lack the field and so will not help with $exists: false: http://docs.mongodb.org/manual/core/index-sparse/
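A sketch of the temporary-index workflow described above, assuming a hypothetical field name and a placeholder `compute_value()` for the field-dependent derivation:

```python
def compute_value(doc):
    # Placeholder: the real value depends on the document's other fields.
    return len(doc)

def migrate_with_temp_index(coll, field="new_field"):
    index_name = coll.create_index(field)  # must NOT be sparse
    try:
        # The index serves the $exists: false query during the migration.
        for doc in coll.find({field: {"$exists": False}}):
            coll.update_one({"_id": doc["_id"]},
                            {"$set": {field: compute_value(doc)}})
    finally:
        coll.drop_index(index_name)  # only needed for the migration itself
```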

DhruvPathak