
I have a simple, single-client setup for MongoDB and PyMongo 2.6.3. The goal is to iterate over every document in a collection (`collection` below) and update (save) each document in the process. The approach I'm using looks roughly like:

cursor = collection.find({})
index = 0
count = cursor.count()
while index != count:
    doc = cursor[index]
    print 'updating doc ' + doc['name']
    # modify doc ..
    collection.save(doc)
    index += 1
cursor.close()

The problem is that save is apparently modifying the order of documents in the cursor. For example, if my collection is made of 3 documents (ids omitted for clarity):

{
    "name": "one"
}
{
    "name": "two"
}
{
    "name": "three"
}

the above program outputs:

> updating doc one
> updating doc two
> updating doc two

If, however, the line collection.save(doc) is removed, the output becomes:

> updating doc one
> updating doc two
> updating doc three

Why is this happening? What is the right way to safely iterate and update documents in a collection?

calebds

3 Answers


Found the answer in MongoDB documentation:

Because the cursor is not isolated during its lifetime, intervening write operations on a document may result in a cursor that returns a document more than once if that document has changed. To handle this situation, see the information on snapshot mode.

Snapshot mode is enabled on the cursor, and makes a nice guarantee:

snapshot() traverses the index on the _id field and guarantees that the query will return each document (with respect to the value of the _id field) no more than once.

To enable snapshot mode with PyMongo:

cursor = collection.find(spec={}, snapshot=True)

as per PyMongo find() documentation. Confirmed that this fixed my problem.
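To make the quoted explanation concrete, here is a toy simulation in plain Python; it is not MongoDB itself. It assumes two things: that indexing a PyMongo cursor (`cursor[index]`) re-runs the query with a skip instead of reading from a fixed result set, and that saving a document which has grown can relocate it to the end of the collection's natural order. The names `save_grown`, `iterate_by_index`, and `relocating` are made up for illustration.

```python
# Toy model only: a list stands in for the collection's natural order.
# Assumption: a save that makes a document grow moves it to the end, and
# each positional lookup sees the collection as it is at that moment.


def save_grown(storage, doc):
    """Simulate saving a document that outgrew its slot: it moves to the end."""
    storage.remove(doc)
    storage.append(doc)


def iterate_by_index(storage, relocating):
    """Visit documents by position, saving each one; return the names visited."""
    visited = []
    count = len(storage)  # like cursor.count(), taken once up front
    for index in range(count):
        doc = storage[index]  # like cursor[index]: a fresh positional lookup
        visited.append(doc["name"])
        if doc["name"] in relocating:
            save_grown(storage, doc)  # this particular save relocates the document
    return visited


storage = [{"name": "one"}, {"name": "two"}, {"name": "three"}]
print(iterate_by_index(storage, relocating={"two"}))  # ['one', 'two', 'two']
```

With only "two" relocating, the sketch reproduces the output from the question: "two" moves behind "three" after it is saved, so the lookup at position 2 finds it again and "three" is never visited.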

calebds
  • Looks like `snapshot` was deprecated in MongoDB 3.6 and removed in MongoDB 4.0. I haven't found a solution without `snapshot` in the `find()` [documentation](http://api.mongodb.com/python/current/api/pymongo/collection.html#pymongo.collection.Collection.find). – josh Nov 29 '18 at 21:24
  • Found a working solution here: https://stackoverflow.com/q/12589792/828394. – josh Nov 29 '18 at 21:49

Snapshot does the work.

But in PyMongo 2.9 and later, the syntax is slightly different:

cursor = collection.find(modifiers={"$snapshot": True})

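or, independent of the modifiers argument, with the query wrapped explicitly so that $snapshot is read as a query modifier rather than as a filter field:

cursor = collection.find({"$query": {}, "$snapshot": True})

as per the PyMongo documentation.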
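On servers where `$snapshot` is not available (a comment above notes it was removed in MongoDB 4.0), one possible substitute is to walk the collection in ascending `_id` order and resume strictly after the last `_id` seen. The sketch below is an assumption-laden illustration, not an official replacement: `iter_in_id_order` and `batch_size` are hypothetical names, and it relies only on `find()`, `sort()`, and `limit()`.

```python
def iter_in_id_order(collection, batch_size=100):
    """Yield documents in ascending _id order, each one at most once."""
    last_id = None
    while True:
        # Resume strictly after the last _id seen. An update never changes
        # _id, so a modified document cannot reappear later in the walk.
        query = {} if last_id is None else {"_id": {"$gt": last_id}}
        batch = list(collection.find(query).sort("_id", 1).limit(batch_size))
        if not batch:
            return
        for doc in batch:
            yield doc
        last_id = batch[-1]["_id"]
```

Inside a `for doc in iter_in_id_order(collection):` loop you would modify `doc` and write it back, for example with `collection.save(doc)` on the 2.x series used in the question or `collection.replace_one({"_id": doc["_id"]}, doc)` on PyMongo 3 and later. Documents inserted during the walk with a larger `_id` will also be visited.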

Dhiresh Jain

I couldn't recreate your situation, but off the top of my head: because fetching the results the way you're doing it gets them one by one from the database, you may effectively be creating more as you go (saving, then fetching the next one).

You can try holding the results in a list (that way you're fetching all results at once, which might be heavy, depending on your query):

cursor = collection.find({})
results = list(cursor)  # fetch every document up front; replaces cursor.count()
cursor.close()
for res in results:  # iterating the list removes the need for an index counter
    # 'res' holds the current record, so cursor[index] is not needed
    print('updating doc ' + res['name'])
    # modify doc ..
    collection.save(res)
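If holding every full document in memory is too heavy, a lighter variant of the same idea is to materialize only the `_id` values and fetch each document just before updating it. This is a minimal sketch under assumptions: `update_each` and `modify` are hypothetical names, and `replace_one` assumes PyMongo 3 or later (on the 2.x series, `collection.save(doc)` would take its place).

```python
def update_each(collection, modify):
    """Apply modify(doc) to every document that existed when the loop started."""
    # Keep only the _id values up front; they are small and never change.
    ids = [d["_id"] for d in collection.find({}, {"_id": 1})]
    updated = 0
    for _id in ids:
        doc = collection.find_one({"_id": _id})
        if doc is None:  # the document was deleted in the meantime
            continue
        modify(doc)  # caller-supplied function that edits doc in place
        collection.replace_one({"_id": _id}, doc)
        updated += 1
    return updated
```

Because the list of ids is fixed before any write happens, later saves cannot cause a document to be visited twice or skipped.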

Hope it helps

wilfo