
In MongoDB, a read operation on a collection returns a cursor.

If the read operation is accessing most of the documents in the collection, it may interleave with other update operations.

In that case, is it possible that the cursor will return duplicate documents?

How can I make sure that the cursor avoids duplicates?

Ashish
  • You could use the 'distinct' method; check this out: http://stackoverflow.com/questions/5089162/mongodb-get-distinct-records – ogres Mar 06 '13 at 07:33

1 Answer


The distinct method will not be of much help here. This is not a problem that function can solve; not only that, but it runs at a fraction of the speed of a normal cursor.

If the read operation is accessing most of the documents in the collection, it may interleave with other update operations.

It is possible if the documents move in such a manner that, given the sort order of the cursor, they get read again.

Whether this is a problem or not depends on what you sort by. If you are sorting by something that won't be updated, for example _id, then you don't really need to worry. However, if you are sorting by something that will be updated and could shift, then yes, you will have a problem.

One method of solving this is to look at the last _id in that iteration of the cursor, reading the cursor into batches of 1000 in an array or something. After you have the last _id in that batch, you range, taking everything greater than that _id.
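
A minimal sketch of this _id-range batching in Python with pymongo; the connection string, database/collection names, batch size, and the process() helper are all hypothetical placeholders:

```python
from pymongo import MongoClient, ASCENDING

# Hypothetical connection, database and collection names.
client = MongoClient("mongodb://localhost:27017")
coll = client["mydb"]["mycoll"]

BATCH_SIZE = 1000  # size of each _id-range batch


def process(doc):
    """Placeholder for whatever per-document work you need to do."""
    print(doc["_id"])


last_id = None
while True:
    # Sort by the immutable _id and only take documents greater than the
    # last _id already seen, so a document that moves on disk cannot be
    # returned a second time by a later batch.
    query = {} if last_id is None else {"_id": {"$gt": last_id}}
    batch = list(coll.find(query).sort("_id", ASCENDING).limit(BATCH_SIZE))
    if not batch:
        break
    for doc in batch:
        process(doc)
    last_id = batch[-1]["_id"]
```

Because _id is always indexed and only ever queried with a greater-than range here, each batch is an index-backed range scan rather than a full collection walk.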

Another method would be to do snapshot queries: http://docs.mongodb.org/manual/reference/operator/snapshot/ However, this operator has quite a few limitations; for example, it cannot be used with sharded collections.
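
For completeness, a rough sketch of the snapshot approach, assuming a 2.x-era pymongo driver (which exposed a snapshot flag on find()); both that flag and the underlying $snapshot operator were removed in later driver and server versions:

```python
from pymongo import MongoClient

# Hypothetical connection, database and collection names; assumes an old
# (pymongo 2.x) driver where find() accepted snapshot=True. $snapshot
# cannot be used on sharded collections.
client = MongoClient("mongodb://localhost:27017")
coll = client["mydb"]["mycoll"]

for doc in coll.find({}, snapshot=True):
    print(doc["_id"])
```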

Sammaye
  • Additional: I have found that "taking everything greater than the previous _id" is (very) fast compared to using other criteria. This must be because Mongo is able to use a btree cursor while searching for the records. – Bhagwan Parge Sep 27 '18 at 14:15
  • @bhagwanparge No, it is because of the formulation of the _id index. While it is true that the btree $min and $max can be used, the _id is also incrementing by time, and this makes it very easy to sort the documents by greater/less than – Sammaye Sep 27 '18 at 15:12
  • In Mongo, all _id fields are indexed by default, so they must be using the same btree cursor thing. – Bhagwan Parge Sep 28 '18 at 18:06
  • @bhagwanparge No, only the top-level _id field is indexed, the one MongoDB actually creates; if you were to make a subdocument with an _id field, that would not be indexed by default. However, regardless, the index is not actually what makes this so fast; it is the nature of the value of the index. If you used a rand() value for _id, it would not have the same performance as the one formulated by MongoDB by default – Sammaye Sep 28 '18 at 18:18
  • @bhagwanparge it is in fact the nature of the value that means the MongoDB binary only needs to load a small fraction of the _id index to understand if what you insert is unique, rather than having to load the entire index – Sammaye Sep 28 '18 at 18:19