
We version most of our collections in MongoDB. The versioning mechanism we selected is as follows:

{  "docId" : 174, "v" : 1,  "attr1": 165 }   /*version 1 */
{  "docId" : 174, "v" : 2,  "attr1": 165, "attr2": "A-1" } 
{  "docId" : 174, "v" : 3,  "attr1": 184, "attr2" : "A-1" }

So, when we perform our queries, we always need to use the aggregation framework in this way to ensure we get the latest versions of our objects:

db.docs.aggregate( [
    {"$sort":{"docId":-1,"v":-1}},
    {"$group":{"_id":"$docId","doc":{"$first":"$$ROOT"}}},
    {"$match":{<query>}}
] );

The problem with this approach is that once you have done your grouping, you have a set of data in memory that has nothing to do with your collection, and thus your indexes cannot be used.

As a result, the more documents your collection has, the slower the query gets.
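For reference, running the pipeline with the explain option makes this visible (a minimal sketch in the mongo shell; the attr2 predicate below is just an example, not our real query):

// Sketch: ask the server how it plans the pipeline. The index below can only
// serve the initial $sort; everything after $group operates on transformed
// documents, so the example $match cannot use any index.
db.docs.createIndex({ "docId": -1, "v": -1 });

db.docs.aggregate(
    [
        { "$sort":  { "docId": -1, "v": -1 } },
        { "$group": { "_id": "$docId", "doc": { "$first": "$$ROOT" } } },
        { "$match": { "doc.attr2": "A-1" } }  // example predicate only
    ],
    { explain: true }
);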

Is there any way to speed this up?

If not, I will consider moving to one of the approaches described in this good post: http://www.askasya.com/post/trackversions/

  • Why don't you use $match as the first stage? – Daniele Tassone Nov 03 '16 at 18:23
  • Add an index to the docId field of your document. – Parshuram Kalvikatte Nov 03 '16 at 18:38
  • @DanieleTassone I am afraid that is not an option. The explanation is in the link I provided. Basically, if you filter at the beginning, you will end up with versions which are not the latest, but the sort-group phase will consider them as such. It is a common error when performing versioning like this. – jbernal Nov 04 '16 at 09:02
  • @Parshuram Adding an index to docId would speed up the group operation but not the following $match, wouldn't it? – jbernal Nov 04 '16 at 09:03
  • @jbernal I saw the link with the details. The most efficient way is the one explained there (db.docs.find({"docId":174}).sort({"v":-1}).limit(-1);), where you get exactly what you want. This works fine if you need one document. If you need several documents at the same time, it is a different story: this is something I have not understood; can you explain it better? There are different solutions, but I should understand the use case better. Also, can we consider MongoDB 3.4? – Daniele Tassone Nov 04 '16 at 10:23
  • @DanieleTassone I meant the case of retrieving several documents. Imagine you want to perform a normal query to find documents that match a particular field. Then you need to perform the aforementioned aggregate operation, which gets really slow as the number of documents increases. I am considering moving to choice 3 explained in the link, which is also revisited in the follow-up post. MongoDB 3.4 is not an option, I am afraid. – jbernal Nov 04 '16 at 11:20
  • @jbernal OK, then can you tell me which MongoDB version you can use? – Daniele Tassone Nov 04 '16 at 14:55
  • @DanieleTassone Sure, MongoDB 3.2, but still not on WiredTiger. We are planning to upgrade shortly and get some benefit from the performance improvement. – jbernal Nov 04 '16 at 15:27
  • @jbernal OK, I will think about it. I now understand the point about the $match clause at the top; you're right that you can't do it unless you are only looking for a specific _id. If you are making a "complex query" you can't. So the question is: what are you doing in this $match? Just querying by ids should be fine. – Daniele Tassone Nov 04 '16 at 16:53
  • @DanieleTassone Not always. I might want to find documents whose name is a certain string. For this reason, I want to be able to filter by any field in $match. I've done some research and I don't think it is possible. I will move to the third choice, I think. – jbernal Nov 04 '16 at 19:30

1 Answer


Just to complete this question: we went with option 3, one collection to keep the latest versions and one collection to keep the historical ones. It is introduced here: http://www.askasya.com/post/trackversions/ and some further description (with some nice code snippets) can be found at http://www.askasya.com/post/revisitversions/.
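For anyone curious, the write path looks roughly as follows. This is only a sketch: the collection names docs_latest and docs_history are made up for this example, and since we have no multi-document transactions we archive the history copy first, so a failure leaves at worst a duplicate history entry rather than a lost version.

// Sketch of the two-collection write path (mongo shell; names are hypothetical).
var current = db.docs_latest.findOne({ "_id": 174 });
if (current) {
    // Archive the current version, keyed by { docId, v } so the default
    // _id index covers history reads out of the box.
    db.docs_history.insert({
        "_id": { "docId": current._id, "v": current.v },
        "attr1": current.attr1,
        "attr2": current.attr2
    });
    // Bump the version and apply the change; matching on the old "v"
    // guards against a concurrent writer racing us.
    db.docs_latest.update(
        { "_id": 174, "v": current.v },
        { "$set": { "attr1": 184 }, "$inc": { "v": 1 } }
    );
}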

It has been running in production for 6 months now. So far, so good. The former approach meant we were always using the aggregation framework, which moves away from indexes as soon as you reshape the original documents (using $group, $project, ...), because the pipeline output no longer matches the original collection. This was making our performance terrible as the data grew.

With the new approach, though, the problem is gone. 90% of our queries go against the latest data, which means we target a collection with a simple ObjectId as identifier, and we no longer require the aggregation framework, just regular finds.
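In shell terms the difference is simply this (a sketch; the attr2 index is an example, not our actual schema):

// Latest-only collection: plain indexed finds replace the whole pipeline.
db.docs_latest.createIndex({ "attr2": 1 });
db.docs_latest.find({ "attr2": "A-1" });    // served by the index above
db.docs_latest.findOne({ "_id": 174 });     // plain _id lookup, no version juggling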

Our queries against historical data always include id and version, so by indexing these (we include both in _id, so we get the index out of the box), reads against those collections are equally fast. This is a point not to overlook, though: read patterns in your application are crucial when designing what your collections/schemas should look like in MongoDB, so you must make sure you know them when making such decisions.
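A history read then looks like this (again a sketch with the hypothetical names from above; note that matching an embedded _id requires the exact field order used at insert time):

// Exact version lookup, served by the default _id index on the compound { docId, v }.
db.docs_history.findOne({ "_id": { "docId": 174, "v": 2 } });

// All versions of one document, still on the _id index: embedded documents
// compare field by field (docId first, then v), so a range over _id works.
db.docs_history.find({
    "_id": { "$gte": { "docId": 174, "v": MinKey }, "$lte": { "docId": 174, "v": MaxKey } }
}).sort({ "_id": -1 });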
