3

I'm trying to look up all records that match a certain condition, in this case _id being certain values, and then return only the top 2 results, sorted by the name field.

This is what I have

db.getCollection('col1').aggregate([
    {$match: {fk: {$in: [1, 2]}}},
    {$sort: {fk: 1, name: -1}},
    {$group: {_id: "$fk", items: {$push: "$$ROOT"} }},
    {$project: {items: {$slice: ["$items", 2]} }}
])

and it works, BUT, it's not guaranteed. According to this Mongo thread $group does not guarantee document order.

This would also mean that all of the suggested solutions here and elsewhere, which recommend using $unwind, followed by $sort, and then $group, would also not work, for the same reason.

What is the best way to accomplish this with Mongo (any version)? I've seen suggestions that this could be accomplished in the $project phase, but I'm not quite sure how.

buræquete
  • 14,226
  • 4
  • 44
  • 89
Adam Rackis
  • 82,527
  • 56
  • 270
  • 393
  • 1
    Just for tests sake would it be possible to include some example data? Also would $sort after $group work? – FabioCosta Sep 23 '19 at 15:28
  • Data are just some objects with `fk` and `name` fields. Sort after group would be great - is it possible to sort nested arrays? All answers I've seen involve $unwind followed by group, which runs into the very problems I'm trying to solve here – Adam Rackis Sep 23 '19 at 15:43
  • 1
    OK I couldn't find any solution that didn't ended up on group. – FabioCosta Sep 25 '19 at 11:36
  • 1
    I don't understand the issue, the nested `items` array will always be correctly sorted with `name: -1` since the `$group` does not keep order of the resulting objects, but the `$push` will keep the previous order of `name` fields within the `items` array. Just removing `fk` sort from the pre-`$group` stage to post-`$group` stage should work. Like [here](https://mongoplayground.net/p/dLQ5KHgTpHr), `items` arrays will always have the correct order, even though the main result set order is random. You can enforce that with my suggestion, like [here](https://mongoplayground.net/p/0Qd2j6i-iVB) – buræquete Sep 30 '19 at 01:32
  • 1
    I think you might have been confused on that documentation point. Whilst true that `$group` does not guarantee any order of output for the resulting "grouped" documents, when you do something as you have which is to `$sort` **before** you actually `$group` then **items will be added to the array in the same sorted order**. That is naturally **guaranteed**, otherwise there simply would be little point in implementing the `$sort`. – Neil Lunn Sep 30 '19 at 01:34
  • 1
    So `{ "a": 1, "b": 2 }, { "a": 1, "b": 1 }, { "a": 2, "b": 3 }` would indeed *possibly* collect so that for an `_id: "$a"` grouping key might be in a different order, i.e 2 before 1 on "grouping" but if you did something like `[{ "$sort": { "a": 1, "b": 1 }},{ "$group": { "_id": "$a", "list": { "$push": "$b" } } }]` then the `list` is **always** in the sorted order, even if the `_id` key values are not. Nothing would cause the "pushed" array elements to be in any unexpected order after a `$sort` was initially applied. – Neil Lunn Sep 30 '19 at 01:37
  • 1
    Also FYI, this would actually not be the best approach to a "limitted push result". If you really want arrays with a limited number of items, the better option is a "self join". See [mongodb group values by multiple fields](https://stackoverflow.com/a/22935461/2313887) where I actually cover this right at the top of the response. – Neil Lunn Sep 30 '19 at 01:47
  • This is based on a misunderstanding of the linked comment. You are sorting documents as they go into $group stage so they are guaranteed to be ordered inside the $group array. There is no need to sort the array. However, as of 3.6 there's a faster way to guarantee correct result (I'll add as an answer) – Asya Kamsky Jul 07 '20 at 18:24

3 Answers3

8

You are correct in saying that the result of $group is never sorted.

$group does not order its output documents.

Hence doing a;

{$sort: {fk: 1}}

then grouping with

{$group: {_id: "$fk", ... }}, 

will be a wasted effort.


But there is a silver lining with sorting before $group stage with name: -1. Since you are using $push (not an $addToSet), inserted objects will retain the order they've had in the newly created items array in the $group result. You can see this behaviour here (copy of your pipeline)

The items array will always have;

"items": [
  {
    ..
    "name": "Michael"
  },
  {
    ..
    "name": "George"
  }
]

in same order, therefore your nested array sort is a non-issue! Though I am unable to find an exact quote in documentation to confirm this behaviour, you can check;

  • this,
  • or this where it is confirmed.
  • Also, accumulator operator list for $group, where $addToSet has "Order of the array elements is undefined." in its description, whereas the similar operator $push does not, which might be an indirect evidence? :)

Just a simple modification of your pipeline where you move the fk: 1 sort from pre-$group stage to post-$group stage;

db.getCollection('col1').aggregate([
    {$match: {fk: {$in: [1, 2]}}},
    {$sort: {name: -1}},
    {$group: {_id: "$fk", items: {$push: "$$ROOT"} }},
    {$sort: {_id: 1}},
    {$project: {items: {$slice: ["$items", 2]} }}
])

should be sufficient to have the main result array order fixed as well. Check it on mongoplayground

buræquete
  • 14,226
  • 4
  • 44
  • 89
  • Great answer, and it makes sense! So it would seem that $group is guaranteed to ***PROCESS*** the ***INPUT*** docs in the same order, which would also guarantee the push'd arrays have the correct order; it's just that the output groups will be in non-deterministic order. If you can find that documented somewhere that'd be a huge bonus, but otherwise I think I'm happy. – Adam Rackis Sep 30 '19 at 16:34
  • @AdamRackis I am glad, yet I cannot find an exact detailing of the ordered process of `$group` stage, but you can guess that is the case from such operators like [`$first`](https://docs.mongodb.com/manual/reference/operator/aggregation/first/) and [`$last`](https://docs.mongodb.com/manual/reference/operator/aggregation/last/), for those to work, `$group` operation has to respect the order of previous stage isn't it (or maybe only for some operators)? In their documentation, there is this quote `"Only meaningful when documents are in a defined order."`, which is another evidence. – buræquete Sep 30 '19 at 23:30
  • while the answer is correct, the proposed pipeline is still very inefficient and would fail when there is a very large number of documents to push into the items array in $group. (see my answer for how to avoid that - and get the results much faster to boot) – Asya Kamsky Jul 09 '20 at 18:11
1

$group doesn't guarantee the document order but it would keep the grouped documents in the sorted order for each bucket. So in your case even though the documents after $group stage are not sorted by fk but each group (items) would be sorted by name descending. If you would like to keep the documents sorted by fk you could just add the {$sort:{fk:1}} after $group stage

You could also sort by order of values passed in your match query should you need by adding a extra field for each document. Something like

db.getCollection('col1').aggregate([
    {$match: {fk: {$in: [1, 2]}}},
    {$addField:{ifk:{$indexOfArray:[[1, 2],"$fk"]}}},
    {$sort: {ifk: 1, name: -1}},
    {$group: {_id: "$ifk", items: {$push: "$$ROOT"}}},
    {$sort: {_id : 1}},
    {$project: {items: {$slice: ["$items", 2]}}}
])

Update to allow array sort without group operator : I've found the jira which is going to allow sort array.

You could try below $project stage to sort the array.There maybe various way to do it. This should sort names descending.Working but a slower solution.

{"$project":{"items":{"$reduce":{
  "input":"$items",
  "initialValue":[],
  "in":{"$let":{
    "vars":{"othis":"$$this","ovalue":"$$value"},
    "in":{"$let":{
      "vars":{
        //return index as 0 when comparing the first value with initial value (empty) or else return the index of value from the accumlator array which is closest and less than the current value.
        "index":{"$cond":{
          "if":{"$eq":["$$ovalue",[]]},
          "then":0,
          "else":{"$reduce":{
            "input":"$$ovalue",
            "initialValue":0,
            "in":{"$cond":{
              "if":{"$lt":["$$othis.name","$$this.name"]},
              "then":{"$add":["$$value",1]},
              "else":"$$value"}}}}
        }}
      },
      //insert the current value at the found index
      "in":{"$concatArrays":[
          {"$slice":["$$ovalue","$$index"]},
          ["$$othis"],
          {"$slice":["$$ovalue",{"$subtract":["$$index",{"$size":"$$ovalue"}]}]}]}
    }}}}
}}}}

Simple example with demonstration how each iteration works

db.b.insert({"items":[2,5,4,7,6,3]});

othis   ovalue      index      concat arrays (parts with counts)       return value
2       []          0           [],0            [2]     [],0           [2]
5       [2]         0           [],0            [5]     [2],-1         [5,2]          
4       [5,2]       1           [5],1           [4]     [2],-1         [5,4,2]
7       [5,4,2]     0           [],0            [7]     [5,4,2],-3     [7,5,4,2]
6       [7,5,4,2]   1           [7],1           [6]     [5,4,2],-3     [7,6,5,4,2]
3       [7,6,5,4,2] 4           [7,6,5,4],4     [3]     [2],-1         [7,6,5,4,3,2]

Reference - Sorting Array with JavaScript reduce function

s7vr
  • 73,656
  • 11
  • 106
  • 127
  • `$group doesn't guarantee the document order but it would keep the grouped documents in the sorted order for each bucket` - if that's true, then I believe my entire question would unnecessary—which would be great. Do you have some sort of documentation tho, to show that the buckets would stay in sorted order? The original Mongo thread I linked, above, seems to show pretty clearly that `$group doesn't group by the order of input documents` which is the whole point, here :( – Adam Rackis Sep 29 '19 at 14:40
  • 1
    Also, I appreciate the huge code sample showing how to sort the array manually, with Reduce, but I quickly abandoned that possibility early on: it's clearly an O(N^2) operation, right, since you have nested reduce's? I'm not against manual sorting in the project stage, I'm just hoping for something more performant. – Adam Rackis Sep 29 '19 at 14:41
  • 1
    I think myself and others are actually reading the question as *"How to guarantee the order of the `$push`'ed items"*. So I believe the question incorrectly presumes the commentary on *"`$group` does not maintain order"* actually has that effect on the arrays **within** the grouping. So whilst an "inline sort" would be *nice* ( I have been complaining about this for years ) it's not actually necessary since `$group` will not affect the array in the way the question asks/presumes. – Neil Lunn Sep 30 '19 at 01:43
  • Just adding my comment on what others have noted, There are couple of things. One is sort option for array which will be supported in near future (more in the linked jira issue) as unwind solution doesn’t scale for large arrays. The jira issue you linked talking about maintaining the order of output documents not accumulator like ($first, $last and $push with the exception of $addToSet as it is set) all takes the sort order from previous sort or those accumulators will not make any sense. – s7vr Sep 30 '19 at 02:15
  • Also noticed the docs for push doesn't mention anything about the sort order like they do for $first/$last. I've created a docs update request [DOCS-13063](https://jira.mongodb.org/browse/DOCS-13063) to add the sort behavior. – s7vr Sep 30 '19 at 16:14
  • 1
    Hey just wanted to thank you for all the great info, here, +1. The other answer zero'd in on what I was really looking for, so I gave that person the bounty. I hope there's no hard feelings! Hopefully others will come along and give you some more upvotes. – Adam Rackis Oct 02 '19 at 02:20
  • you are all correct. this isn't efficient but the question assumes facts not in evidence. :) – Asya Kamsky Jul 09 '20 at 18:09
1

There is a bit of a red herring in the question as $group does guarantee that it will be processing incoming documents in order (and that's why you have to sort of them before $group to get an ordered arrays) but there is an issue with the way you propose doing it, since pushing all the documents into a single grouping is (a) inefficient and (b) could potentially exceed maximum document size.

Since you only want top two, for each of the unique fk values, the most efficient way to accomplish it is via a "subquery" using $lookup like this:

db.coll.aggregate([
 {$match: {fk: {$in: [1, 2]}}},
 {$group:{_id:"$fk"}}, 
 {$sort: {_id: 1}},
 {$lookup:{
      from:"coll", 
      as:"items", 
      let:{fk:"$_id"},
      pipeline:[ 
           {$match:{$expr:{$eq:["$fk","$$fk"]}}}, 
           {$sort:{name:-1}},
           {$limit:2}, 
           {$project:{_id:0, fk:1, name:1}}
      ]
 }}
])

Assuming you have an index on {fk:1, name:-1} as you must to get efficient sort in your proposed code, the first two stages here will use that index via DISTINCT_SCAN plan which is very efficient, and for each of them, $lookup will use that same index to filter by single value of fk and return results already sorted and limited to first two. This will be the most efficient way to do this at least until https://jira.mongodb.org/browse/SERVER-9377 is implemented by the server.

Asya Kamsky
  • 41,784
  • 5
  • 109
  • 133
  • I guess you don't even need the $project inside the $lookup pipeline since with $push of "$$ROOT" you get the entire document, so just leave that off if not needed. – Asya Kamsky Jul 09 '20 at 18:17