28

Assuming I have a collection called "posts" (in reality it is a more complex collection, posts is too simple) with the following structure:

> db.posts.find()

{ "_id" : ObjectId("50ad8d451d41c8fc58000003"), "title" : "Lorem ipsum", "author" : 
"John Doe", "content" : "This is the content", "tags" : [ "SOME", "RANDOM", "TAGS" ] }

I expect this collection to span hundreds of thousands, perhaps millions, that I need to query for posts by tags and group the results by tag and display the results paginated. This is where the aggregation framework comes in. I plan to use the aggregate() method to query the collection:

db.posts.aggregate([
  { "$unwind" : "$tags" },
  { "$group" : {
      _id: { tag: "$tags" },
      count: { $sum: 1 }
  } }
]);

The catch is that to create the paginator I would need to know the length of the output array. I know that to do that you can do:

db.posts.aggregate([
  { "$unwind" : "$tags" },
  { "$group" : {
      _id: { tag: "$tags" },
      count: { $sum: 1 }
  } }
  { "$group" : {
      _id: null,
      total: { $sum: 1 }
  } }
]);

But that would discard the output from previous pipeline (the first group). Is there a way that the two operations be combined while preserving each pipeline's output? I know that the output of the whole aggregate operation can be cast to an array in some language and have the contents counted but there may be a possibility that the pipeline output may exceed the 16Mb limit. Also, performing the same query just to obtain the count seems like a waste.

So is obtaining the document result and count at the same time possible? Any help is appreciated.

Krunal Sonparate
  • 1,122
  • 10
  • 29
MervS
  • 5,724
  • 3
  • 23
  • 37
  • 2
    Do you really need a completely accurate total count, or would an approximation do? Then again, it looks like you're counting all posts so isn't that just a count() operation you can do? – cirrus Nov 23 '12 at 13:48
  • I am actually doing a count on a group of posts so count() will not do. – MervS Nov 24 '12 at 07:10
  • perfect solution to obtain the total while preserving the result in aggregation pipe http://stackoverflow.com/a/39784851/3666966 – Kanak Singhal Sep 30 '16 at 06:23

2 Answers2

34
  1. Use $project to save tag and count into tmp
  2. Use $push or addToSet to store tmp into your data list.

Code:

db.test.aggregate(
    {$unwind: '$tags'}, 
    {$group:{_id: '$tags', count:{$sum:1}}},
    {$project:{tmp:{tag:'$_id', count:'$count'}}}, 
    {$group:{_id:null, total:{$sum:1}, data:{$addToSet:'$tmp'}}}
)

Output:

{
    "result" : [
            {
                    "_id" : null,
                    "total" : 5,
                    "data" : [
                            {
                                    "tag" : "SOME",
                                    "count" : 1
                            },
                            {
                                    "tag" : "RANDOM",
                                    "count" : 2
                            },
                            {
                                    "tag" : "TAGS1",
                                    "count" : 1
                            },
                            {
                                    "tag" : "TAGS",
                                    "count" : 1
                            },
                            {
                                    "tag" : "SOME1",
                                    "count" : 1
                            }
                      ]
              }
      ],
      "ok" : 1
}
Chien-Wei Huang
  • 1,773
  • 1
  • 17
  • 27
  • I am not aware that you can include multiple fields within a new field in the `$project` pipeline. This is exactly what I need. Thanks. – MervS Nov 24 '12 at 07:12
  • 2
    Is it possible to $sort, $skip or $limit the data using this method? You can't limit the data before $project, and I don't see a way to do it after without losing the 'total'. Also, {$sort:{'data.count': 1}} doesn't seem to work as it typically does in non-aggregate queries. – Michael DePetrillo Aug 11 '14 at 15:14
  • 1
    @MichaelDePetrillo, check this out https://stackoverflow.com/questions/20348093/mongodb-aggregation-how-to-get-total-records-count – gmoksh May 01 '18 at 06:50
3

I'm not sure you need the aggregation framework for this other than counting all the tags eg:

db.posts.aggregate(
  { "unwind" : "$tags" },
  { "group" : {
      _id: { tag: "$tags" },
      count: { $sum: 1 }
  } }
);

For paginating through per tag you can just use the normal query syntax - like so:

db.posts.find({tags: "RANDOM"}).skip(10).limit(10)
Ross
  • 17,861
  • 2
  • 55
  • 73
  • That would work if I am searching for all posts with tag "RANDOM" but I am actually grouping posts by their tags and the resulting group is what is needing paging. – MervS Nov 24 '12 at 07:14