I have a mongodb collection like this:

{"uid": "01370mask4",
 "title": "hidden",
 "post": "hidden",
 "postTime": "01-23, 2016",
 "unixPostTime": "1453538601",
 "upvote": [2, 3]}

and I'd like to select the post records from users with more than 5 posts. The structure should stay the same; I just don't need the documents from users who have few posts.

db.collection.aggregate(
   [
     { $group : { _id : "$uid", count: { $sum: 1 } } }
   ]
)

Now I'm stuck on how to use the count values in a find. I searched but didn't find any way to add the count values back to the same collection by uid. Saving the aggregation output and joining it with the original collection doesn't seem to be supported by MongoDB. Please advise, thanks!

Update:

Sorry that I didn't make it clear earlier. Thanks for your prompt answers! I want a subset of the original collection, with post text, post timestamp, etc. I don't want a subset of the aggregation output.

leoce
  • I am not clear on the correct field names in your schema, so I just used some sample fields in my answer... – zangw Feb 29 '16 at 07:26
  • can you provide a sample input document and the required output that you want? – Sarath Nair Feb 29 '16 at 08:01
  • I think your original question is very clear; the later update is confusing. How can you get the post details with the aggregate method? Do you want to select the posts of users who have more than 5 posts? – Frank Fang Feb 29 '16 at 08:04
  • @SarathNair thanks for your suggestion, I've updated it. – leoce Feb 29 '16 at 08:17
  • @FrankFang yes, I want a subset of the collection, which includes the post records of the users who have more than 5 posts. – leoce Feb 29 '16 at 08:17

3 Answers

2

Just add a $match stage after your $group with the right condition and it works:

db.collection.aggregate(
  [
    { $group : { _id : "$uid", count: { $sum: 1 } } },
    { $match : { count : { $gt : 5 } } }
  ]
)
throrin19
2

If there aren't millions of documents, you can take a shortcut and get what you want with one aggregate query followed by a find query.

Aggregate query:

var users = db.collection.aggregate(
  [
    {$group:{_id:'$uid', count:{$sum:1}}},
    {$match:{count:{$gt:5}}},
    {$group:{_id:null,users:{$push:'$_id'}}}
  ]
).toArray()[0]['users']

Then it's a straightforward query to find the documents of those users:

db.collection.find({uid: {$in: users}})
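
To make the two steps concrete, here is a rough plain-JavaScript sketch of the same logic over an in-memory array. It does not talk to MongoDB at all: the function name `postsFromActiveUsers`, the `sample` array, and its values are made up for illustration. It only shows that counting per uid and then filtering by the resulting set leaves the original documents untouched.

```javascript
// Illustration only: plain JavaScript, no MongoDB driver involved.
// `posts` stands in for the collection; the sample data below is hypothetical.
function postsFromActiveUsers(posts, minPosts) {
  // Step 1: count posts per uid (the role of $group with $sum).
  const counts = new Map();
  for (const p of posts) {
    counts.set(p.uid, (counts.get(p.uid) || 0) + 1);
  }
  // Step 2: collect the uids with more than minPosts posts (the role of $match).
  const active = new Set();
  for (const [uid, n] of counts) {
    if (n > minPosts) active.add(uid);
  }
  // Step 3: keep the original documents as they are (the role of find with $in).
  return posts.filter(p => active.has(p.uid));
}

const sample = [
  { uid: "a", title: "t1" },
  { uid: "a", title: "t2" },
  { uid: "a", title: "t3" },
  { uid: "b", title: "t4" }
];

// With a threshold of 2, only the three posts of user "a" are kept.
const kept = postsFromActiveUsers(sample, 2);
console.log(kept.length);
```

The returned objects are the same ones that went in, so every field (post text, timestamps, upvotes) is preserved, which is what the find with $in gives you on the real collection.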
Sarath Nair
  • Thanks a lot! I didn't know the `toArray` method and it worked! I have 18M documents, and after the aggregation and find I got 9M of them. I used `allowDiskUse: true` during the aggregation, otherwise it failed with a "memory exceeded" errmsg. It took a while, not fast, but it definitely solved my problem. Thanks a lot! – leoce Mar 03 '16 at 04:00
1

Try this to select users with more than 5 posts. If each uid is unique, you can keep the original fields by adding them to the $group with $first, as below.

db.collection.aggregate([
     {$group: {
          _id: '$uid', 
          title: {$first: '$title'}, 
          post: {$first:'$post'}, 
          postTime:{$first: '$postTime'}, 
          unixPostTime:{$first: '$unixPostTime'},
          upvote:{$first: '$upvote'}, 
          count: {$sum: 1}
     }}, 
     {$match: {count: {$gt: 5}}}
])

If there are multiple documents for the same uid, you should $push the values into an array in the $group instead.
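
A possible way to get flat documents back after a $push is to unwind the array again at the end of the same pipeline. The following is only a sketch under some assumptions: it needs MongoDB 3.4 or later for `$replaceRoot`, the field name `posts` and the constant `MIN_POSTS` are made up here, and each grouped document (one user with all of their posts) has to stay under the 16 MB BSON document limit. The pipeline is written as a plain array so the snippet runs on its own; the shell call is shown in a comment.

```javascript
// Sketch only: builds the pipeline as data; nothing here contacts a database.
// MIN_POSTS and the "posts" field name are illustrative, not from the question.
const MIN_POSTS = 5;

const pipeline = [
  // Collect every full document of a user into an array and count them.
  { $group: { _id: "$uid", count: { $sum: 1 }, posts: { $push: "$$ROOT" } } },
  // Keep only users with more than MIN_POSTS posts.
  { $match: { count: { $gt: MIN_POSTS } } },
  // Emit one document per collected post again.
  { $unwind: "$posts" },
  // Promote the embedded post back to the top level (MongoDB 3.4+).
  { $replaceRoot: { newRoot: "$posts" } }
];

// In the mongo shell this would be run roughly as:
//   db.collection.aggregate(pipeline, { allowDiskUse: true })
console.log(JSON.stringify(pipeline, null, 2));
```

The output documents then have the original structure, one per post, instead of one nested document per user.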


If you want to save the result to the db, try the following:

var cur = db.collection.aggregate(
   [
     {$group: {
          _id: '$uid', 
          title: {$first: '$title'}, 
          post: {$first:'$post'}, 
          postTime:{$first: '$postTime'}, 
          unixPostTime:{$first: '$unixPostTime'},
          upvote:{$first: '$upvote'}, 
          count: {$sum: 1}
     }}, 
     {$match: {count: {$gt: 5}}}
   ]
)
cur.forEach(function(doc) {
   db.collection.update({_id: doc._id}, {/* the fields to be updated */});
});
zangw
  • Thanks! I tried `db.collection.aggregate( [ {$group: { _id: "$uid", title: "$title", post: "$post", postTime: "$postTime", unixPostTime: "$unixPostTime", upvote: "$upvote", count: {$sum: 1} } }, {$match: {count: {$gt: 5} } } ] )` but it kept failing: `"errmsg" : "exception: the group aggregate field 'title' must be defined as an expression inside an object", "code" : 15951, "ok" : 0` – leoce Feb 29 '16 at 08:34
  • Thanks for your continued suggestions! I think `$first` conflicts with `count`, because `$first` would only work for users who have 1 post. While the `$push` method looks functional, it creates a nested document like [merging two collections in mongodb](http://stackoverflow.com/a/9723549/1907154), but I still want to keep the original structure. That's why I didn't follow the merging method and said "joining them together (the original and the aggregation output) is not supported by mongodb". – leoce Feb 29 '16 at 13:13
  • @leoce, yes, `$first` is only for documents with a unique `uid`. However, if there are multiple documents for the same `uid`, `$push` should be used here. – zangw Feb 29 '16 at 13:18
  • `$push` creates one document for one unique user and her/his posts become sub-documents within. However I want to keep the original structure and make, e.g., 5 posts separate. I'm still searching to see if there's some way that I can update the original collection. – leoce Feb 29 '16 at 13:24
  • @leoce, as far as I know, there is no better solution in the current MongoDB version. If you find another good solution, please let me know... – zangw Feb 29 '16 at 13:44
  • Another answer suggested a "find $in var" approach and I tried it. The output looks like what I wanted. I'm using it now unless some error is found in the collection later. – leoce Mar 03 '16 at 05:27