
Assuming I have the following document structures:

> db.logs.find()
{
  '_id': ObjectId("50ad8d451d41c8fc58000003"),
  'name': 'Sample Log 1',
  'uploaded_at': ISODate("2013-03-14T01:00:00+01:00"),
  'case_id': '50ad8d451d41c8fc58000099',
  'tag_doc': {
    'group_x': ['TAG-1', 'TAG-2'],
    'group_y': ['XYZ']
  }
},
{
  '_id': ObjectId("50ad8d451d41c8fc58000004"),
  'name': 'Sample Log 2',
  'uploaded_at': ISODate("2013-03-15T01:00:00+01:00"),
  'case_id': '50ad8d451d41c8fc58000099',
  'tag_doc': {
    'group_x': ['TAG-1'],
    'group_y': ['XYZ']
  }
}

> db.cases.findOne()
{
  '_id': ObjectId("50ad8d451d41c8fc58000099"),
  'name': 'Sample Case 1'
}

Is there a way to perform a $match in the aggregation framework that will retrieve only the latest Log for each unique combination of case_id and group_x? I am sure this can be done with multiple $group stages, but as much as possible I want to immediately limit the number of documents that pass through the pipeline via the $match operator. I am thinking of something like the $max operator, except used in a $match.

Any help is very much appreciated.

Edit:

So far, I have come up with the following:

db.logs.aggregate([
  {$match: {...}}, // some match filters here
  {$project: {tag: '$tag_doc.group_x', case: '$case_id', latest: '$uploaded_at'}},
  {$unwind: '$tag'},
  {$group: {_id: {tag: '$tag', case: '$case'}, latest: {$max: '$latest'}}},
  {$group: {_id: '$_id.tag', total: {$sum: 1}}}
])

As I mentioned, what I want can be done with multiple $group stages, but this proves costly when handling a large number of documents. That is why I want to limit the documents as early as possible.

Edit:

I still haven't come up with a good solution, so I am wondering whether the document structure itself is not optimized for my use case. Do I have to update the fields to support what I want to achieve? Suggestions are very much appreciated.

Edit:

I am actually looking for a MongoDB implementation similar to the one expected in How can I SELECT rows with MAX(Column value), DISTINCT by another column in SQL?, except that it involves two distinct field values. Also, the $match operation is crucial because it makes the resulting set dynamic, with filters ranging from matching tags to ranges of dates.

Edit:

Due to the complexity of my use case, I tried to use a simple analogy, but this proved to be confusing. Above is now the simplified form of the actual use case. Sorry for the confusion I created.


4 Answers


I have done something similar. It's not possible with $match alone, but it can be done with a single $group stage. The trick is to use a compound key with the correct sort order:

   { user_id: 1, address: "xyz", date_sent: ISODate("2013-03-14T01:00:00+01:00"), message: "test" },
   { user_id: 1, address: "xyz2", date_sent: ISODate("2013-03-14T01:00:00+01:00"), message: "test" }

If I want to group on user_id and address, and I want the message with the latest date, we need to create an index like this:

{ user_id:1, address:1, date_sent:-1 }

Then you are able to run the aggregation without a $sort stage, which is much faster and will work on shards with replicas. If you don't have an index with the correct sort order, you can add a $sort stage, but then you can't use it efficiently with shards, because everything is transferred to mongos and the grouping is done there (you will also run into memory limit problems).
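For reference, a minimal sketch of creating that index (ensureIndex was the shell helper current at the time; later shells use createIndex):

 // Compound index matching the grouping fields, with date_sent descending
 // so the newest message in each group comes first on an index walk
 db.user_messages.ensureIndex({ user_id: 1, address: 1, date_sent: -1 });

The aggregation itself then looks like this: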

 db.user_messages.aggregate([
   { $match: { user_id: 1 } },
   { $group: {
       _id: "$address",
       count: { $sum: 1 },
       date_sent: { $max: "$date_sent" },
       message: { $first: "$message" }
   } }
 ]);

It's not documented that it should work like this, but it does. We use it on a production system.

notz
  • Some comments: I am thinking of performing a `$project` first to limit the fields that will go in the pipeline. Also, my `tags` field is an array so I will have to perform an `$unwind`, right? Correct me if I'm wrong, but after these operations, indexes are no longer usable, are they? – MervS Mar 19 '13 at 02:08
  • Yes, if you don't need all fields you should add a $project pipeline to save memory. Yes, you will need an $unwind pipeline - I haven't tested whether you can use a key with this - you should give it a try. – notz Mar 19 '13 at 23:11
  • From http://docs.mongodb.org/manual/applications/aggregation/#pipeline-operators-and-indexes, it appears that I may not be able to take advantage of indexes in those operations. I forgot to mention that in my actual schema, the 'tags' array is in an embedded document and I will need `$project` to move it to the first level of the document. – MervS Mar 20 '13 at 02:32
  • I tried your suggestion but found that, using the example you provided, it only groups the `user_messages` by `address`, with each `count` still including the older `message`s in each group. – MervS Mar 25 '13 at 05:45

I'd use another collection to 'create' the search results on the fly - by upserting a document in this new collection every time a new blog post is published.

Every new combination of author/tags is added as a new document in this collection, whereas a new post with an existing combination just updates an existing document with the content (or object ID reference) of the new blog post.

Example:

db.searchResult.update(
  { 'author_id': '50ad8d451d41c8fc58000099', 'tag_doc.tags': ["TAG-1", "TAG-2"] },
  { $set: { 'Referenceid': ObjectId("5152bc79e8bf3bc79a5a1dd8") } },  // or embed your blog post here
  { upsert: true }
)
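Reading the pre-built results is then a plain query against the new collection; a minimal sketch, with illustrative filter values:

 // Hypothetical read side: fetch the pre-built results for one
 // author/tag combination (a scalar matches any element of the tags array)
 db.searchResult.find({
   'author_id': '50ad8d451d41c8fc58000099',
   'tag_doc.tags': 'TAG-1'
 })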
balafi

Hmmm, there is no good way of doing this optimally such that you only pick out the latest document for each author; instead you will need to select all documents, sorted, and then group on author:

db.posts.aggregate([
    {$sort: {created_at:-1}},
    {$group: {_id: '$author_id', tags: {$first: '$tag_doc.tags'}}},
    {$unwind: '$tags'},
    {$group: {_id: {author: '$_id', tag: '$tags'}}}
]);

As you said, this is not optimal; however, it is all I have come up with.

To be honest, if you need to perform this query often, it might actually be better to pre-aggregate another collection that already contains the information you need, in the form of:

{
    _id: {},
    author: {},
    tag: 'something',
    created_at: ISODate(),
    post_id: {}
}

And each time you create a new post you seek out all documents in this unique collection which fulfill an $in query of what you need, and then update/upsert created_at and post_id in that collection, as in the sketch below. This would be more optimal.
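A minimal sketch of that maintenance step, assuming the pre-aggregated collection is called latest_posts and has the shape above (the collection and variable names are hypothetical):

 // Hypothetical upsert run whenever a new post is saved: one document
 // per author/tag pair, always pointing at the newest post for that pair
 var post = db.posts.findOne(); // the newly created post
 post.tag_doc.tags.forEach(function (tag) {
   db.latest_posts.update(
     { author: post.author_id, tag: tag },
     { $set: { created_at: post.created_at, post_id: post._id } },
     { upsert: true }
   );
 });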

Sammaye
  • The problem is that pre-aggregation will be hard to implement, since in my use case grouping is dynamic: the user can initially filter `posts` in a variety of ways, e.g. within a range of dates, matching a set of tags, etc. So the latest post for a group may be relative (I hope you get the point). – MervS Mar 26 '13 at 08:44
  • @nyde1319 Why do you need the document output you do? Can't you just match by the tags in the latest post of the author? Why do you need the output of unwound documents per author and tag? – Sammaye Mar 26 '13 at 08:49
  • In my use case, the user can perform filters on a collection of posts. Now the user also has the option to group the results based on a set of tags. In my example, the results can be grouped by distinct values of `tag_doc` or `other_tags`. In each resulting group, only the latest posts should be accounted for. We can say some posts can be different versions of a single post. Maybe a post-author example was not a good analogy to my actual case. – MervS Mar 26 '13 at 08:56
  • @nyde1319 Hmm, it sounds like the question you have given is not your actual use case. I looked at the linked answer you gave above, but that SQL query is about as scalable as the aggregation query I give above; it is useless. Can you give an example of some searches that can be performed? I need to get an idea of exactly what queries will exist. – Sammaye Mar 28 '13 at 08:18
  • Sorry, I was on holiday so I was not able to visit the site for days. OK, so I edited the question to match my actual use case. I am dealing with logs resulting from the execution of test cases. The requirement is that I have to group the latest logs for each test case based on unique group tags (which are in `tag_doc`). A sample search is that users can filter logs within a range of dates. This is tricky because one cannot use pre-aggregation, since the latest uploaded log may not be within the range selected by the user. – MervS Apr 01 '13 at 01:57

Here you go:

db.logs.aggregate([
  { "$sort"   : { "uploaded_at" : -1 } },
  { "$match"  : { ... } },
  { "$unwind" : "$tag_doc.group_x" },
  { "$group"  : {
      "_id"     : { "case" : "$case_id", "tag" : "$tag_doc.group_x" },
      "latest"  : { "$first" : "$uploaded_at" },
      "name"    : { "$first" : "$name" },
      "tag_doc" : { "$first" : "$tag_doc" }
  } }
]);

You want to avoid $max when you can $sort and take $first, especially if you have an index on uploaded_at, which would allow you to avoid any in-memory sorts and reduce the pipeline processing costs significantly. Obviously, if you have other "data" fields you would add them along with (or instead of) "name" and "tag_doc".
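For instance, assuming the $match filters on case_id (the field choice is an assumption), an index like the following would support both the filter and the descending sort:

 // Assumed index: lets the $match on case_id and the descending
 // $sort on uploaded_at be served without an in-memory sort
 db.logs.ensureIndex({ "case_id": 1, "uploaded_at": -1 });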

Asya Kamsky
  • I assume the `$match` and `$sort` pipelines are interchangeable and will both use indexes. – MervS Sep 19 '13 at 02:49
  • correct - if there is an index that supports them both, they will be combined during processing. – Asya Kamsky Sep 19 '13 at 19:51
  • One more thing: when you use `$first`, can you be sure that the values of the fields `latest`, `name`, and `tag_doc` belong to the same document? – MervS Sep 20 '13 at 00:48
  • Yes - once you have sorted, the document that comes first in each group supplies the values that end up in "$first". The only potential gotcha is if a field may be absent - you can $project using $ifNull and fill in a default value in that case. – Asya Kamsky Sep 20 '13 at 03:02
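For illustration, a minimal sketch of the $ifNull projection described in the last comment, to be inserted before the $group stage ("unknown" is an arbitrary default):

 // Hypothetical guard stage: fill in a default when name is absent,
 // so $first never picks up a missing field
 { "$project": {
     "uploaded_at": 1,
     "case_id": 1,
     "tag_doc": 1,
     "name": { "$ifNull": ["$name", "unknown"] }
 } }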