
I have a process that I'm currently using Mongo's Map/Reduce framework for, but it's not performing very well. It's a pretty simple aggregation, where I bucketize over 3 fields, returning the sum of 4 different fields, and passing through the values for another 4 fields (which are constant within each bucket).
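In aggregation-framework terms, the whole job is essentially a single $group stage, roughly like the following sketch (the collection and field names here are hypothetical):

// Group on the 3 bucket fields, sum the 4 value fields, and carry the
// 4 per-bucket-constant fields through with $first.
db.source.aggregate( [
    { $group : {
        _id : { k1 : "$k1", k2 : "$k2", k3 : "$k3" },
        sum1 : { $sum : "$v1" },
        sum2 : { $sum : "$v2" },
        sum3 : { $sum : "$v3" },
        sum4 : { $sum : "$v4" },
        c1 : { $first : "$c1" },   // constant within each bucket,
        c2 : { $first : "$c2" },   // so $first just passes it through
        c3 : { $first : "$c3" },
        c4 : { $first : "$c4" }
    } }
] )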

For reasons described in [Map-Reduce performance in MongoDb 2.2, 2.4, and 2.6], I'd like to convert this to the aggregation framework for better performance, but there are 3 things standing in the way, I think:

  1. The total result can be large, exceeding Mongo's 16MB limit, even though any one document in the result is very small.
  2. I can map/reduce directly to another collection, but the aggregation framework can only return results inline (I think?)
  3. For incremental updates as more data arrives in the source collection, I can map/reduce with MapReduceCommand.OutputType (in Java) set to REDUCE, which exactly matches my use case (sketched in the shell just below this list), but I don't see corresponding functionality in the aggregation framework.
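To make point 3 concrete, here is roughly what that incremental pattern looks like in the shell (the collection name, field name, and timestamp filter are hypothetical):

// Map/reduce only the newly arrived documents, re-reducing them into
// the existing output collection instead of recomputing everything.
db.source.mapReduce( mapFn, reduceFn, {
    query : { ts : { $gt : lastRunTime } },   // just the new data
    out : { reduce : "summary" }              // merge into "summary" via reduce
} )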

Are there good ways to solve these in the aggregation framework? The server is version 2.4.3 right now - we can probably update as needed if there are new capabilities.

Neil Lunn
Ken Williams
  • 1) Can you "chunk" the work into smaller segments? 2) Only inline 3) Aggregation is a run once deal (although you could save the results into another collection thru a client). – WiredPrairie Jun 20 '13 at 16:26
  • For 3), aggregation can easily be an incremental operation if your first step is a `{$match: ...}` to select just the new data. – Ken Williams Jun 20 '13 at 16:35
  • The worst part here is that the pipeline is very simple and it runs very fast, but I can't actually use the results. – Ken Williams Jun 20 '13 at 17:22
  • Why can't you use the results? – WiredPrairie Jun 20 '13 at 19:14
  • Because the total size is larger than 16MB. – Ken Williams Jun 20 '13 at 21:08
  • That's why I was asking if you could chunk the aggregation into smaller segments and save them into another collection? – WiredPrairie Jun 20 '13 at 21:58
  • It's possible maybe, but not easy. It's hard to predict the size of any given chunk's output, and the chunking scheme may need to be reworked as the data changes. What I mean by my "worst part" comment is that Mongo seems to have no problem calculating the results, but chokes in actually returning them. – Ken Williams Jun 21 '13 at 02:27
  • How many results at most can be returned from the pipeline in the end? – Asya Kamsky Jun 21 '13 at 16:56

2 Answers


You can do that now with $out, as explained in the MongoDB documentation:

$out takes the documents returned by the aggregation pipeline and writes them to a specified collection. The $out operator lets the aggregation framework return result sets of any size. The $out operator must be the last stage in the pipeline.

The command has the following syntax, where <output-collection> is the collection that will hold the output of the aggregation operation. $out is only permissible at the end of the pipeline:

db.<collection>.aggregate( [
    { <operation> },
    { <operation> },
    ...,
    { $out : "<output-collection>" }
] )
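Note that $out first appeared in MongoDB 2.6, and it replaces the contents of the output collection on each run rather than merging into it, so it corresponds to map/reduce's REPLACE output mode rather than REDUCE.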
lesolorzanov
  • Unfortunately that still doesn't address issue #3 in the original question, but I agree it's a big step forward. – Ken Williams Apr 19 '14 at 02:46
  • Can't use $out with pre 2.6 though. Thoughts on how to handle aggregation to another collection when $out is not available? – conner.xyz Aug 15 '15 at 20:21

The aggregation framework currently cannot output to another collection directly, but you can try the approach in this discussion: SO-questions-output aggregate to new collection. Map/reduce is much slower, and I too have been waiting for a solution. You could also try the Hadoop to MongoDB connector, which is supported on the MongoDB website; Hadoop is faster at map/reduce, but I don't know whether it would be well suited to your specific case.
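For what it's worth, on a 2.4 server the client-side workaround from that discussion looks roughly like the sketch below. The shell's aggregate() helper returns its results inline in a result array there, so the 16MB limit still applies to the aggregation's output; the names and the incremental $match filter are hypothetical:

// Aggregate inline, then upsert each result document into a target
// collection, $inc-ing the sums so that repeated incremental runs
// merge the way map/reduce's REDUCE output mode does.
var res = db.source.aggregate( [
    { $match : { ts : { $gt : lastRunTime } } },   // only the new data
    { $group : { _id : { k1 : "$k1" }, sum1 : { $sum : "$v1" } } }
] );
res.result.forEach( function ( doc ) {
    db.summary.update(
        { _id : doc._id },
        { $inc : { sum1 : doc.sum1 } },
        { upsert : true }
    );
} );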

Link to hadoop + MongoDB connector

All the best.

Sai
  • That solves problem #2, but unfortunately not #1 and #3. I could probably solve #3 by getting the results & reducing manually. But the only way I can think of to solve #1 is to bite off much smaller pieces at a time, which is hard to figure out in advance of the query. – Ken Williams Jun 20 '13 at 16:32
  • Clarification - when I say "that", I meant your first idea. I'll have a look at the Hadoop+MongoDB connector, I haven't looked at it before. Thanks. – Ken Williams Jun 20 '13 at 16:34
  • Yes, breaking it up is probably the only way to solve #1 (sorry, that doesn't help much). But that is mostly impractical as the data gets larger. I have had similar issues in the past with the aggregation framework's limited usability. Hope MongoDB comes up with an upgrade to either Mapreduce or the Aggregation framework. – Sai Jun 20 '13 at 16:38
  • or just wait for 2.6, which will return a cursor from the aggregation framework and allow you to output results to another collection. – Asya Kamsky Jun 21 '13 at 16:47
  • That's great to hear @AsyaKamsky, I didn't know that. Is there a roadmap online somewhere showing what's scheduled to be in 2.6? – Ken Williams Jun 25 '13 at 15:57
  • Yep - you can see it in its crudest form in the "fix version" field of any server ticket: https://jira.mongodb.org/browse/SERVER-3253 which you already saw, and you can see by clicking on the fix version where it fits within the "sequence" of versions - i.e. 2.5.w is the top bucket to go into 2.6 release, before all the 2.5.x and 2.5.desired, etc. – Asya Kamsky Jun 25 '13 at 16:48