
I'm trying to use the aggregation framework to group a lot of strings together to identify the unique ones. I also need to keep some information about the rest of the fields. This would be analogous to using the * selector in MySQL with a GROUP BY statement:

SELECT * 
FROM my_table
GROUP BY field1 

I have tried using the aggregation framework, and it works fine for just getting the unique fields:

db.mycollection.aggregate({ 
    $group : { _id : "$field1"}
    })

What if I want the other fields that went with that? MySQL would only give me the first one that appeared in the group (which I'm fine with). That's what I thought the $first operator did:

db.mycollection.aggregate({
  $group : { 
       _id : "$field1",
       another_field : {$first : "$field2"}
  }})

This way it groups by field1 but still gives me back the other fields attached to the document. When I try this I get:

 exception: aggregation result exceeds maximum document size (16MB)

I have a feeling this is because it is returning the whole aggregation result back as one document. Can I return it as a JSON array instead?

Thanks in advance.

jwillis0720

3 Answers


You're doing the aggregation correctly, but as the error message indicates, the full result of the aggregate call cannot be larger than 16 MB.

Work-arounds would be to either add a filter to reduce the size of the result or use map-reduce instead and output the result to another collection.
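A minimal sketch of the filtering approach, assuming a hypothetical some_date field you can restrict on (any selective $match condition before the $group stage would do; the field name and date are placeholders):

db.mycollection.aggregate([
    // hypothetical filter: only feed a subset of the documents into the group
    { $match : { some_date : { $gte : ISODate("2012-01-01") } } },
    { $group : { _id : "$field1", another_field : { $first : "$field2" } } }
])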

JohnnyHK
  • Be careful: you need to sort before taking the first element: http://docs.mongodb.org/manual/reference/aggregation/first/#grp._S_first – Bill'o Oct 03 '13 at 14:01
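Building on that comment, a rough sketch of sorting before grouping so that $first picks a predictable document per group (field names as in the question; the sort order is just an assumption):

db.mycollection.aggregate([
    // sort first so that $first returns a deterministic document per group
    { $sort  : { field1 : 1, field2 : 1 } },
    { $group : { _id : "$field1", another_field : { $first : "$field2" } } }
])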

If the number of unique values in the result does not exceed 20,000, you could use the group() function, like:

db.mycollection.group({
    key : { field1 : 1, field2 : 1 },
    reduce : function(curr, result) {},
    initial : {}
})

The last option would be map-reduce:

db.mycollection.mapReduce(
    function() { emit({ field1 : this.field1, field2 : this.field2 }, 1); },
    function(key, values) { return 1; },
    { out : { replace : "unique_field1_field2" } }
)

and your result would be in the "unique_field1_field2" collection.
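Once the map-reduce has finished, the unique pairs can be read back out of that collection; each _id is one distinct {field1, field2} combination:

// each document's _id holds one unique field1/field2 combination
db.unique_field1_field2.find()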

Stackee007

Another alternative is to use the distinct function:

db.mycollection.distinct('field1')

This function accepts a second argument, a query, which lets you filter the documents.
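For example, a sketch with a purely illustrative filter (the field2 value is an assumption):

// only collect distinct field1 values from documents matching the filter
db.mycollection.distinct('field1', { field2 : "some_value" })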

Miguel Cartagena
  • This is handy but does not solve the problem of returning all the other fields with the distinct document. – jwillis0720 Dec 27 '12 at 01:46
  • Sorry, I didn't realize this requirement. As the results are too big, you could use the $skip and $limit pipeline stages to partition the results. The trade-off here is that you'll need to run the aggregation many times and programmatically merge the results. – Miguel Cartagena Dec 27 '12 at 11:15