85

I have gone through several articles and examples, and have yet to find an efficient way to do this SQL query in MongoDB (where there are millions of rows documents)

First attempt

(e.g. from this almost duplicate question - Mongo equivalent of SQL's SELECT DISTINCT?)

db.myCollection.distinct("myIndexedNonUniqueField").length

Obviously I got this error as my dataset is huge

Thu Aug 02 12:55:24 uncaught exception: distinct failed: {
        "errmsg" : "exception: distinct too big, 16mb cap",
        "code" : 10044,
        "ok" : 0
}

Second attempt

I decided to try and do a group

db.myCollection.group({key: {myIndexedNonUniqueField: 1},
                initial: {count: 0}, 
                 reduce: function (obj, prev) { prev.count++;} } );

But I got this error message instead:

exception: group() can't handle more than 20000 unique keys

Third attempt

I haven't tried yet but there are several suggestions that involve mapReduce

e.g.

Also

It seems there is a pull request on GitHub fixing the .distinct method to mention it should only return a count, but it's still open: https://github.com/mongodb/mongo/pull/34

But at this point I thought it's worth to ask here, what is the latest on the subject? Should I move to SQL or another NoSQL DB for distinct counts? or is there an efficient way?

Update:

This comment on the MongoDB official docs is not encouraging, is this accurate?

http://www.mongodb.org/display/DOCS/Aggregation#comment-430445808

Update2:

Seems the new Aggregation Framework answers the above comment... (MongoDB 2.1/2.2 and above, development preview available, not for production)

http://docs.mongodb.org/manual/applications/aggregation/

Community
  • 1
  • 1
Eran Medan
  • 44,555
  • 61
  • 184
  • 276
  • I assume you need to do this frequently or performance wouldn't matter that much. In that case I'd store the distinct values in a separate collection that's updated when you insert a new document instead of trying to do a distinct on a collection that large. Either that or I'd re-evaluate my use of MongoDb and possibly move to something else. As you found, MongoDb currently isn't good at what you're trying to do. – Tim Gautier Aug 02 '12 at 18:48
  • @TimGautier thanks, I was afraid so, it took hours to insert all those values, and I should have thought of that before :) I think I'll spend the time now to insert it to MySQL for those statistics... – Eran Medan Aug 02 '12 at 18:56
  • You can also do an incremental MR basically emulating delta indexing of aggregate data. I mean it depends on when you need the results as to what you use. I can imagine that MySQL would prolly get a lot of IO and what not from doing this (I can kill a small server with distincting just 100k docs inline on an index) but I suppose it is more flexible in querying for this sort of stuff still. – Sammaye Aug 02 '12 at 20:34
  • I disagree that mongo is not good at this sort of thing. This sort if thing is what Mongo excels at. – superluminary Jul 03 '15 at 06:49
  • 1
    Unfortunately moderator deleted my answer that I also posted on duplicate question. I can't delete it there and repost here thus link: http://stackoverflow.com/a/33418582/226895 – expert Nov 01 '15 at 20:03

4 Answers4

75

1) The easiest way to do this is via the aggregation framework. This takes two "$group" commands: the first one groups by distinct values, the second one counts all of the distinct values

pipeline = [ 
    { $group: { _id: "$myIndexedNonUniqueField"}  },
    { $group: { _id: 1, count: { $sum: 1 } } }
];

//
// Run the aggregation command
//
R = db.runCommand( 
    {
    "aggregate": "myCollection" , 
    "pipeline": pipeline
    }
);
printjson(R);

2) If you want to do this with Map/Reduce you can. This is also a two-phase process: in the first phase we build a new collection with a list of every distinct value for the key. In the second we do a count() on the new collection.

var SOURCE = db.myCollection;
var DEST = db.distinct
DEST.drop();


map = function() {
  emit( this.myIndexedNonUniqueField , {count: 1});
}

reduce = function(key, values) {
  var count = 0;

  values.forEach(function(v) {
    count += v['count'];        // count each distinct value for lagniappe
  });

  return {count: count};
};

//
// run map/reduce
//
res = SOURCE.mapReduce( map, reduce, 
    { out: 'distinct', 
     verbose: true
    }
    );

print( "distinct count= " + res.counts.output );
print( "distinct count=", DEST.count() );

Note that you cannot return the result of the map/reduce inline, because that will potentially overrun the 16MB document size limit. You can save the calculation in a collection and then count() the size of the collection, or you can get the number of results from the return value of mapReduce().

William Z
  • 10,989
  • 4
  • 31
  • 25
  • 5
    I downloaded Mongo 2.2 RC0, and used your 1st suggestion, and it works! and fast! thank you (well done 10gen...) Created a gist here (used the shortcut aggregate command and put it in one line) https://gist.github.com/3241616 – Eran Medan Aug 02 '12 at 22:54
  • @EranMedan I should warn you though, I didn't suggest the aggregation framework because 2.2 rc0 is still not really ready for full deployment, just something to bare in mind, I would wait until the full release of 2.2 before recommending deployment of the aggregation framework. – Sammaye Aug 03 '12 at 08:49
  • @Sammaye yes, thanks I'm aware of it, will not go to production yet, I needed that for internal statistics and wanted to avoid moving data to SQL if possible (and quench my curiosity) – Eran Medan Aug 03 '12 at 14:12
  • Why won't Mongo accept :this.plugins.X-Powered-By.string ? How would I escape this? – EarlyPoster Dec 04 '12 at 22:33
  • I'm wondering if this answer is reliable for a sharded environment. As I understand it, shards will each do their own aggregation and then return the result where the results will then be aggregated. So in this scenario, wouldn't we have the opportunity for duplicates to exist since the distinct values have been lost in the second `$group` statement before being passed back to mongos? – Verran Apr 02 '15 at 23:57
  • why I always get {"_id": 1, "count": 1} – shenyan Mar 02 '17 at 03:35
  • I missed the $ before field name – shenyan Mar 02 '17 at 06:52
  • @William Z .. How to do map reduce or even pipeline option if the nonuniqueindex is composite and both are set as 1 ? – user2854333 Aug 14 '17 at 21:04
38
db.myCollection.aggregate( 
   {$group : {_id : "$myIndexedNonUniqueField"} }, 
   {$group: {_id:1, count: {$sum : 1 }}});

straight to result:

db.myCollection.aggregate( 
   {$group : {_id : "$myIndexedNonUniqueField"} }, 
   {$group: {_id:1, count: {$sum : 1 }}})
   .result[0].count;
Scott
  • 16,711
  • 14
  • 75
  • 120
Stackee007
  • 3,196
  • 1
  • 26
  • 39
  • 1
    Right, that's better. But isn't that the same answer that William already provided? – JohnnyHK Mar 04 '13 at 22:01
  • 3
    Similar, but I like the fact that it's on one line. I got an error though: "Cannot read property '0' of undefined" Remove the last line and it works beautifully. – Nico Mar 05 '16 at 21:49
  • 1
    and if we talk about realy huge database, don't forget { allowDiskUse: true } so, db.myCollection.aggregate( [ {$group ..}, {$group:}], { allowDiskUse: true }).result[0].count; – hi_artem Mar 19 '18 at 10:34
1

Following solution worked for me

db.test.distinct('user'); [ "alex", "England", "France", "Australia" ]

db.countries.distinct('country').length 4

Munib mir
  • 43
  • 2
  • 2
    Of course it works for small sets. The question was about datasets with millions of records, where your solution will likely throw the following error: "MongoError: distinct too big, 16mb cap" – manuna Jan 26 '21 at 16:47
1
db.myCollection.aggregate([
    {$group: {_id: "$myIndexedNonUniqueField"}},
    {$count: "count"}
]).next().count;

This uses the MongoDB aggregation functionality to retrieve the total count. It uses the $group and $count aggregation pipeline stages to first group by the unique field, then return the total number of distinct groupings as a field called count. It retrieves the first (and only) element of the result cursor using next(), and retrieves the count property from that item.

Aggregation result:

[ { count: 227807 } ]
M. Justin
  • 14,487
  • 7
  • 91
  • 130