The highest accepted answer here has this:
uniqueIds: { $addToSet: "$_id" },
That would also return to you a new field called uniqueIds with a list of ids. But what if you just want the field and its count? Then it would be this:
db.collection.aggregate([
{$group: { _id: {name: "$name"},
count: {$sum: 1} } },
{$match: { count: {"$gt": 1} } }
]);
To explain this, if you come from SQL databases like MySQL and PostgreSQL, you are accustomed to aggregate functions (e.g. COUNT(), SUM(), MIN(), MAX()) which work with the GROUP BY statement allowing you, for example, to find the total count that a column value appears in a table.
SELECT COUNT(*), my_type FROM table GROUP BY my_type;
+----------+-----------------+
| COUNT(*) | my_type |
+----------+-----------------+
| 3 | Contact |
| 1 | Practice |
| 1 | Prospect |
| 1 | Task |
+----------+-----------------+
As you can see, our output shows the count that each my_type value appears. To find duplicates in MongoDB, we would tackle the problem in a similar way. MongoDB boasts aggregation operations, which group values from multiple documents together, and can perform a variety of operations on the grouped data to return a single result. It's a similar concept to aggregate functions in SQL.
Assuming a collection called contacts, the initial setup looks as follows:
db.contacts.aggregate([ ... ]);
This aggregate function takes an array of aggregation operators, and in our case, we desire the $group operator, since our goal is to group the data by the field's count, that is, the number of occurances of the field value.
db.contacts.aggregate([
{$group: {
_id: {name: "$name"}
}
}
]);
There's a little idiosyncracy to this approach. The _id field is required to use the group by operator. In this case, we are grouping the $name field. The key name within _id can have any name. But we use name since it is intuitive here.
By running the aggregation using only the $group operator, we will get a list of all the name fields (regardless if they appear once or more than once in the collection):
db.contacts.aggregate([
{$group: {
_id: {name: "$name"}
}
}
]);
{ "_id" : { "name" : "John" } }
{ "_id" : { "name" : "Joan" } }
{ "_id" : { "name" : "Stephen" } }
{ "_id" : { "name" : "Rod" } }
{ "_id" : { "name" : "Albert" } }
{ "_id" : { "name" : "Amanda" } }
Notice above how aggregation works. It took documents with name fields and returns a new collection of the name fields extracted.
But what we want to know is how many times does the field value reappear. The $group operator takes a count field which uses the $sum operator to add the expression 1 to the total for each document in the group. So the $group and $sum together returns the collective sum of all the numeric values that result for a given field (e.g. name).
db.contacts.aggregate([
{$group: {
_id: {name: "$name"},
count: {$sum: 1}
}
}
]);
{ "_id" : { "name" : "John" }, "count" : 1 }
{ "_id" : { "name" : "Joan" }, "count" : 3 }
{ "_id" : { "name" : "Stephen" }, "count" : 2 }
{ "_id" : { "name" : "Rod" }, "count" : 3 }
{ "_id" : { "name" : "Albert" }, "count" : 2 }
{ "_id" : { "name" : "Amanda" }, "count" : 1 }
Since the goal was to eliminate duplicates, it requires one extra step. To get only the groups that have a count of more than one, we can use the $match operator to filter our results. Within the $match operator, we'll tell it to look at the count field and tell it to look for counts greater than one using the $gt operator representing "greater than" and the number 1.
db.contacts.aggregate([
{$group: { _id: {name: "$name"},
count: {$sum: 1} } },
{$match: { count: {"$gt": 1} } }
]);
{ "_id" : { "name" : "Joan" }, "count" : 3 }
{ "_id" : { "name" : "Stephen" }, "count" : 2 }
{ "_id" : { "name" : "Rod" }, "count" : 3 }
{ "_id" : { "name" : "Albert" }, "count" : 2 }
As a side note, if you are using MongoDB through a ORM like Mongoid for Ruby, you might get this error:
The 'cursor' option is required, except for aggregate with the explain argument
This most likely means your ORM is out of date and is performing operations that MongoDB no longer supports. Consequently, either update your ORM or find a fix. For Mongoid, this was the fix for me:
module Moped
class Collection
# Mongo 3.6 requires a `cursor` option be passed as part of aggregate queries. This overrides
# `Moped::Collection#aggregate` to include a cursor, which is not provided by Moped otherwise.
#
# Per the [MongoDB documentation](https://docs.mongodb.com/manual/reference/command/aggregate/):
#
# Changed in version 3.6: MongoDB 3.6 removes the use of `aggregate` command *without* the `cursor` option unless
# the command includes the `explain` option. Unless you include the `explain` option, you must specify the
# `cursor` option.
#
# To indicate a cursor with the default batch size, specify `cursor: {}`.
#
# To indicate a cursor with a non-default batch size, use `cursor: { batchSize: <num> }`.
#
def aggregate(*pipeline)
# Ordering of keys apparently matters to Mongo -- `aggregate` has to come before `cursor` here.
extract_result(session.command(aggregate: name, pipeline: pipeline.flatten, cursor: {}))
end
private
def extract_result(response)
response.key?("cursor") ? response["cursor"]["firstBatch"] : response["result"]
end
end
end