Find duplicate records in MongoDB

Question

How would I find duplicate fields in a mongo collection.

I'd like to check if any of the "name" fields are duplicates.

{
    "name" : "ksqn291",
    "__v" : 0,
    "_id" : ObjectId("540f346c3e7fc1054ffa7086"),
    "channel" : "Sales"
}

Many thanks!

score 375 · Accepted Answer · edited Sep 08 '20 at 10:19

375

Use aggregation on name and get name with count > 1:

db.collection.aggregate([
    {"$group" : { "_id": "$name", "count": { "$sum": 1 } } },
    {"$match": {"_id" :{ "$ne" : null } , "count" : {"$gt": 1} } }, 
    {"$project": {"name" : "$_id", "_id" : 0} }
]);

To sort the results by most to least duplicates:

db.collection.aggregate([
    {"$group" : { "_id": "$name", "count": { "$sum": 1 } } },
    {"$match": {"_id" :{ "$ne" : null } , "count" : {"$gt": 1} } }, 
    {"$sort": {"count" : -1} },
    {"$project": {"name" : "$_id", "_id" : 0} }     
]);

To use with another column name than "name", change "$name" to "$column_name"

edited Sep 08 '20 at 10:19

Prakash Harvani

1,007
7
18

answered Nov 18 '14 at 01:18

anhlc

13,839
4
33
40

2

`"$match": {"_id" :{ "$ne" : null } `- is unnecessary here, since the second part of the statement would suffice filtering the result. So only checking for the group having `count > 1` will do. – BatScream Nov 18 '14 at 01:21
7

Tks @BatScream. { "$ne" : null } is there just in case 'name' is null or doesn't exist. Aggregation will count null as well. – anhlc Nov 18 '14 at 01:30
1

Welcome. But then why check the `_id` field. It is always guaranteed to be not null after the `group` operation. – BatScream Nov 18 '14 at 01:32
4

The `_id` of a document from a `$group` stage can be null. – wdberkeley Nov 18 '14 at 08:35
1

What will be the output of this? If i run i get all the documents what i need is i want only the duplicated id's/names. – Kannan T Dec 18 '17 at 06:38
Is there a way to also know how many duplicates per match record? – Urasquirrel May 28 '19 at 17:57
1

@Urasquirrel, use {"$project": {"name" : "$_id", "count":"$count", "_id" : 0} } – anhlc May 29 '19 at 03:27
How can I get the duplicate data, based on two fields. Basic example: let say I've collection in which I'm storing social details like: ``` [{username: 'abc', type: 'facebook'}, {username: 'abc', type: 'instagram'}] ``` so In that case I don't want only based on username, but based on both "username & type". thanks :) – Shivam Verma May 31 '21 at 09:58
1

@ShivamVerma db.collection.aggregate([ {"$group" : { "_id": {"username":"$username", "type":"$type"}, "count": { "$sum": 1 } } }, {"$match": {"_id" :{ "$ne" : null } , "count" : {"$gt": 1} } }, {"$sort": {"count" : -1} }, {"$project": {"username" : "$_id.username", "type" : "$_id.type", "_id" : 0} } ]); – anhlc Jun 03 '21 at 05:22
Error: Exceeded memory limit for $group – Luca Ziegler Jun 21 '22 at 08:33
@LucaZiegler, add { allowDiskUse = true } option: db.collection.aggregate([..],{ allowDiskUse = true }) – anhlc Jun 22 '22 at 07:08

score 44 · Answer 2 · answered Nov 18 '14 at 01:13

44

You can find the list of duplicate names using the following aggregate pipeline:

Group all the records having similar name.
Match those groups having records greater than 1.
Then group again to project all the duplicate names as an array.

The Code:

db.collection.aggregate([
{$group:{"_id":"$name","name":{$first:"$name"},"count":{$sum:1}}},
{$match:{"count":{$gt:1}}},
{$project:{"name":1,"_id":0}},
{$group:{"_id":null,"duplicateNames":{$push:"$name"}}},
{$project:{"_id":0,"duplicateNames":1}}
])

o/p:

{ "duplicateNames" : [ "ksqn291", "ksqn29123213Test" ] }

answered Nov 18 '14 at 01:13

BatScream

19,260
4
52
68

2

The fact that you explain what each line does makes this answer optimal. – Zap Dec 07 '20 at 10:48
How can I get the duplicate data, based on two fields. Basic example: let say I've collection in which I'm storing social details like: ``` [{username: 'abc', type: 'facebook'}, {username: 'abc', type: 'instagram'}] ``` so In that case I don't want only based on username, but based on both "username & type". thanks :) – Shivam Verma May 31 '21 at 09:59

score 19 · Answer 3 · edited Apr 27 '22 at 21:44

19

Another option is to use $sortByCount stage.

db.collection.aggregate([
  { $sortByCount: '$name' }
]

This is the combination of $group & $sort.

The $sortByCount stage is equivalent to the following $group + $sort sequence:
    { $group: { _id: <expression>, count: { $sum: 1 } } },
    { $sort: { count: -1 } }

edited Apr 27 '22 at 21:44

M. Justin

14,487
7
91
130

answered Jan 05 '22 at 04:58

Tan Dat

2,888
1
17
39

is `$name` the field name? also how can you change the sorting to sort by count descending? Thanks! – fIwJlxSzApHEZIl Jan 24 '22 at 16:03
1

To only get duplicates (as per the question), add a `$match` stage to the aggregation after the `$sortByCount`: `{$match: {count: {$gt: 1}}}` – M. Justin Apr 27 '22 at 21:39

score 15 · Answer 4 · answered Dec 13 '16 at 13:53

The answer anhic gave can be very inefficient if you have a large database and the attribute name is present only in some of the documents.

To improve efficiency you can add a $match to the aggregation.

db.collection.aggregate(
    {"$match": {"name" :{ "$ne" : null } } }, 
    {"$group" : {"_id": "$name", "count": { "$sum": 1 } } },
    {"$match": {"count" : {"$gt": 1} } }, 
    {"$project": {"name" : "$_id", "_id" : 0} }
)

score 14 · Answer 5 · edited Jul 15 '20 at 10:53

14

db.getCollection('orders').aggregate([  
    {$group: { 
            _id: {name: "$name"},
            uniqueIds: {$addToSet: "$_id"},
            count: {$sum: 1}
        } 
    },
    {$match: { 
        count: {"$gt": 1}
        }
    }
])

First Group Query the group according to the fields.

Then we check the unique Id and count it, If count is greater then 1 then the field is duplicate in the entire collection so that thing is to be handle by $match query.

edited Jul 15 '20 at 10:53

Paul Rumkin

6,737
2
25
35

answered May 24 '16 at 06:17

Aman shrivastava

156
1
7

1

haven't been able to make this one work for me too. Down voting! – Mathieu G Dec 08 '16 at 15:06
1

This post is old but may help some one . check this out I'll check in my local it's working. Even I came across one blog regarding this. Please have a look. https://www.compose.com/articles/finding-duplicate-documents-in-mongodb/ – Aman shrivastava May 02 '19 at 13:17
I was able to get it to work - edited to update to confirmed working version. – A. L. Strine Jul 15 '20 at 05:21

score 5 · Answer 6 · answered Jan 04 '22 at 20:10

5

this is how we can achieve this in mongoDB compass

answered Jan 04 '22 at 20:10

Tanzeel

455
6
16

score 2 · Answer 7 · edited Dec 14 '22 at 14:21

2

In case you need to see all duplicated rows:

db.collection.aggregate([
     {"$group" : { "_id": "$name", "count": { "$sum": 1 },"data": { "$push": "$$ROOT" }}},
     {"$unwind": "$data"},
     {"$match": {"_id" :{ "$ne" : null } , "count" : {"$gt": 1} } }, 
]);

edited Dec 14 '22 at 14:21

mrded

4,674
2
34
36

answered Aug 04 '21 at 05:57

Andoctorey

728
9
11

Error: Line 4: Unexpected token { – dcsan Dec 22 '21 at 08:46

Julesezaar · Answer 8 · 2021-06-25T12:49:52.183

If somebody is looking for a query for duplicates with an extra "$and" where clause, like "and where someOtherField is true"

The trick is to start with that other $match, because after the grouping you don't have all the data available anymore

// Do a first match before the grouping
{ $match: { "someOtherField": true }},
{ $group: {
    _id: { name: "$name" },
    count: { $sum: 1 }
}},
{ $match: { count: { $gte: 2 } }},

I searched for a very long time to find this notation, hope I can help somebody with the same problem

score 1 · Answer 9 · edited Apr 06 '23 at 07:24

Sometimes you want to find duplicates regardless the case, when you want to create a case insensitive index for instance. In this case you can use this aggregation pipeline

db.collection.aggregate([
  {'$group': {'_id': {'$toLower': '$name'}, 'count': { '$sum': 1 }, 'duplicates': { '$push': '$$ROOT' } } },
  {'$match': { 'count': { '$gt': 1 } } }
]);

Explanation:

group by name but first change the case to lower case and push the docs to the duplicates array.
match those groups having records greater than 1 (the duplicates).

Denis Kanash · Answer 10 · 2022-11-16T00:31:56.970

0

Search for duplicates in Compass Mongo db using $sortByCount
[screenshot]: https://i.stack.imgur.com/L85QV.png

edited Nov 16 '22 at 00:31

answered Nov 16 '22 at 00:28

Denis Kanash

1
2

Find duplicate records in MongoDB

10 Answers10

Linked

Related