Find duplicates across multiple MongoDb collections

Question

Coming from MySQL, I am wondering, how can we find duplicates across multiple collections in MongoDB ?

Let say I have two (or more) collections :

human :

_id
firstname

cat:

_id
nickname

What would be an efficient solution to list duplicated names. This includes if a name is used by 2+ users only, by 2+ cats only, or by at least one user and one cat. Our result should therefor contains duplicates of both collections AND duplicates across those collections (cats and humans with the same name)

Expected result :

The list of the duplicated values, the number of occurence could be interesting but is not essential.

Question is not about whether or not the proposed db schema would be appropriate in this situation, but about the best MongoDB solution.

Edit

My description of a duplicate was not really what I intended it, if it is not existing in one collection but is duplicated in another collection it still is a duplicate

It's not for a real use-case just for knowledge so let's say the last one, 3.4 — DessDess, Dec 05 '16 at 12:18

Sergey Berezovskiy · Accepted Answer · 2016-12-05T11:34:24.593

1

For two collections with MongoDB 3.2 you can use $lookup aggregation (it's equivalent to left outer join you have used in MySQL):

db.human.aggregate([
    {$group: {_id: "$firstname"}},
    {$lookup: {
           from: "cat",
           localField: "_id",
           foreignField: "nickname",
           as: "cats"
         }
    },
    {$match:{cats:{$ne:[]}}},
    {$project: {catsCount:{$size:"$cats"}}}])

Stages:

Group humans by name as there could be humans with same name
Attach to each group array of cats which have same nickname as group id (i.e. human firstname)
Filter out those humans which don't have any matches in cats collection
Project result to get only names and count of matches

Result will look like

[
  { _id: "Bob", catsCount: 2 },
  { _id: "Alex", catsCount: 1 }
]

NOTE: If you need to join several collections, you can apply $lookup stage several times.

edited Dec 05 '16 at 11:34

answered Dec 05 '16 at 11:19

Sergey Berezovskiy

232,247
41
429
459

@DessDess just remember that MongoDB is different from relational databases like MySQL. It offers nice scalability, schemaless documents and options for embedding data into documents (e.g. list of addresses for user). But it does not have multi-document transactions and the main purpose of joins here is reporting. – Sergey Berezovskiy Dec 05 '16 at 12:24
Yes I do agree that this is type of situation should not happen with a well design schema, it is just an open question. Regarding your answer i've discovered that my question was not well explained sorry about that, what if we also want duplicated users to be considered as duplicated ? I have updated my question accordingly – DessDess Dec 05 '16 at 12:31
Ok of course you could combine this result with 2 queries to get duplicates from each collection so your answer is still good, but is there a way to do it in a single query ? – DessDess Dec 05 '16 at 12:41
1

@DessDess unfortunately mongo supports only left joins. If you want to check duplicates not only between collections but also within collections, then you should do it at application level. I.e. you need one more query to check duplicates with `{$group: {_id:"$nickname", count: {$count:1}}}` – Sergey Berezovskiy Dec 05 '16 at 14:05

score 0 · Answer 2 · edited May 23 '17 at 11:52

A solution I found :

Code :

mapHuman = function() {
    var values = {
        name: this.firstname
    };
    emit(this._id, values);
};
mapCat = function() {
    var values = {
        name: this.name
    };
    emit(this._id, values);
};

reduce = function(k, values) {
    var result = {names: []};
    values.forEach(function(value) {
        result.names.push(value.slug);
    });
    return result;
};

db.human.mapReduce(mapUsers, reduce, {"out": {"reduce": "name"}});
db.cat.mapReduce(mapUsers, reduce, {"out": {"reduce": "name"}});

db.name.aggregate(
    {"$group" : { "_id": "$value.name", "count": { "$sum": 1 } } },
    {"$match": {"_id" :{ "$ne" : null } , "count" : {"$gt": 1} } }, 
    {"$project": {"name" : "$_id", "_id" : 0}},
    {"$out": "duplicate"} 
)

db.duplicate.find()

Find duplicates across multiple MongoDb collections

2 Answers2