
I am currently working on an online store with MongoDB, but I accidentally duplicated all the brands listed in the "wholesalers" collection. I have been looking for a query to remove or prune these duplicate entries, but thus far nothing seems to work.

db.wholesalers.find().pretty()

{
    "_id" : ObjectId("..."),
    "brands" : [
        "Seiko",
        "Breil",
        "Lorus",
        "Seiko",
        "Breil",
        "Lorus"
    ],
    "name" : "Seiko Nederlands B.V.",
    "address" : "Daniel Pichotstraat",
    "housenr" : "17-31",
    "postalcode" : "3115JB",
    "city" : "Schiedam",
    "phone" : "+31 (0)10 - 400 98 66",
    "email" : "info@seiko.nl",
    "web" : "http://www.seiko.nl/Default",
    "kind" : "Horloges",
    "country" : "",
    "brandsNetherlands" : [ ]
}

This is an example of a single document in my database. As you can see, the brands listed in the "brands" array have all been duplicated, and I need a way to get rid of them. What is the best way to do this?

R. Zwart
  • Do you want to do it in the Mongo Shell? – Mahdi Aug 01 '16 at 12:13
  • The simplest way is writing a script which will update all your records. And in the future you should use the $addToSet operator to prevent duplicates (see the sketch below). – Sabik Aug 01 '16 at 12:17
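
(A minimal sketch of that $addToSet suggestion, assuming new brands are appended one at a time; the brand "Casio" is only a hypothetical example:)

db.wholesalers.update(
    { "name": "Seiko Nederlands B.V." },
    // $addToSet only appends the value if it is not already in the array,
    // so duplicates cannot creep back in
    { "$addToSet": { "brands": "Casio" } }
)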

3 Answers


Just run the script (don't forget to back up first ^_^):

db.wholesalers.find().forEach(saler => {
  db.wholesalers.update(
    { _id: saler._id },
    // a Set keeps only distinct values; the spread turns it back into an array
    { $set: { brands: [...new Set(saler.brands)] } })
});
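
If you want to verify the result afterwards, a quick spot check along the same lines (assuming a shell with ES6 support, which the snippet above already relies on) could be:

// print the _id of any document whose brands array still contains duplicates
db.wholesalers.find().forEach(saler => {
  if (saler.brands.length !== new Set(saler.brands).size) {
    printjson(saler._id);
  }
});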
evilive

For relatively small data, you can achieve this with the aggregation framework by creating a new array of distinct brand values with the $setUnion operator in a $project stage. If an array contains duplicate entries, $setUnion ignores the duplicates as well as the order of the elements.

Once you have the new distinct-brands field, you will need an additional field for filtering documents across the collection, i.e. one that checks whether a given array already contains only unique elements, using the same $setUnion concept. Then use the results from the aggregation to update your collection, iterating the cursor with its forEach() method and updating each document as follows:

db.wholesalers.aggregate([
    { 
        "$project": {
            "distinctBrands": { "$setUnion": [ "$brands", [] ] },
            "hasUniqueBrands": { 
                "$eq": [
                    { "$size": "$brands" },
                    { "$size": { "$setUnion": [ "$brands", [] ] } }
                ]
            }
        }
    },
    { "$match": { "hasUniqueBrands": false } }
]).forEach(function(doc) {
    db.wholesalers.update(
        { "_id": doc._id },
        { "$set": { "brands": doc.distinctBrands } }
    )
})

Whilst this is optimal for small collections, performance with large collections is greatly reduced since looping through a large dataset and sending each update operation per request to the server incurs a computational penalty.

The Bulk() API comes to the rescue and greatly improves performance, since write operations are sent to the server in bulk. Efficiency is achieved because the method does not send every write request to the server (as the current update statement within the forEach() loop does) but only once in every 1000 requests, making updates more efficient and quicker than the approach above.


Using the same concept above with the forEach() loop to create the batches, we can update the collection in bulk as follows.

In this demonstration the Bulk() API, available in MongoDB versions >= 2.6 and < 3.2, uses the initializeUnorderedBulkOp() method to execute the write operations in the batches in parallel and in a nondeterministic order:

var bulk = db.wholesalers.initializeUnorderedBulkOp(),
    counter = 0; // counter to keep track of the batch update size

    db.wholesalers.aggregate([
        { 
            "$project": {
                "distinctBrands": { "$setUnion": [ "$brands", [] ] },
                "hasUniqueBrands": { 
                    "$eq": [
                        { "$size": "$brands" },
                        { "$size": { "$setUnion": [ "$brands", [] ] } }
                    ]
                }
            }
        },
        { "$match": { "hasUniqueBrands": false } }
    ]).forEach(function(doc) {
        bulk.find({ "_id": doc._id }).updateOne({ 
            "$set": { "brands": doc.distinctBrands } 
        });

        counter++; // increment counter
        if (counter % 1000 == 0) {
            bulk.execute(); // Execute per 1000 operations 
            // and re-initialize every 1000 update statements
            bulk = db.wholesalers.initializeUnorderedBulkOp();
        }
    });

    // execute any operations still left in the queue after the loop
    if (counter % 1000 != 0) { bulk.execute(); }

The next example applies to MongoDB version 3.2, which has since deprecated the Bulk() API and provided a newer set of APIs using bulkWrite().

It uses the same cursor as above, but creates the array of bulk operations by using the same forEach() cursor method to push each bulk-write document into the array. Because write commands can accept no more than 1000 operations, the operations need to be grouped into batches of at most 1000, re-initialising the array when the loop hits 1000 iterations:

var bulkUpdateOps = [];    
db.wholesalers.aggregate([
        { 
            "$project": {
                "distinctBrands": { "$setUnion": [ "$brands", [] ] },
                "hasUniqueBrands": { 
                    "$eq": [
                        { "$size": "$brands" },
                        { "$size": { "$setUnion": [ "$brands", [] ] } }
                    ]
                }
            }
        },
        { "$match": { "hasUniqueBrands": false } }
]).forEach(function(doc){ 
    bulkUpdateOps.push({ 
        "updateOne": {
            "filter": { "_id": doc._id },
            "update": { "$set": { "brands": doc.distinctBrands } }
         }
    });

    if (bulkUpdateOps.length === 1000) {
        db.wholesalers.bulkWrite(bulkUpdateOps);
        bulkUpdateOps = [];
    }
});         

if (bulkUpdateOps.length > 0) { db.wholesalers.bulkWrite(bulkUpdateOps); }
chridam

You can also remove duplicates by applying a JavaScript function to each document (see How to get unique array items):

function unique(arr) {
    var hash = {}, result = [];
    for ( var i = 0, l = arr.length; i < l; ++i ) {
        if ( !hash.hasOwnProperty(arr[i]) ) { 
            hash[ arr[i] ] = true;
            result.push(arr[i]);
        }
    }
    return result;
}

db.wholesalers.find({}).forEach(function(doc){
    db.wholesalers.update({"_id" : doc._id}, {$set: {"brands": unique(doc.brands)} });
})

Of course you can check whether the unique brands array has fewer items than the original brands list before updating each document, and so on (see the sketch below). But since this is a quick fix for a database problem, I would not bother with performance tuning.
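
For example, a minimal sketch of that check, reusing the unique() helper above so that only documents whose deduplicated array is actually shorter get updated:

db.wholesalers.find({}).forEach(function(doc){
    var deduped = unique(doc.brands);
    // only send an update when deduplication actually removed something
    if (deduped.length < doc.brands.length) {
        db.wholesalers.update({ "_id": doc._id }, { $set: { "brands": deduped } });
    }
})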

Sergey Berezovskiy