For relatively small data, you can achieve the above by using the aggregation framework where you create a new array with the distinct brand values by using the $setUnion
operator in the $project
stage. If an array contains duplicate entries, $setUnion
ignores the duplicate entries as well as the order of the elements.
Once you get the new distinct brands field, you will need an additional field for filtering documents in the entire collection i.e it checks whether a given array has unique elements using the same concept with the $setUnion
operator. Use the results array from the aggregated documents to
update your collection by iterating the results cursor using its forEach()
method and updating each document as follows:
db.wholesalers.aggregate([
{
"$project": {
"distinctBrands": { "$setUnion": [ "$brands", [] ] },
"hasUniqueBrands": {
"$eq": [
{ "$size": "$brands" },
{ "$size": { "$setUnion": [ "$brands", [] ] } }
]
}
}
},
{ "$match": { "hasUniqueBrands": false } }
]).forEach(function(doc) {
db.wholesalers.update(
{ "_id": doc._id },
{ "$set": { "brands": doc.distinctBrands } }
)
})
Whilst this is optimal for small collections, performance with large collections is greatly reduced since looping through a large dataset and sending each update operation per request to the server incurs a computational penalty.
The Bulk()
API comes to the rescue and greatly improves performance since write operations are sent to the server only once in bulk. Efficiency is achieved since the method does not send every write request to the server (as with the current update statement within the forEach()
loop) but just once in every 1000 requests, thus making updates more efficient and quicker than currently is.
Using the same concept above with the forEach()
loop to create the batches, we can update the collection in bulk as follows.
In this demonstration the Bulk()
API available in MongoDB versions >= 2.6 and < 3.2
uses the initializeUnorderedBulkOp()
method to execute in parallel, as well as in a nondeterministic order, the write operations in the batches:
var bulk = db.wholesalers.initializeUnorderedBulkOp(),
counter = 0; // counter to keep track of the batch update size
db.wholesalers.aggregate([
{
"$project": {
"distinctBrands": { "$setUnion": [ "$brands", [] ] },
"hasUniqueBrands": {
"$eq": [
{ "$size": "$brands" },
{ "$size": { "$setUnion": [ "$brands", [] ] } }
]
}
}
},
{ "$match": { "hasUniqueBrands": false } }
]).forEach(function(doc) {
bulk.find({ "_id": doc._id }).updateOne({
"$set": { "brands": doc.distinctBrands }
});
counter++; // increment counter
if (counter % 1000 == 0) {
bulk.execute(); // Execute per 1000 operations
// and re-initialize every 1000 update statements
bulk = db.wholesalers.initializeUnorderedBulkOp();
}
});
The next example applies to the new MongoDB version 3.2
which has since deprecated the Bulk()
API and provided a newer set of apis using bulkWrite()
.
It uses the same cursors as above but creates the arrays with the bulk operations using the same forEach()
cursor method to push each bulk write document to the array. Because write commands can accept no more than 1000 operations, there's need to group operations to have at most 1000 operations and re-intialise the array when the loop hits the 1000 iteration:
var bulkUpdateOps = [];
db.wholesalers.aggregate([
{
"$project": {
"distinctBrands": { "$setUnion": [ "$brands", [] ] },
"hasUniqueBrands": {
"$eq": [
{ "$size": "$brands" },
{ "$size": { "$setUnion": [ "$brands", [] ] } }
]
}
}
},
{ "$match": { "hasUniqueBrands": false } }
]).forEach(function(doc){
bulkUpdateOps.push({
"updateOne": {
"filter": { "_id": doc._id },
"update": { "$set": { "brands": doc.distinctBrands } }
}
});
if (bulkUpdateOps.length === 1000) {
db.wholesalers.bulkWrite(bulkUpdateOps);
bulkUpdateOps = [];
}
});
if (bulkUpdateOps.length > 0) { db.wholesalers.bulkWrite(bulkUpdateOps); }