
I am learning MongoDB using Python with Tornado. I have a MongoDB collection; when I do

db.cal.find()

{     
    "Pid" : "5652f92761be0b14889d9854",
    "Registration" : "TN 56 HD 6766",
    "Vid" : "56543ed261be0b0a60a896c9",
    "Period" : "10-2015",
    "AOs": [
        "14-10-2015",
        "15-10-2015",
        "18-10-2015",
        "14-10-2015",
        "15-10-2015",
        "18-10-2015"
    ],
    "Booked": [
        "5-10-2015",
        "7-10-2015",
        "8-10-2015",
        "5-10-2015",
        "7-10-2015",
        "8-10-2015"
    ],
    "NA": [
        "1-10-2015",
        "2-10-2015",
        "3-10-2015",
        "4-10-2015",
        "1-10-2015",
        "2-10-2015",
        "3-10-2015",
        "4-10-2015"
    ],
    "AOr": [
        "23-10-2015",
        "27-10-2015",
        "23-10-2015",
        "27-10-2015"
    ]
}

I need an operation to remove the duplicate values from the Booked, NA, AOs, and AOr arrays. The final result should be

{
    "Pid" : "5652f92761be0b14889d9854",
    "Registration" : "TN 56 HD 6766",
    "Vid" : "56543ed261be0b0a60a896c9",
    "AOs": [
        "14-10-2015",
        "15-10-2015",
        "18-10-2015"
    ],
    "Booked": [
        "5-10-2015",
        "7-10-2015",
        "8-10-2015"
    ],
    "NA": [
        "1-10-2015",
        "2-10-2015",
        "3-10-2015",
        "4-10-2015"
    ],
    "AOr": [
        "23-10-2015",
        "27-10-2015"
    ]
}

How do I achieve this in MongoDB?

– Maari
4 Answers


Working solution

I have created a working solution in JavaScript, which runs in the mongo shell:

var codes = ["AOs", "Booked", "NA", "AOr"]

// Use bulk operations for efficiency
var bulk = db.dupes.initializeUnorderedBulkOp()

db.dupes.find().forEach(
  function(doc) {

    // Needed to prevent unnecessary operations
    var changed = false
    codes.forEach(
      function(code) {
        var values = doc[code]
        var uniq = []

        for (var i = 0; i < values.length; i++) {
          // If the current value is not yet in "uniq",
          // it has not been seen before, so keep it
          if (uniq.indexOf(values[i]) == -1 ){
            uniq.push(values[i])
          }
        }

        doc[code] = uniq

        if (uniq.length < values.length) {
          changed = true
        }

      }
    )

    // Update the document only if something was changed
    if (changed) {
      bulk.find({"_id":doc._id}).updateOne(doc)
    }
  }
)

// Apply all changes
bulk.execute()

Resulting document with your sample input:

replset:PRIMARY> db.dupes.find().pretty()
{
  "_id" : ObjectId("567931aefefcd72d0523777b"),
  "Pid" : "5652f92761be0b14889d9854",
  "Registration" : "TN 56 HD 6766",
  "Vid" : "56543ed261be0b0a60a896c9",
  "Period" : "10-2015",
  "AOs" : [
    "14-10-2015",
    "15-10-2015",
    "18-10-2015"
  ],
  "Booked" : [
    "5-10-2015",
    "7-10-2015",
    "8-10-2015"
  ],
  "NA" : [
    "1-10-2015",
    "2-10-2015",
    "3-10-2015",
    "4-10-2015"
  ],
  "AOr" : [
    "23-10-2015",
    "27-10-2015"
  ]
}

Using indices with dropDups

This simply does not work. First, as of version 3.0, this option no longer exists. And since 3.2 is already released, we should find a portable way anyway.

Second, even with dropDups, the documentation clearly states that:

dropDups boolean : MongoDB indexes only the first occurrence of a key and removes all documents from the collection that contain subsequent occurrences of that key.

So if another document had the same value in one of the billing-code fields as an earlier one, that whole document would be deleted. For example, if two different documents both contained "23-10-2015" in AOr, a unique index on AOr with dropDups would keep the first document and silently delete the second.

– Markus W Mahlberg
  • You could just use one of the methods shown in [Remove Duplicates from JavaScript Array](http://stackoverflow.com/questions/9229645/remove-duplicates-from-javascript-array) to remove the duplicates from those arrays and then use the `$set` operator with bulk operations to update the documents. Also note that MongoDB 3.2 deprecates Bulk() and its associated methods. – styvane Dec 27 '15 at 06:57
  • Neither jQuery nor ECMAScript 6 is available in the shell, right? ;) I fail to see where identifying uniques is inferior. But good point about 3.2, I'll add the solution with that, too. – Markus W Mahlberg Dec 27 '15 at 08:04

First, you can't use the "dropDups" syntax here because it was deprecated as of MongoDB 2.6 and removed in MongoDB 3.0, so it will not work.

To remove the duplicates from each list you can use Python's built-in set type.

import pymongo


fields = ['Booked', 'NA', 'AOs', 'AOr']
client = pymongo.MongoClient()
db = client.test
collection = db.cal
bulk = collection.initialize_ordered_bulk_op()
count = 0
for document in collection.find():
    # Deduplicate each array field with set()
    update = dict(zip(fields, [list(set(document[field])) for field in fields]))
    bulk.find({'_id': document['_id']}).update_one({'$set': update})
    count = count + 1
    # Execute in batches of 200 operations
    if count % 200 == 0:
        bulk.execute()
        bulk = collection.initialize_ordered_bulk_op()

# Execute the remaining operations, if any
# (an empty bulk would raise InvalidOperation)
if count % 200 != 0:
    bulk.execute()
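Note that set() does not preserve the original order of the elements, which is why the dates come back shuffled in the result shown further below. If the order matters, a small order-preserving helper (an illustrative addition, not part of the original answer) can replace list(set(...)):

def dedupe_keep_order(values):
    """Return values with duplicates removed, keeping first-seen order."""
    seen = set()
    uniq = []
    for value in values:
        if value not in seen:
            seen.add(value)
            uniq.append(value)
    return uniq

print(dedupe_keep_order(["23-10-2015", "27-10-2015", "23-10-2015"]))
# ['23-10-2015', '27-10-2015']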

MongoDB 3.2 deprecates Bulk() and its associated methods and provides the .bulkWrite() method instead. This method is available from PyMongo 3.2 as bulk_write(). The first thing to do when using this method is to import the UpdateOne class.

from pymongo import UpdateOne


requests = []  # list of write operations
for document in collection.find():
    # Deduplicate each array field with set(), as above
    update = dict(zip(fields, [list(set(document[field])) for field in fields]))
    requests.append(UpdateOne({'_id': document['_id']}, {'$set': update}))

if requests:  # bulk_write raises InvalidOperation on an empty list
    collection.bulk_write(requests)

Both approaches give the same, expected result:

{'AOr': ['27-10-2015', '23-10-2015'],
 'AOs': ['15-10-2015', '14-10-2015', '18-10-2015'],
 'Booked': ['7-10-2015', '5-10-2015', '8-10-2015'],
 'NA': ['1-10-2015', '4-10-2015', '3-10-2015', '2-10-2015'],
 'Period': '10-2015',
 'Pid': '5652f92761be0b14889d9854',
 'Registration': 'TN 56 HD 6766',
 'Vid': '56543ed261be0b0a60a896c9',
 '_id': ObjectId('567f808fc6e11b467e59330f')}
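Since the question mentions Tornado, here is a minimal sketch of the same $set-based dedupe using Motor, the asynchronous MongoDB driver for Tornado (an addition for illustration, assuming a Motor version that wraps PyMongo 3; database and collection names are taken from the question):

from motor.motor_tornado import MotorClient
from pymongo import UpdateOne
from tornado import gen, ioloop

fields = ['Booked', 'NA', 'AOs', 'AOr']

@gen.coroutine
def remove_duplicates():
    collection = MotorClient().test.cal
    requests = []
    cursor = collection.find()
    # Iterate the cursor asynchronously
    while (yield cursor.fetch_next):
        document = cursor.next_object()
        update = {field: list(set(document[field])) for field in fields}
        requests.append(UpdateOne({'_id': document['_id']}, {'$set': update}))
    if requests:
        yield collection.bulk_write(requests)

ioloop.IOLoop.current().run_sync(remove_duplicates)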
– styvane

Have you tried distinct()?

Link: https://docs.mongodb.org/v3.0/reference/method/db.collection.distinct/

Specify Query with distinct

The following example returns the distinct values for the field sku, embedded in the item field, from the documents whose dept is equal to "A":

db.inventory.distinct( "item.sku", { dept: "A" } )

The method returns the following array of distinct sku values:

[ "111", "333" ]

Assuming that you want to remove duplicate dates from the collection, you can add a unique index with the dropDups: true option:

db.bill_codes.ensureIndex({"fieldName":1}, {unique: true, dropDups: true}) 

For reference, see db.collection.ensureIndex() in the MongoDB Manual 3.0.

Note: Back up your database first in case it doesn't do exactly as you're expecting.

– thepaks
  • That would only drop other documents which have the exact same value for one of the fields. – Markus W Mahlberg Dec 22 '15 at 11:22
  • I get the error: { "createdCollectionAutomatically" : false, "numIndexesBefore" : 1, "errmsg" : "exception: bad index key pattern { Registration: \"TN 56 HD 6766\", Pid: \"5652f92761be0b14889d9854\" }: Unknown index plugin 'TN 56 HD 676 '", "code" : 67, "ok" : 0 } – Maari Dec 22 '15 at 11:34
  • You have to pass the field names as the index key, not the field values from your documents. – thepaks Dec 22 '15 at 11:37
  • This is not only outdated, it is plainly wrong and would be outright dangerous if not for the suggestion to take a backup. The deprecated `dropDups` drops all *documents* which happen to have the same values for a key in the index; it does not remove duplicate values. – Markus W Mahlberg Dec 22 '15 at 12:15