MongoDB: Removing duplicate document based on ObjectId?

Question

This is really an open question. I am sorry if this goes little vague but I am trying to collect thoughts from other people since I am very new to Mongo

Situation

I realized that my collection has multiple duplicate documents (based on name key)
These documents may be same or might got changed during the subsequent dumps from file(we want to keep later changes)
Since there is no insert date, it will be hard to tell looking at document which one is latest (bad schema design)

Wanted

To remove the documents which were inserted earlier
I read that each document in collection is assigned an ObjectId(here) that makes document unique

Question

Is it possible to know which document is inserted earlier based on ObjectId and remove it using Map Reduce?
Any other thoughts and advices?

Given a document, how can you tell if it's a version of another document or a completely new document? — Sergio Tulentsev, Aug 01 '12 at 18:30
I can't , all I can say that it is duplicate if there is more that one document with same `name` key — daydreamer, Aug 01 '12 at 18:31

Sergio Tulentsev · Answer 1 · 2012-08-01T19:00:26.207

I'm bored this evening, so here we go.

Step 1. Let's prepare our test data.

> db.users.insert({name: 'John', other_field: Math.random()})
> db.users.insert({name: 'Bob', other_field: Math.random()})
> db.users.insert({name: 'Mary', other_field: Math.random()})
> db.users.insert({name: 'John', other_field: Math.random()})
> db.users.insert({name: 'Jeff', other_field: Math.random()})
> db.users.insert({name: 'Ivan', other_field: Math.random()})
> db.users.insert({name: 'Mary', other_field: Math.random()})
> db.users.find()
{
    "_id" : ObjectId("501976e9bee9b253265bba8b"),
    "name" : "John",
    "other_field" : 0.9884713875252772
}
{
    "_id" : ObjectId("501976e9bee9b253265bba8c"),
    "name" : "Bob",
    "other_field" : 0.048004131996396415
}
{
    "_id" : ObjectId("501976e9bee9b253265bba8d"),
    "name" : "Mary",
    "other_field" : 0.20415803582615222
}
{
    "_id" : ObjectId("501976e9bee9b253265bba8e"),
    "name" : "John",
    "other_field" : 0.5514446987265585
}
{
    "_id" : ObjectId("501976e9bee9b253265bba8f"),
    "name" : "Jeff",
    "other_field" : 0.8685077449753242
}
{
    "_id" : ObjectId("501976e9bee9b253265bba90"),
    "name" : "Ivan",
    "other_field" : 0.2842514340422925
}
{
    "_id" : ObjectId("501976eabee9b253265bba91"),
    "name" : "Mary",
    "other_field" : 0.984048520281136
}

Step 2. The map-reduce

var map = function() {
  emit(this.name, this);
};

var reduce = function(name, vals) {
  var last_obj = null;
  vals.forEach(function(v) {
    if(!last_obj || v._id > last_obj._id) {
      last_obj = v;
    }
  });
  return last_obj;
};

db.users.mapReduce(map, reduce, {out: 'temp_coll'})

It basically groups all documents by name and then selects the one with the largest _id.

Step 3. Do something with unique data.

> db.temp_coll.find()
{
    "_id" : "Bob",
    "value" : {
        "_id" : ObjectId("501976e9bee9b253265bba8c"),
        "name" : "Bob",
        "other_field" : 0.048004131996396415
    }
}
{
    "_id" : "Ivan",
    "value" : {
        "_id" : ObjectId("501976e9bee9b253265bba90"),
        "name" : "Ivan",
        "other_field" : 0.2842514340422925
    }
}
{
    "_id" : "Jeff",
    "value" : {
        "_id" : ObjectId("501976e9bee9b253265bba8f"),
        "name" : "Jeff",
        "other_field" : 0.8685077449753242
    }
}
{
    "_id" : "John",
    "value" : {
        "_id" : ObjectId("501976e9bee9b253265bba8e"),
        "name" : "John",
        "other_field" : 0.5514446987265585
    }
}
{
    "_id" : "Mary",
    "value" : {
        "_id" : ObjectId("501976eabee9b253265bba91"),
        "name" : "Mary",
        "other_field" : 0.984048520281136
    }
}

For example, drop the original collection, iterate this one and insert values into new collection. Don't forget to drop the temp collection when you're done.

Important

I didn't bother with extraction of a timestamp from objectid, because I assumed that you run your import jobs not twice a second (not even every second, maybe).

Question: will _id will always increase? given Object Id consists of time, machine, pid, inc, all except machine will increase, but machine? — daydreamer, Aug 01 '12 at 18:54
@daydreamer: Unless you mess with your system clock, then yes, it's guaranteed to be monotonically increasing. — Sergio Tulentsev, Aug 01 '12 at 18:55
That sounds good, is there any reference documentation which asserts that ? I would try using your technique and will get back to you, thanks much for your help — daydreamer, Aug 01 '12 at 18:56
Look at @Prasith's answer. ObjectId contains timestamp, which occupies the highest 4 bytes. And [here's the official doc](http://www.mongodb.org/display/DOCS/Object+IDs) — Sergio Tulentsev, Aug 01 '12 at 18:57

score 0 · Answer 2 · answered Aug 01 '12 at 18:40

Ok since object id uses timestamp as it's leading four bytes you can do this with a bit of math.

Thankfully the mongo shell has a way to get the timestamp from an object id by you will need to do some more javascript to first query your documents with the same name then store them in a temp variable (if using the command line) or in a temp table (if using drivers) and parse each individual id's using the timestamp getter that's shown in the link below.

http://www.mongodb.org/display/DOCS/Optimizing+Object+IDs#OptimizingObjectIDs-Extractinsertiontimesfromidratherthanhavingaseparatetimestampfield.

Remember that object id's are only accurate to the second so this still doesn't help in rapid insertion mode.

But either way what you are asking for is doable either in a map reduce function or in the way shown above which does it through the command line.

Give that a shot and if you get stuck let me know. If i know your collection structure i can probably whip up something real quick but only after you bang your head on it a couple of times :)

Just realized this poster also does it using Mongo's extended JSON support. http://stackoverflow.com/questions/10552621/mongodb-extract-timestamp-from-objectid-in-json-query?rq=1 I still think my way is easier but this is also an elegant solution. — Prasith Govin, Aug 01 '12 at 18:42
Is it necessary to extract the date from the ObjectID? Why not just compare the ObjectIDs directly (or sort your query on this _id field)? The later the insertion time, the greater the ObjectID. — Jenna, Aug 01 '12 at 18:48