I have a large MongoDB collection with many duplicate inserts, like this:

{ "_id" : 1, "val" : "222222", "val2" : "37"}
{ "_id" : 2, "val" : "222222", "val2" : "37" }
{ "_id" : 3, "val" : "222222", "val2" : "37" }
{ "_id" : 4, "val" : "333333", "val2" : "66" }
{ "_id" : 5, "val" : "111111", "val2" : "22" }
{ "_id" : 6, "val" : "111111", "val2" : "22"  }
{ "_id" : 7, "val" : "111111", "val2" : "22"  }
{ "_id" : 8, "val" : "111111", "val2" : "22"  }

I want to count the duplicates of each entry and keep only one unique document per entry, with the count stored in the DB, like this:

{ "_id" : 1, "val" : "222222", "val2" : "37", "count" : "3" }
{ "_id" : 2, "val" : "333333", "val2" : "66", "count" : "1" }
{ "_id" : 3, "val" : "111111", "val2" : "22", "count" : "4" }

I have already looked at MapReduce and the aggregation framework, but they never seem to return the full document and only produce a single calculation for the whole collection.

Ideally the new data would be saved to a new collection.

  • possible duplicate of [Find all duplicate documents in a MongoDB collection by a key field](http://stackoverflow.com/questions/9491920/find-all-duplicate-documents-in-a-mongodb-collection-by-a-key-field) – Christian P Jun 04 '14 at 13:25
  • It would help if you showed us what you have tried. – kranteg Jun 04 '14 at 14:15

2 Answers

If you are using MongoDB 2.6, here is an example with the aggregation framework:

db.duplicate.aggregate([
    {$group: {_id: "$val", count: {$sum: 1}}},
    {$project: {_id: 0, val: "$_id", count: 1}},
    {$out: "deduplicate"}
])
  1. $group groups the documents by val and counts the documents in each group

  2. $project renames the _id field back to val and suppresses _id in the output

  3. $out writes the result to a new collection (here named deduplicate)
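The pipeline above groups on val only, so val2 does not reach the output. A minimal sketch of a variant that keeps both fields is to group on a compound key and project both parts back out. The collection name duplicate and the output name deduplicate are taken from the example above; the variable name pipeline is just an illustration.

```javascript
// Sketch: group on both val and val2 so that neither field is lost.
var pipeline = [
    // 1. Compound _id: documents are duplicates only if val AND val2 match.
    {$group: {_id: {val: "$val", val2: "$val2"}, count: {$sum: 1}}},
    // 2. Flatten the compound key back into top-level fields.
    {$project: {_id: 0, val: "$_id.val", val2: "$_id.val2", count: 1}},
    // 3. Write the result to a new collection.
    {$out: "deduplicate"}
];

// In the mongo shell this would be run as:
//   db.duplicate.aggregate(pipeline)
```

Because _id is suppressed in the $project stage, the documents written by $out should receive freshly generated _id values rather than the original ones.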

Hope it fits your case.

kranteg

This might be easier with an incremental map-reduce:

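mapper = function () {
    // key on both fields so documents count as duplicates only when both match
    emit({'val1': this.val, 'val2': this.val2}, {'count': 1});
};
reducer = function (k, v) {
    // sum the partial counts; the result has the same shape as the emitted value
    var counter = 0;
    for (var i = 0; i < v.length; i++) {
        counter += v[i].count;
    }
    return {'count': counter};
};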

Then, in the shell, run:

db.bigcollection.mapReduce(mapper, reducer, {out: {reduce: 'reducedcollection'}})

This should produce a new collection called reducedcollection. Each val/val2 pair becomes the _id of an output document, and the count is stored under value.count. Note the use of two values as the key in the new collection. To find a specific entry you can do:

db.reducedcollection.findOne({'_id.val1': '333333', '_id.val2': '66'})
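To see the shape of the output documents without a server, the map and reduce functions can be exercised in plain JavaScript on the sample data from the question. This is only a rough simulation: the emit function and the groups object below are hypothetical stand-ins for what MongoDB does internally, not part of any MongoDB API.

```javascript
// Sample documents from the question.
var docs = [
    {_id: 1, val: "222222", val2: "37"},
    {_id: 2, val: "222222", val2: "37"},
    {_id: 3, val: "222222", val2: "37"},
    {_id: 4, val: "333333", val2: "66"},
    {_id: 5, val: "111111", val2: "22"},
    {_id: 6, val: "111111", val2: "22"},
    {_id: 7, val: "111111", val2: "22"},
    {_id: 8, val: "111111", val2: "22"}
];

// Simulation only: collect emitted values per key, as the server would.
var groups = {};
function emit(key, value) {
    var id = JSON.stringify(key);
    if (!groups[id]) groups[id] = {key: key, values: []};
    groups[id].values.push(value);
}

function mapper() {
    emit({val1: this.val, val2: this.val2}, {count: 1});
}

function reducer(key, values) {
    var counter = 0;
    for (var i = 0; i < values.length; i++) counter += values[i].count;
    return {count: counter};
}

// Run the map function with each document bound to `this`.
docs.forEach(function (doc) { mapper.call(doc); });

// Output documents take the form {_id: <key>, value: <reduced value>}.
var results = Object.keys(groups).map(function (id) {
    var g = groups[id];
    return {_id: g.key, value: reducer(g.key, g.values)};
});

console.log(JSON.stringify(results));
```

The key ends up under _id and the count under value.count, which is why lookups on the reduced collection need dotted paths into _id.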

The interesting part is that you can now drop the old collection and, as new data comes in, map-reduce it on top of reducedcollection to increment the counts.
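The incremental step works because, with the reduce output action, the reduce function is applied to the value already stored for a key together with the newly computed value. A small sketch of that merge in plain JavaScript, using made-up counts (stored and fromNewBatch are hypothetical values, not real output):

```javascript
function reducer(key, values) {
    var counter = 0;
    for (var i = 0; i < values.length; i++) counter += values[i].count;
    return {count: counter};
}

// Hypothetical: the count already stored for a key, and the count
// produced by map-reducing a new batch of documents with the same key.
var stored = {count: 3};
var fromNewBatch = {count: 2};

// The same reducer merges both, so the stored count grows to 5.
var merged = reducer({val1: "222222", val2: "37"}, [stored, fromNewBatch]);
console.log(merged.count);
```

This only works if the reducer returns a value of the same shape as the values it receives, which is the case here since both are objects with a single count field.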

Might be handy?

Malcolm Murdoch
  • Thanks, this is great, but how can I pass my other values directly to the new document here? It only saves val and count. – user3664804 Jun 04 '14 at 20:30