mongodb count and remove duplicate values

Question

I have a large MongoDB collection with a lot of duplicate inserts, like this:

{ "_id" : 1, "val" : "222222", "val2" : "37"}
{ "_id" : 2, "val" : "222222", "val2" : "37" }
{ "_id" : 3, "val" : "222222", "val2" : "37" }
{ "_id" : 4, "val" : "333333", "val2" : "66" }
{ "_id" : 5, "val" : "111111", "val2" : "22" }
{ "_id" : 6, "val" : "111111", "val2" : "22"  }
{ "_id" : 7, "val" : "111111", "val2" : "22"  }
{ "_id" : 8, "val" : "111111", "val2" : "22"  }

I want to count the duplicates of each document and keep only one unique entry per group, with the count stored in the database, like this:

{ "_id" : 1, "val" : "222222", "val2" : "37", "count" : "3"}
{ "_id" : 2, "val" : "333333", "val2" : "66", "count" : "1"}
{ "_id" : 3, "val" : "111111", "val2" : "22", "count" : "4" }

I have already looked at MapReduce and the aggregation framework, but they never output the full document back and only do one calculation for the full collection.

It would be good to save the new data to a new collection.

possible duplicate of [Find all duplicate documents in a MongoDB collection by a key field](http://stackoverflow.com/questions/9491920/find-all-duplicate-documents-in-a-mongodb-collection-by-a-key-field) — Christian P, Jun 04 '14 at 13:25

score 2 · Accepted Answer · answered Jun 04 '14 at 14:19

If you use MongoDB 2.6, here is an example with the aggregation framework:

db.duplicate.aggregate([
    {$group: {_id: "$val", count: {$sum: 1}}},
    {$project: {_id: 0, val: "$_id", count: 1}},
    {$out: "deduplicate"}
])

$group groups the documents by val and counts the documents in each group.
$project renames the _id field to val and hides _id.
$out writes the result to a new collection (named deduplicate here).

Hope it fits your case.
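
To carry val2 into the output, as asked in the comments below, one option is to group on both fields. The following is a minimal sketch, not tested against a live server: it assumes val and val2 are always duplicated together, and it reuses the collection names duplicate and deduplicate from the example above. The variable name pipeline is only illustrative.

```javascript
// Sketch: group on both fields so that val2 survives into the output.
const pipeline = [
  // Compound _id: one group per distinct (val, val2) pair.
  { $group: { _id: { val: "$val", val2: "$val2" }, count: { $sum: 1 } } },
  // Flatten the compound key back into top-level fields and hide _id.
  { $project: { _id: 0, val: "$_id.val", val2: "$_id.val2", count: 1 } },
  // Write the result to a new collection.
  { $out: "deduplicate" }
];

// In the mongo shell (not runnable outside it):
// db.duplicate.aggregate(pipeline)
```

Note that count comes out as a number rather than the string shown in the question. A field that is not part of the grouping key could instead be carried with an accumulator such as $first or $last inside $group.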

kranteg


OK, but how can I pass other values directly into the new document? – user3664804 Jun 04 '14 at 15:28
Other values? Give me an example, it will be easier. – kranteg Jun 04 '14 at 15:30
Well, let's say I have a value "val2" next to "val"; how can I pass it to the new document? – user3664804 Jun 04 '14 at 15:50
Is val2 also duplicated (like val) or not? The question is not very clear; it is easier to work with a real data set. – kranteg Jun 04 '14 at 15:55
Yes, it is duplicated the same way as val. I updated my post to make things clear. – user3664804 Jun 04 '14 at 16:08
You just have to change your group and project criteria, and it should work. I will test it and update the answer when I'm free. – kranteg Jun 04 '14 at 16:26
Can you post your solution with $last as a comment? I will add it to the answer alongside my solution. – kranteg Jun 06 '14 at 15:16

Malcolm Murdoch · Answer 2 · 2014-06-05T13:03:52.650

This might be easier with an incremental map-reduce:

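var mapper = function () {
    // Key on both fields so that each distinct (val, val2) pair gets its own count.
    emit({'val1': this.val, 'val2': this.val2}, {'count': 1});
};
var reducer = function (k, v) {
    // Sum the partial counts emitted for this key.
    var counter = 0;
    for (var i = 0; i < v.length; i++) {
        counter += v[i].count;
    }
    return {'count': counter};
};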

Then, in the shell, run:

db.bigcollection.mapReduce(mapper, reducer, {out: {reduce: 'reducedcollection'}})

This should result in a new collection called reducedcollection. Your values will be stored in _id and the count in value.count. Note the use of two values as the key in the new collection. If you want to find a specific entry, you can do:

db.reducedcollection.findOne({'_id.val1': '333333', '_id.val2': '66'})

The interesting thing is that you can now drop the old collection and, as new data comes in, map-reduce it on top of reducedcollection to increment the counts.
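
To make the incremental behaviour concrete, here is a plain-JavaScript simulation of what the reduce output mode does conceptually. It is only a sketch and does not talk to MongoDB: runMapReduce and store are hypothetical names, store stands in for reducedcollection, and the mapper takes emit as a parameter because there is no shell-provided emit here.

```javascript
// Plain-JavaScript simulation of mapReduce with out: {reduce: ...}.
// No MongoDB involved; runMapReduce and store are hypothetical names.

function mapper(doc, emit) {
  // Same key and value shape as the mapper in the answer.
  emit({ val1: doc.val, val2: doc.val2 }, { count: 1 });
}

function reducer(key, values) {
  // Sum the partial counts for one key.
  let counter = 0;
  for (const v of values) counter += v.count;
  return { count: counter };
}

// store maps a serialized key to { _id, value }, like the output collection.
function runMapReduce(docs, store) {
  // Map phase: collect emitted values per key.
  const groups = new Map();
  for (const doc of docs) {
    mapper(doc, (key, value) => {
      const k = JSON.stringify(key);
      if (!groups.has(k)) groups.set(k, { key: key, values: [] });
      groups.get(k).values.push(value);
    });
  }
  // Reduce phase: an existing result for the same key is fed back
  // through the reducer together with the freshly reduced value.
  for (const [k, group] of groups) {
    const fresh = reducer(group.key, group.values);
    const existing = store.get(k);
    const merged = existing ? reducer(group.key, [existing.value, fresh]) : fresh;
    store.set(k, { _id: group.key, value: merged });
  }
  return store;
}

const store = new Map();

// First batch: the eight documents from the question.
runMapReduce([
  { val: "222222", val2: "37" }, { val: "222222", val2: "37" },
  { val: "222222", val2: "37" }, { val: "333333", val2: "66" },
  { val: "111111", val2: "22" }, { val: "111111", val2: "22" },
  { val: "111111", val2: "22" }, { val: "111111", val2: "22" }
], store);

// Later batch: two more duplicates arrive and are merged on top.
runMapReduce([
  { val: "333333", val2: "66" }, { val: "333333", val2: "66" }
], store);

for (const entry of store.values()) console.log(JSON.stringify(entry));
```

After the second batch the count for 333333/66 goes from 1 to 3, while the other two keys keep their counts of 3 and 4. This only works because the reducer returns the same shape it consumes, so its own output can be reduced again.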

Might be handy?

Thanks, this is great, but how can I pass my other values directly to the new document here? It only saves val and count. — user3664804, Jun 04 '14 at 20:30
