I have a huge collection with 3 billion documents. Each document looks like the following:

{
    "_id" : ObjectId("54c1a013715faf2cc0047c77"),
    "service_type" : "JE",
    "receiver_id" : NumberLong("865438083645"),
    "time" : ISODate("2012-12-05T23:07:36Z"),
    "duration" : 24,
    "service_description" : "NQ",
    "receiver_cell_id" : null,
    "location_id" : "658_55525",
    "caller_id" : NumberLong("475035504705")
}

I would like to get the list of distinct users (each should appear at least once as a caller, i.e. in 'caller_id'), the number of times each user appeared in the collection as either caller or receiver, and, for the users who are callers, the per-location counts (i.e., a count for each location_id per user).

I want to end up with the following:

{
    "number_of_records" : 20,
    "locations" : [ { location_id : "658_55525", count : 5 }, { location_id : "840_5425", count : 15 } ],
    "user" : NumberLong("475035504705")
}

I tried the solutions described here and here, but they are not efficient enough (extremely slow). What would be an efficient way to achieve this?

amaatouq
  • When you say 'extremely slow' how slow are you talking? How long did you let it run? – Kyle Coventry Jan 23 '15 at 19:32
  • Can you please post both the solutions you tried and which one fared better? It would be useful since you may not want to get the same thing that you have tried as an answer. – BatScream Jan 23 '15 at 19:39
  • There currently isn't enough information to guess how your resource constraints might be affecting performance. What sort of deployment (standalone/replica/sharded) and server resources (ram/disk/cpu) do you have, and what is the total size of the data? Can you also include the indexes for this collection? Lastly, what specific version of MongoDB and O/S are you using? – Stennie Feb 01 '15 at 05:00
  • Would also be useful to know whether you are looking for a one-off solution or if this is an ongoing query requirement. – Stennie Feb 01 '15 at 05:48
  • The deployment is standalone, an Ubuntu server with 32 GB RAM. The MongoDB version is 2.6.7. The size of the database is 599.661 GB and I have indexes on caller_id, receiver_id, time, and location_id. The solution is one-off. – amaatouq Feb 01 '15 at 23:50
  • I'm very surprised you haven't found a solution to this within a 24-hour runtime. – user3467349 Feb 04 '15 at 16:59

2 Answers


Use the aggregation framework for your result:

db.<collection>.aggregate([
   { $group : { _id : { user : "$caller_id", location : "$location_id" }, count : { $sum : 1 } } },
   { $project : { _id : 0, user : "$_id.user", location : "$_id.location", count : 1 } },
   { $group : { _id : "$user", locations : { $push : { location_id : "$location", count : "$count" } }, number_of_records : { $sum : "$count" } } },
   { $project : { _id : 0, user : "$_id", locations : 1, number_of_records : 1 } },
   { $out : "outputCollection" }
])

Each document in the output collection will look like:

{
    "locations" : [
        {
            "location_id" : "840_5425",
            "count" : 8
        },
        {
            "location_id" : "658_55525",
            "count" : 5
        }
    ],
    "number_of_records" : 13,
    "user" : NumberLong("475035504705")
}
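For intuition, here is a minimal plain-JavaScript sketch of what the two $group stages compute, run on a few hypothetical sample records (note that, like the pipeline above, it counts caller appearances only):

```javascript
// Hypothetical sample records with the question's field names.
const records = [
  { caller_id: 475035504705, location_id: "658_55525" },
  { caller_id: 475035504705, location_id: "840_5425" },
  { caller_id: 475035504705, location_id: "840_5425" },
  { caller_id: 865438083645, location_id: "658_55525" },
];

// Stage 1: $group on (caller_id, location_id), counting records per pair.
const byUserLoc = new Map();
for (const r of records) {
  const key = `${r.caller_id}|${r.location_id}`;
  byUserLoc.set(key, (byUserLoc.get(key) || 0) + 1);
}

// Stage 2: $group on caller_id, pushing per-location counts and summing them.
const byUser = new Map();
for (const [key, count] of byUserLoc) {
  const [user, location_id] = key.split("|");
  if (!byUser.has(user)) {
    byUser.set(user, { user, locations: [], number_of_records: 0 });
  }
  const doc = byUser.get(user);
  doc.locations.push({ location_id, count });
  doc.number_of_records += count;
}

console.log([...byUser.values()]);
```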

Update: with a collection this large, the $group stages will exceed the 100 MB memory limit for a single pipeline stage, so the aggregation has to be allowed to spill to disk with allowDiskUse: true:

var pipe = [
   { $group : { _id : { user : "$caller_id", location : "$location_id" }, count : { $sum : 1 } } },
   { $project : { _id : 0, user : "$_id.user", location : "$_id.location", count : 1 } },
   { $group : { _id : "$user", locations : { $push : { location_id : "$location", count : "$count" } }, number_of_records : { $sum : "$count" } } },
   { $project : { _id : 0, user : "$_id", locations : 1, number_of_records : 1 } },
   { $out : "outputCollection" }
];

db.runCommand(
   { aggregate: "collection",
     pipeline: pipe,
     allowDiskUse: true
   }
)

In MongoDB 2.6+ the shell helper accepts the same option directly: db.collection.aggregate(pipe, { allowDiskUse: true }).
Disposer
  • I would like to have a collection with the specified output for all the users. Your solution will have it for one user only (i.e., 475035504705), right? – amaatouq Feb 01 '15 at 23:57
  • Remove the $match line of code. But no DB can run this query fast with 3B entries. – Disposer Feb 02 '15 at 02:48
  • I will try this. It is OK if it takes a few days, as this is a one-off thing! – amaatouq Feb 02 '15 at 03:52
  • @amaatouq, if it's a one-time thing, add { $out : 'outputCollection' } after {$project ...}; it will put the result in a new collection named 'outputCollection'. – Disposer Feb 02 '15 at 08:13
  • I get the following error: "errmsg" : "exception: Exceeded memory limit for $group, but didn't allow external sort. Pass allowDiskUse:true to opt in.", "code" : 16945, "ok" : 0 – amaatouq Feb 02 '15 at 10:43
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/70076/discussion-between-amaatouq-and-disposer). – amaatouq Feb 03 '15 at 01:08
  • After running the code for a few hours I get the following error: "errmsg" : "exception: error opening file \"/home/amaatouq/mongodb_db//_tmp/extsort.707\": errno:24 Too many open files", "code" : 16818, "ok" : 0 – amaatouq Feb 03 '15 at 01:26
  • You shouldn't need to group and project twice. – user3467349 Feb 04 '15 at 17:12

A map-reduce solution may be more suitable here than an aggregation pipeline, simply because it avoids two $unwind stages. If you could come up with an aggregation solution that needs only a single $unwind, that would be the way to go. The map-reduce solution below is one way to do it, though you would need to measure its running time against your large data set and see if it works for you.

The map function:

var map = function(){
    // One emit per record: the caller as key, a single-element location list as value.
    emit(this.caller_id,
        {locs:[{"location_id":this.location_id,"count":1}]});
};

The reduce function:

var reduce = function(key,values){
    var result = {locs:[]};
    var locations = {};
    values.forEach(function(value){
        value.locs.forEach(function(loc){
            // Add loc.count rather than incrementing by 1: reduce may be called
            // on already-reduced values, whose counts can be greater than one.
            if(!locations[loc.location_id]){
                locations[loc.location_id] = loc.count;
            }
            else{
                locations[loc.location_id] += loc.count;
            }
        });
    });
    Object.keys(locations).forEach(function(k){
        result.locs.push({"location_id":k,"count":locations[k]});
    });
    return result;
};

The finalize function:

var finalize = function(key,value){
    var total = 0;
    value.locs.forEach(function(loc){
        total += loc.count;
    })
    return {"total":total,"locs":value.locs};
}

Invoking map-reduce:

db.collection.mapReduce(map,reduce,{"out":"t1","finalize":finalize});

Once the map-reduce job completes, aggregate its output collection to reshape the documents:

db.t1.aggregate([
{$project:{"_id":0,
           "number_of_records":"$value.total",
           "locations":"$value.locs","user":"$_id"}}
])

Sample output:

{
        "number_of_records" : 3,
        "locations" : [
                {
                        "location_id" : "658_55525",
                        "count" : 1
                },
                {
                        "location_id" : "658_55525213",
                        "count" : 2
                }
        ],
        "user" : 2
}
{
        "number_of_records" : 1,
        "locations" : [
                {
                        "location_id" : "658_55525",
                        "count" : 1
                }
        ],
        "user" : NumberLong("475035504705")
}

The map-reduce JavaScript code should be self-explanatory.
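To sanity-check the reduce logic outside the database, the map / reduce / finalize chain can be emulated in plain JavaScript on hypothetical records. The sketch deliberately reduces in two batches to exercise re-reduce, where reduce receives an already-reduced value:

```javascript
// Hypothetical sample records matching the sample output for user 2.
const records = [
  { caller_id: 2, location_id: "658_55525" },
  { caller_id: 2, location_id: "658_55525213" },
  { caller_id: 2, location_id: "658_55525213" },
];

// Same shape as the map function: emit(caller_id, {locs: [...]}).
function map(doc, emit) {
  emit(doc.caller_id, { locs: [{ location_id: doc.location_id, count: 1 }] });
}

// Same logic as the reduce function: merge per-location counts.
function reduce(key, values) {
  const locations = {};
  for (const value of values) {
    for (const loc of value.locs) {
      locations[loc.location_id] = (locations[loc.location_id] || 0) + loc.count;
    }
  }
  return { locs: Object.keys(locations).map(k => ({ location_id: k, count: locations[k] })) };
}

// Same logic as the finalize function: total up the location counts.
function finalize(key, value) {
  const total = value.locs.reduce((sum, loc) => sum + loc.count, 0);
  return { total: total, locs: value.locs };
}

// Collect emitted values per key, reduce in two batches (re-reduce), finalize.
const emitted = new Map();
records.forEach(doc => map(doc, (k, v) => {
  if (!emitted.has(k)) emitted.set(k, []);
  emitted.get(k).push(v);
}));

const results = {};
for (const [key, values] of emitted) {
  const partial = reduce(key, values.slice(0, 2));          // first batch
  const full = reduce(key, [partial, ...values.slice(2)]);  // re-reduce
  results[key] = finalize(key, full);
}

console.log(results);
```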

BatScream
  • Thanks @BatScream for your answer. However, I have been running this code for 5 days and it is still running. The method seems to be very slow! – amaatouq Jan 29 '15 at 18:01