3

Lets say I have stream data from the Twitter API, and I have the data stored as documents in the MongoDB. What I'm trying to find is the count of screen_name under entities.user_mentions.

{
    "_id" : ObjectId("50657d5844956d06fb5b36c7"),
    "contributors" : null,
    "text" : "",
    "entities" : {
        "urls" : [ ],
        "hashtags" : [
            {
                "text" : "",
                "indices" : [
                    26,
                    30
                ]
            },
            {
                "text" : "",
                "indices" : []
            }
        ],
        "user_mentions" : [ 
                {
                    "name":"Twitter API", 
                    "indices":[4,15], 
                    "screen_name":"twitterapi", 
                    "id":6253282, "id_str":"6253282"
                }]
    },
    ...

I have attempted to use map reduce:

map = function() {
    if (!this.entities.user_mentions.screen_name) {
        return;
    }

    for (index in this.entities.user_mentions.screen_name) {
        emit(this.entities.user_mentions.screen_name[index], 1);
    }
}

reduce = function(previous, current) {
    var count = 0;

    for (index in current) {
        count += current[index];
    }

    return count;
}

result = db.runCommand({
    "mapreduce" : "twitter_sample",
    "map" : map,
    "reduce" : reduce,
    "out" : "user_mentions"
});

But its not quite working...

chutsu
  • 13,612
  • 19
  • 65
  • 86

1 Answers1

4

Since entities.user_mentions is an array, you want to emit a value for each screen_name in the map():

var map = function() {
    this.entities.user_mentions.forEach(function(mention) {
        emit(mention.screen_name, { count: 1 });
    })
};

Then count the values by unique screen_name in the reduce():

var reduce = function(key, values) {
    // NB: reduce() uses same format as results emitted by map()
    var result = { count: 0 };

    values.forEach(function(value) {
        result.count += value.count;
    });

    return result;
};

Note: to debug your map/reduce JavaScript functions, you can use print() and printjson() commands. The output will appear in your mongod log.

EDIT: For comparison, here is an example using the new Aggregation Framework in MongoDB 2.2:

db.twitter_sample.aggregate(
    // Project to limit the document fields included
    { $project: {
        _id: 0,
        "entities.user_mentions" : 1
    }},

    // Split user_mentions array into a stream of documents
    { $unwind: "$entities.user_mentions" },

    // Group and count the unique mentions by screen_name
    { $group : {
        _id: "$entities.user_mentions.screen_name",
        count: { $sum : 1 }
    }},

    // Optional: sort by count, descending
    { $sort : {
        "count" : -1
    }}
)

The original Map/Reduce approach is best suited for a large data set, as is implied with Twitter data. For a comparison of Map/Reduce vs Aggregation Framework limitations see the related discussion on the StackOverflow question MongoDB group(), $group and MapReduce.

Community
  • 1
  • 1
Stennie
  • 63,885
  • 14
  • 149
  • 175
  • Hats off to you sir. I was wondering can I do the same using the Aggregate Framework? `group()` function? – chutsu Sep 30 '12 at 09:41
  • 1
    @chutsu: You can do the same with the Aggregation Framework, but there are a few caveats such as the results being limited to inline at the moment, and to the maximum document size of 16Mb. It may also be possible to use the `group()` command, but this also has some limitations. For a comparison of the approaches and limitations see: [MongoDB group(), $group and MapReduce](http://stackoverflow.com/questions/12337319/mongodb-group-group-and-mapreduce/12340283#12340283). – Stennie Sep 30 '12 at 10:24
  • 1
    @chutsu: Added an example with the Aggregation Framework for you ;-) – Stennie Sep 30 '12 at 10:33
  • hmmm the aggregate version doesn't seem to return the same values as the map reduce version – chutsu Oct 01 '12 at 12:10
  • @chutsu: How so? I get the same values for my test, but the data structure is slightly different (`aggregate()` uses `result` instead of `results`, and the `count` is at the the top level rather than nested under `value`). – Stennie Oct 01 '12 at 12:16
  • Could it be the 16MB limit with the $group operation? Because I read in "MongoDB in Action By Kyle Banker, p96", group() won’t process more than 10,000 unique keys, which might be the case in my test. But I suppose `group() != $group`... – chutsu Oct 01 '12 at 12:33
  • OMG... @Stennie so sorry, please ignore me I'm an idiot, I was quering the wrong collection... – chutsu Oct 01 '12 at 13:28