
I'm just starting out with MongoDB and trying to build some simple things. I filled up my database with a collection of documents that each contain an "item" property, and I wanted to count how many times each item appears in the collection.

An example document:

{ "_id" : ObjectId("50dadc38bbd7591082d920f0"), "item" : "Pons", "lines" : 37 }

So I designed these two functions for doing MapReduce (written in Python using pymongo):

from bson.code import Code

# Map: emit a 1 for every occurrence of an item.
all_map = Code("function () {"
               "    emit(this.item, 1);"
               "}")

# Reduce: sum up the 1s emitted for each distinct item.
all_reduce = Code("function (key, values) {"
                  "    var sum = 0;"
                  "    values.forEach(function(value){"
                  "        sum += value;"
                  "    });"
                  "    return sum;"
                  "}")

This worked like a charm, so I began filling up the collection. At around 30,000 documents, the MapReduce already takes longer than a second... Since NoSQL brags about speed, I figured I must be doing something wrong!

A question here on Stack Overflow made me check out MongoDB's aggregation framework, so I tried a $group + $sum + $sort pipeline and came up with this:

db.wikipedia.aggregate([
  { $group: { _id: "$item", count: { $sum: 1 } } },
  { $sort: { count: 1 } }
])
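
For completeness, the same pipeline issued from Python looks roughly like this (a sketch assuming pymongo 3.x+, where aggregate takes a list of stages and returns a cursor; `db` is the same database handle as above):

# The same $group + $sort pipeline issued through pymongo.
pipeline = [
    {"$group": {"_id": "$item", "count": {"$sum": 1}}},
    {"$sort": {"count": 1}},
]
for doc in db.wikipedia.aggregate(pipeline):
    print(doc["_id"], doc["count"])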

This code works just fine and gives me the same results as the MapReduce, but it is just as slow. Am I doing something wrong? Do I really need other tools like Hadoop to get better performance?

  • $group cannot use an index, so you are doing a full table scan and then sorting on a computed field, which again cannot use an index... hmm, yeah, I think this could easily be as slow as MR; take a look at the notes on $sort: http://docs.mongodb.org/manual/reference/aggregation/#_S_sort. To be honest, I don't think this is the fault of the tool but rather of the schema design, if you need to run a query like this in a realtime-ish amount of time – Sammaye Dec 27 '12 at 10:34
  • read this for clarification: http://stackoverflow.com/questions/12015064/mongodb-mapreduce-and-sorting – Dmitry Zagorulkin Dec 27 '12 at 10:55
  • @Sammaye As you can see from my closing sentences, I am not raging at the tools. I have no experience yet with NoSQL and MongoDB; I am just asking what is wrong and how I can improve my design to get this thing running the **right** way. – Arninja Dec 27 '12 at 11:36
  • I wasn't accusing you of rage :) I was stating where I believe the problem is. Hmm, an additional pre-aggregated collection is a good place to start: each time you add an `item`, you ping that row (maybe in your app) in another collection, upserting the data with a `$inc` operator. That is probably the best way without reading too much into it. Of course this does mean you have two collections to manage, but it will be faster and easier to manage them than to run the query you are running. – Sammaye Dec 27 '12 at 11:40
  • @Sammaye Well, to be honest, I found your feedback very constructive, so I just wanted to make sure ;) I was also thinking of a two-collection solution for this. It makes more sense in the NoSQL story than trying to create a more complicated query. And thanks a lot for pointing out that sorting on a calculated field is indeed a little slow :) – Arninja Dec 27 '12 at 11:46
  • Sammaye's design suggestion is absolutely correct. If you want real-time performance on ANY kind of query at massive scale, you'll have to have the solution precomputed in a separate collection. I tend to have a collection that stores the full item/transaction I'm looking at and, with each new "type" of query (counts by minute, related items, etc.), create a new collection. This is an overwhelmingly popular way of getting real-time performance, and you won't find it in the Mongo documentation (I think they worry that encouraging that kind of denormalization will scare off newcomers). – marr75 Dec 27 '12 at 18:57
  • Also of note: Mongo's speed comes from fire-and-forget inserts and data localization. Ideally you are taking advantage of the fire-and-forget inserts/updates to transform your incoming data into the exact form for each collection, and then querying one to a few items by their (hopefully contiguous) _ids. – marr75 Dec 27 '12 at 19:29
  • `Hadoop MapReduce` is not about answering realtime queries from a database. It is an offline batch utility. – Thomas Jungblut Dec 27 '12 at 22:08

1 Answer


I will place an answer basically summing up my comments. I cannot speak for other technologies such as Hadoop, since I have not yet had the pleasure of finding time to use them, but I can speak for MongoDB.

Unfortunately you are using two of the worst operations for any database: computed fields and grouping (or distinct) over a full table scan. The aggregation framework in this case must compute the field, group on it, and then sort the computed field in memory (see the notes on $sort: http://docs.mongodb.org/manual/reference/aggregation/#_S_sort). This is an extremely inefficient task for MongoDB, and in fact most likely for any database.

There is no easy way to do this in real time inline in your own application. Map reduce could be a way out if you didn't need to return the results immediately, but since I am guessing you don't really want to wait for this kind of thing, the default method is simply to eradicate the group altogether.

You can do this by pre-aggregation. You create another collection, grouped_wikipedia, and in your application you maintain it using upserts with atomic operators like $set and $inc (to count the occurrences), making sure you only get one row per item. This is probably the most sane method of solving this problem.
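
A minimal pymongo sketch of that upsert (update_one with upsert=True is the pymongo 3.x+ spelling of the upsert described above; grouped_wikipedia is just the example collection name):

# Call this whenever a new document for `item` goes into the main collection.
def record_item(db, item):
    db.grouped_wikipedia.update_one(
        {"_id": item},            # one row per distinct item
        {"$inc": {"count": 1}},   # atomically bump its count
        upsert=True,              # create the row on first occurrence
    )

Reading the counts back then becomes a simple find() on grouped_wikipedia (with an index on count if you want a cheap sort), instead of a full scan plus an in-memory sort.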

This does, however, raise the problem of having to manage this extra collection alongside the detail collection wikipedia, but I believe that to be an unavoidable side effect of getting the right performance here. The benefits will outweigh the cost of managing the extra collection.

Sammaye
  • But if you have a collection with 8 million entries and you constantly run map & reduce on it to keep your "cache" up to date, does that not slow down your DB? – Robert Reiz Jul 15 '14 at 20:14
  • @RobertReiz It could, possibly; it depends on quite a few factors. You do have the weight of a JS engine, but the JS engine is no longer single-threaded and it can release locks on the DB while it does its processing, so the problem is the IO needed to write to the DB once the MR is done. However, if you run an MR that only picks out, say, 10,000 rows per 5 minutes, you will find that MongoDB can quite happily cope with that – Sammaye Jul 15 '14 at 20:52
  • I don't care so much about the client, but about the mongod process. Assume I have 8 million entries and each MongoDB MR run takes 5 mins; then I need 27 days to calculate all my caches. That's far away from real time :-) – Robert Reiz Jul 15 '14 at 21:03
  • @RobertReiz That is the mongod process; the mongod process will fire up its built-in JS engine, V8 in 2.2+. Each entry in your collection takes MongoDB 5 mins to calculate? Wait, how, why? – Sammaye Jul 15 '14 at 21:29
  • @RobertReiz Bear in mind that it isn't much of an incremental MR job if you run all 8M records again every 5 mins – Sammaye Jul 15 '14 at 21:30
  • I have a collection with 8 million entries and an MR query which runs on those 8 million entries. A single MR query on the collection takes between 1 sec and 5 min, depending on the parameters. If I execute 1000 of these MR queries to cache the results, it will take forever. That's not something I just brainstormed; it's a real problem I am currently trying to solve. Your input is highly appreciated. – Robert Reiz Jul 16 '14 at 08:17
  • Currently I index my 8 million entries with ElasticSearch. I will experiment with the ES facets feature and compare the results with Mongo's MR. I hope ElasticSearch performs better. – Robert Reiz Jul 16 '14 at 08:20
  • @RobertReiz Yeah, you see, you are trying to refresh your entire collection. I am unsure about your scenario here, how the data changes, when it changes and even why; in fact I know nothing of your scenario, but if ES faceting works for you then it sounds like you're trying to get MR to do result faceting, which it is not designed to do. There are specific scenarios for certain techs and tools, and this is a scenario for an FTS tech – Sammaye Jul 16 '14 at 09:05
  • If pre-aggregation is not an option, what solutions could you recommend within the scope of these requirements (the main metric being performance)? – Alex Aug 24 '15 at 16:12