
What is a reliable and efficient way to aggregate small data in MongoDB?

Currently, the data I need to aggregate is under 1 GB, but it can grow as high as 10 GB. I'm looking for a real-time strategy, or near real-time (aggregation every 15 minutes).

It seems like the likes of Map/Reduce, Hadoop, and Storm are all overkill. I know that triggers don't exist, but I found this one post that may be ideal for my situation. Is creating a trigger in MongoDB an ideal solution for real-time small data aggregation?

  • Have you considered MongoDB's built-in [aggregation pipeline](http://docs.mongodb.org/manual/core/aggregation-pipeline/) framework? – Nikolay Manolov Jan 10 '14 at 09:09
  • I'm quite new to this. How would I make it real time? – krikara Jan 10 '14 at 09:13
  • I didn't quite understand your question at first. Have a look at [this](http://stackoverflow.com/questions/5807618/strategies-for-real-time-aggregations-in-mongodb?rq=1) question. He has a similar problem. – Nikolay Manolov Jan 10 '14 at 09:53
  • @NikolayManolov Yes, I have seen that question already. My question is different, however, because I am looking for small data strategies. His list of recommendations is inefficient for my work. – krikara Jan 10 '14 at 09:56

1 Answer


MongoDB has two built-in options for aggregating data: the aggregation framework and map-reduce.

The aggregation framework is faster (executing as native C++ code as opposed to a JavaScript map-reduce job) but more limited in the sorts of aggregations that are supported. Map-reduce is very versatile and can support very complex aggregations but is slower than the aggregation framework and can be more difficult to code.
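For illustration, here is a minimal pipeline in the mongo shell; the collection and field names are hypothetical. It filters recent documents, then groups and sums per event type, all executing natively:

```js
// Hypothetical events collection: { type: "click", value: 3, ts: ISODate(...) }
db.events.aggregate([
  // Filter first so later stages scan less data.
  { $match: { ts: { $gte: ISODate("2014-01-10T00:00:00Z") } } },
  // Group by event type, summing values and counting documents.
  { $group: { _id: "$type", total: { $sum: "$value" }, count: { $sum: 1 } } },
  { $sort: { total: -1 } }
])
```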

Either of these would be a good option for near real time aggregation.

One further consideration to take into account is that as of the 2.4 release the aggregation framework returns a single document containing its results and is therefore limited to returning 16MB of data. In contrast, MongoDB map-reduce jobs have no such limitation and may output directly to a collection. In the upcoming 2.6 release of MongoDB, the aggregation framework will also gain the ability to output directly to a collection, using the new $out operator.
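As a sketch of what $out will look like (MongoDB 2.6+ only; the collection names here are made up), the stage must come last in the pipeline and writes the results to a collection rather than returning them in a single 16 MB-capped document:

```js
// Requires MongoDB 2.6+.
db.events.aggregate([
  { $group: { _id: "$type", total: { $sum: "$value" } } },
  // $out must be the final stage; it replaces the contents of event_totals.
  { $out: "event_totals" }
])
```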

Based on the description of your use case, I would recommend using map-reduce as I assume you need to output more than 16MB of data. Also, note that after the first map-reduce run you may run incremental map-reduce jobs that run only on the data that is new/changed and merge the results into the existing output collection.
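Here is a rough sketch of that incremental pattern in the mongo shell, assuming documents carry a ts timestamp and using a hypothetical event_totals output collection. The first run seeds the collection; later runs pass a query for new documents and use the reduce output action to fold new results into the existing ones:

```js
// Map and reduce run as JavaScript on the server.
var mapFn = function () { emit(this.type, this.value); };
var reduceFn = function (key, values) { return Array.sum(values); };

// Initial run: aggregate everything into the output collection.
db.events.mapReduce(mapFn, reduceFn, { out: "event_totals" });

// Incremental run (e.g. every 15 minutes): only process new documents and
// re-reduce their results against what is already in event_totals.
var lastRun = ISODate("2014-01-10T09:00:00Z"); // persist this between runs
db.events.mapReduce(mapFn, reduceFn, {
  query: { ts: { $gt: lastRun } },
  out: { reduce: "event_totals" }
});
```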

As you know, MongoDB doesn't support triggers, but you can implement trigger-like behaviour in your application by tailing the MongoDB oplog. This blog post and this SO post cover the topic well.
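As a bare-bones sketch of the idea (not production code), a tailable cursor on the oplog from the mongo shell might look like the following; it assumes a replica set (standalone servers have no oplog) and a hypothetical mydb.events namespace:

```js
// local.oplog.rs is a capped collection, so it supports tailable cursors.
var oplog = db.getSiblingDB("local").oplog.rs;
var cursor = oplog.find({ ns: "mydb.events", op: "i" }) // "i" = inserts
                  .addOption(DBQuery.Option.tailable)
                  .addOption(DBQuery.Option.awaitData);

while (cursor.hasNext()) {
  var entry = cursor.next();
  // entry.o holds the inserted document; run your "trigger" logic here.
  printjson(entry.o);
}
```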

Jon Rangel
  • I don't need map-reduce, just the aggregation framework. What I am confused about is how to run it in real time. A simple example: I have 100,000 entries of about 5 KB each. From each entry, I want to pull out the car license # and owner, and create a new file with all the licenses and names. But I want this to be in real time, and I don't quite understand how to execute this (see the sketch after these comments). – krikara Jan 10 '14 at 15:22
  • @krikara, thanks for the clarifications. I have edited my above answer based on this extra info. – Jon Rangel Jan 14 '14 at 14:15
  • +1 For the update. Now, from what I've been reading on the web, it seems map-reduce may be slow for my job. In reality, all the data inserted into my MongoDB is metadata. Then I need to process that data to create reports. Should I be using MongoDB to aggregate that data, or should I be sending all the metadata to a different server to process it there? If the latter case is true, what should I be using to aggregate (JavaScript, Storm)? While each entry is roughly 5 KB, the amount of data processed in one day is around 5 GB. Thanks for the help ~ – krikara Jan 14 '14 at 14:42
  • If you use incremental map-reduce jobs (as described in the link I gave above) that only operate on the data that has changed, then I think MongoDB map-reduce should be suitable as a near real time aggregation mechanism, e.g. if you schedule an MR job every 15 minutes to crunch the documents that have changed during that time. I would recommend doing some performance testing using the scale numbers that you mention above. If you really need true real-time aggregation, then you may be better off looking at something like Storm to process the data. – Jon Rangel Jan 23 '14 at 12:21
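To make the license/owner use case from the comments concrete, here is a minimal sketch using only the aggregation framework in the mongo shell. The collection name (entries) and field names (carLicense, owner) are assumptions; on 2.4 the results come back inside a single command document, so a scheduled script would need to write them out itself:

```js
// Project just the license number and owner out of each entry.
var res = db.entries.aggregate([
  { $project: { _id: 0, license: "$carLicense", owner: "$owner" } }
]);

// On MongoDB 2.4 the shell returns { result: [...], ok: 1 }; a script run
// every 15 minutes (e.g. via cron) could dump res.result to a file or
// another collection for the report.
printjson(res.result);
```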