
We are working on a project that will feature real-time statistics of some actions (e.g. clicks). On every click, we will log information like date, age and gender (these come from Facebook), location, etc.

We are discussing the best place to store this information and use it for real-time stats. We will display aggregate statistics: for example, number of clicks, number of clicks made by male/female, number of clicks broken down by age groups (e.g. 18-24, 24-30...).
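For concreteness, each logged click might look roughly like this (the field names are just an illustration, not our final schema), and the stats above are simply counts over these fields:

{
  ts: ISODate("2011-04-29T10:00:00Z"), // when the click happened
  age: 21,          // from Facebook
  gender: "M",      // from Facebook
  country: "IT",    // derived from the visitor's IP
  action: "click"
}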

Since we are already using MongoDB everywhere on the site, my colleague thought we should store the statistics in it as well. I, however, would prefer a SQL-based database for this task, like MySQL (or maybe Drizzle), because I believe SQL is better at operations like data aggregation. Although there's the overhead of parsing the SQL, I think MySQL/Drizzle may actually be faster than NoSQL databases here. And inserts are not slow either, when using INSERT DELAYED queries.

Please note that we do not need to perform JOINs or collect data from multiple tables/collections, so we don't mind using a different database just for this. However, we do care about scalability and reliability. We are building something that will (hopefully) become very big, and we've designed every single line of code with scalability in mind.

What do you think about this? Is there any reason to prefer MongoDB over MySQL/Drizzle for this, or does it not really matter? Which one would you use, if you were us?

Thank you, Alessandro

ItalyPaleAle
  • Just as a suggestion: have a look at RRDtool (http://www.mrtg.org/rrdtool/), it might be useful – cristian Apr 29 '11 at 10:35
  • Thanks, but this is not the kind of statistics we are looking for! We will have something more like YouTube's video statistics... Something like: http://www.reelseo.com/wp-content/uploads/2010/06/youtube-video-stats.png – ItalyPaleAle Apr 29 '11 at 10:47

2 Answers


BuddyMedia is doing some of this with MongoDB, and the Gilt Groupe has done something pretty cool with Hummingbird (node.js + MongoDB).

Having worked for a large online advertiser in the social media space, I can attest that real-time reporting is really a pain. Trying to "roll up" 500M impressions a day is already a challenge; doing it in real time worked, but it carried some significant limitations (for one, it was actually delayed by about 5 minutes :).

Frankly, this type of problem is one of the reasons I started using MongoDB. And I'm not the only one. People are using MongoDB for all kinds of real-time analytics: server monitoring, centralized logging, as well as dashboard reporting.

The real key when doing this type of reporting is to understand that the data structure is completely different with MongoDB: you're going to avoid "aggregation" queries, so both the queries and the output charts are going to be different, and there's some extra coding work on the client side.

Here's the key that may point you in the right direction for doing this with MongoDB. Take a look at the following data structure:

{
  date: "20110430",
  gender: "M",
  age: 1, // 1 is probably a bucket
  impression_hour: [ 100, 50, ...], // 24 of these
  impression_minute: [ 2, 5, 19, 8, ... ], // 1440 of these
  clicks_hour: [ 10, 2, ... ],
  ...
}

There are obviously some tweaks here: appropriate indexes, maybe mushing date+gender+age into an _id. But that's kind of the basic structure of click analytics with MongoDB. It's really easy to update impressions and clicks: { $inc : { "clicks_hour.0" : 1 } } updates the whole document atomically. And it's actually pretty natural to report on: you already have an array containing your hourly or minute-level data points.
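As a rough sketch (the collection name stats and the exact query shape are just assumptions to match the structure above), the per-click update and the read that feeds a chart could look like this in the mongo shell:

// assumes the day's document was pre-created with zero-filled 24/1440-element arrays,
// so the dotted index below addresses an array slot rather than creating a sub-document
db.stats.update(
  { date: "20110430", gender: "M", age: 1 },                  // one doc per day/gender/age bucket
  { $inc: { "impression_hour.14": 1, "clicks_hour.14": 1 } }  // atomic bump for hour 14
);

// reporting: a single document already holds the whole day's hourly series
var day = db.stats.findOne({ date: "20110430", gender: "M", age: 1 });
// day.clicks_hour is the 24-element array you can hand straight to the charting code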

Hopefully that points you in the right direction.

Gates VP
  • how would you aggregate that then for stats? – Toby Dec 13 '12 at 00:25
  • The best way to aggregate a large amount of data is to use some form of Map / Reduce framework (think Hadoop). If you are tracking something like a click, you would first validate the click, then you would ping the real-time counters, then you would pass off the click data to the M/R system for full aggregation. This real-time stuff only works if you know what you want in advance. – Gates VP Dec 13 '12 at 19:04
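To make that flow concrete, here is a minimal sketch of a click handler along the lines of the comment above (isValidClick, the raw_clicks collection, and the field names are all hypothetical):

function handleClick(click) {
  // 1. validate the click (bot / duplicate filtering) -- isValidClick is hypothetical
  if (!isValidClick(click)) return;

  // 2. ping the real-time counters: same pre-aggregated document as in the answer
  var inc = {};
  inc["clicks_hour." + click.hour] = 1;
  db.stats.update(
    { date: click.date, gender: click.gender, age: click.ageBucket },
    { $inc: inc }
  );

  // 3. hand the raw event to the batch Map/Reduce pipeline for full aggregation,
  //    e.g. by appending it to a collection (or queue) the M/R job reads later
  db.raw_clicks.insert(click);
}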

MongoDB is great for this kind of thing and will certainly be faster than MySQL here, although don't underestimate how powerful MySQL can be: many companies have built analytics tools with it.

Have a look at this presentation by Patrick Stokes of BuddyMedia on how they used MongoDB for their analytic system.

http://www.slideshare.net/pstokes2/social-analytics-with-mongodb

Bryan Migliorisi
  • Actually, I'm the one who was on the "MySQL side" :) My colleague was the one who voted for MongoDB... Anyway, thanks for those slides. I ended up understanding we are doing it wrong! We will need to re-think the structure of our collection. – ItalyPaleAle Apr 30 '11 at 18:22