
In the past I built web analytics using OLAP cubes running on MySQL. An OLAP cube, the way I used it, is simply a large table (OK, it was stored a bit smarter than that) where each row is basically a measurement or an aggregated set of measurements. Each measurement has a bunch of dimensions (e.g. which pagename, useragent, ip, etc.) and a bunch of values (e.g. how many pageviews, how many visitors, etc.).

The queries that you run on a table like this are usually of the form (meta-SQL):

SELECT hour, SUM(hits), SUM(bytes)
FROM MyCube
WHERE date='20090914' AND pagename='Homepage' AND browser!='googlebot'
GROUP BY hour

So you get the totals for each hour of the selected day, with the mentioned filters applied. One snag was that querying these cubes usually meant a full table scan (for various reasons), which put a practical limit on how big (in MiB) you could make them.

I'm currently learning the ins and outs of Hadoop and the like.

Running the above query as a MapReduce job on a BigTable-like system looks easy enough: simply make 'hour' the key, filter in the map, and reduce by summing the values.
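
For concreteness, here is a minimal sketch of such a job using the Hadoop MapReduce Java API; the input layout (tab-separated lines with date, hour, pagename, browser, hits and bytes fields) and the class names are made up purely for illustration:

// A sketch of the "hour as key, filter in the map, sum in the reduce" idea.
// Assumed (hypothetical) input: tab-separated lines of the form
// date \t hour \t pagename \t browser \t hits \t bytes
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class HourlyTotals {

  public static class FilterMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] f = line.toString().split("\t");
      // The WHERE clause, applied in the map phase.
      if (f[0].equals("20090914") && f[2].equals("Homepage") && !f[3].equals("googlebot")) {
        // Key = hour, value = "hits<TAB>bytes".
        context.write(new Text(f[1]), new Text(f[4] + "\t" + f[5]));
      }
    }
  }

  public static class SumReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text hour, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      long hits = 0, bytes = 0;
      // The GROUP BY hour + SUM(...) part.
      for (Text v : values) {
        String[] f = v.toString().split("\t");
        hits += Long.parseLong(f[0]);
        bytes += Long.parseLong(f[1]);
      }
      context.write(hour, new Text(hits + "\t" + bytes));
    }
  }
}

That works as a batch job, but the job startup and shuffle overhead is exactly why it doesn't feel like 'real time', which brings me to the actual question.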

Can you run a query like I showed above (or at least one with the same output) on a BigTable kind of system in 'real time' (i.e. via a user interface, where the user gets their answer ASAP) instead of batch mode?

If not, what is the appropriate technology to do something like this in the realm of BigTable/Hadoop/HBase/Hive and the like?

Niels Basjes

5 Answers


It's even kind of been done (kind of).

LastFm's aggregation/summary engine: http://github.com/zohmg/zohmg

A Google search turned up a Google Code project, "mroll", but it doesn't have anything except contact info (no code, nothing). Still, you might want to reach out to that guy and see what's up. http://code.google.com/p/mroll/

SquareCog
  • Thanks for the zohmg suggestion. According to their website: "The core idea is to pre-compute aggregates and store them in a read-efficient manner". My idea is to start with a set of data and aggregate based on the user's needs at that moment. – Niels Basjes Sep 16 '09 at 11:45
  • You want to pre-aggregate so that for each unique combination of dimensions you have at most one row; the run-time aggregation is then a question of rolling up the appropriate cross-section of the cube (see the sketch after these comments). Zohmg can point the way for you on how to do that. I know of at least one ad network that uses either Hypertable or HBase to do real-time dashboarding for their customers, so it's doable. – SquareCog Sep 16 '09 at 13:54
  • From the readme: "The code is now wildly out-of-date with the current Hadoop and HBase implementations and is left here to slowly bitwither." – Landon Kuhn Mar 26 '13 at 17:08
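
As an illustration of that roll-up idea (this is only a guess at one possible layout, not zohmg's code or the ad network's setup): store one pre-aggregated row per unique dimension combination, e.g. a row key of the form date/pagename/browser/hour with a column family m holding one counter qualifier per measure. Answering the original query then becomes a narrow scan plus an in-memory sum, roughly like this with the HBase Java client:

// Sketch only: roll up a cross-section of a pre-aggregated cube in HBase.
// Assumed (hypothetical) layout: row key = "date/pagename/browser/hour",
// column family "m" with one counter qualifier per measure ("hits", "bytes").
import java.util.TreeMap;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class RollupQuery {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mycube");

    // Scan only the slice for date=20090914 and pagename=Homepage
    // ('0' is the byte right after '/', so this stop row ends the prefix).
    Scan scan = new Scan(Bytes.toBytes("20090914/Homepage/"),
                         Bytes.toBytes("20090914/Homepage0"));

    TreeMap<String, Long> hitsPerHour = new TreeMap<String, Long>();
    ResultScanner scanner = table.getScanner(scan);
    for (Result r : scanner) {
      String[] key = Bytes.toString(r.getRow()).split("/");
      String browser = key[2];
      String hour = key[3];
      if (browser.equals("googlebot")) {
        continue;  // the browser != 'googlebot' filter
      }
      byte[] hits = r.getValue(Bytes.toBytes("m"), Bytes.toBytes("hits"));
      if (hits == null) {
        continue;
      }
      Long sum = hitsPerHour.get(hour);
      hitsPerHour.put(hour, (sum == null ? 0L : sum) + Bytes.toLong(hits));
    }
    scanner.close();
    table.close();

    System.out.println(hitsPerHour);  // hour -> SUM(hits)
  }
}

Because the scan touches only the pre-aggregated rows for one day and one page, it can stay small enough to serve interactively from a user interface.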

My answer relates to HBase, but applies equally to BigTable.

Urban Airship open-sourced datacube, which I think is close to what you want. See their presentation here.

Adobe also has a couple of presentations (here and here) on how they do "low-latency OLAP" with HBase.

Suman

We managed to create low-latency OLAP in HBase by pre-aggregating a SQL query and mapping it into appropriate HBase qualifiers. For more detail, visit the site below.

http://soumyajitswain.blogspot.in/2012/10/hbase-low-latency-olap.html
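
The linked post has the details; purely as a guess at what "mapping into HBase qualifiers" can look like (the row-key scheme and names below are invented for this example, and mirror the scan sketch further up), the write side could be as simple as one atomic Increment per incoming event:

// Illustration only: fold raw events into a pre-aggregated cube.
// Row key = one unique combination of dimensions; one qualifier per measure.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.util.Bytes;

public class CubeWriter {
  private final HTable table;

  public CubeWriter(Configuration conf) throws Exception {
    table = new HTable(conf, "mycube");
  }

  /** Add one raw pageview event to the cube. */
  public void addEvent(String date, String pagename, String browser,
                       String hour, long hits, long bytes) throws Exception {
    byte[] rowKey = Bytes.toBytes(date + "/" + pagename + "/" + browser + "/" + hour);
    Increment inc = new Increment(rowKey);
    inc.addColumn(Bytes.toBytes("m"), Bytes.toBytes("hits"), hits);
    inc.addColumn(Bytes.toBytes("m"), Bytes.toBytes("bytes"), bytes);
    table.increment(inc);  // server-side atomic counters, readable right away
  }
}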

Soumyajit Swain

Andrei Dragomir gave an interesting talk about how Adobe implements OLAP functionality with M/R and HBase.

Video: http://www.youtube.com/watch?v=5U3EnfiKs44

Slides: http://hstack.org/hbasecon-low-latency-olap-with-hbase/

Nicolas

If you are looking for a table-scan approach, have you considered Google BigQuery? BigQuery scales out automatically on the back end, which gives interactive response times. There is a good session by Jordan Tigani from the 2012 Google I/O event that explains some of the internals.

http://www.youtube.com/watch?v=QI8623HlYd4

It's not MapReduce, but it is geared towards high-speed table scans like the one you described.

overcoil
  • While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. – Andy Hayden Sep 29 '12 at 14:17