First steps for OLAP within BigData world

Question

First of all I may be misinformed about BigData capability nowadays. So, don't hesitate to correct me if I'm too optimistic.

I usually work with regular KPIs, like show me: count of new clients where they meets certain complex conditions (joining few fact tables) for every manager during certain month.

These requests are quite dynamical, thus there is no way to predict pre-calculated data. We use OLAP and MDX for dynamic reporting. The price of dynamic calculating is performance. Users usually wait for the result more than a minute.

Here I got to BigData. I've read some articles, forums, docs leading me to ambiguous conclusions. BigData provides tools to handle data in seconds, however it doesn't fit BI tasks well, like joins, pre-agreggation. There is no the classical DWH over the hadoop concept and so on.

Nonetheless, It's a theory. I've found Kylin which makes me give a try it practically. The more I'm digging, the more questions appear. Some of them:

Do I need any programing knowledge (Java, Scala, Python)?
Do I need graphical tools, ssh access is enough?
What hardware requirements meet my needs for 100-200 gigabytes DBs (also number of hardware)?
What's the best filesystem (ext4), should I care at all?
How can I migrate data from RDBMS, is there any smart ETLs?
What technologies should I learn and use first (pig, spark, etc)?

Actually I might ask wrong questions and totally misunderstand the conception, but hoping for some good leads. Feel free to give any advice you consider useful about the BI and Bigdata consolidation.

I know about http://kylin.apache.org/docs15/index.html But I don't feel comfortable to try it without backend backgroung.

score 2 · Answer 1 · answered Apr 30 '18 at 12:22

If you are familiar with Apache Spark that's also a good start. At ActiveViam we use Spark for Big Data processing and we also needed to do interactive OLAP queries on the same data. So we made an extension called Sparkube that exposes a Spark dataset as a multidimensional cube.

Once your dataset is exposed that way, you get access to all the OLAP MDX functions directly on top of Spark, without moving the data, without deploying software, without configuration, directly from Excel or Tableau.

Here is for instance how you mount the content of a CSV file in memory and expose it as a multidimensional cube:

// Load a file into a Spark dataset.
// Here we load a CSV file, get the column names from
// the CSV headers, and use Spark automatic type inference.
var ds = spark.read
  .format("csv")
  .option("header","true")
  .option("inferSchema","true")
  .load("path/to/file.csv")

// Import the sparkube library (the sparkube jar must be in the classpath)
import com.activeviam.sparkube._

// Expose the dataset as a multidimensional cube.
// You can start visualizing the cube right away at http://localhost:9090/ui
// or connect from Excel with the XMLA protocol at http://localhost:9090/xmla
new Sparkube().fromDataset(ds)
  .withName("My first cube")
  .expose()

The cube structure is infered from the Spark dataset. This way you can jump from Spark to OLAP instantly. Dataset columns become single level hierarchies and distinct count measures, numerical columns become measures (SUM, AVG, MIN, MAX, STD...) — Antoine CHAMBILLE, May 04 '18 at 09:42
By the way a live demo of sparkube has just been recorded at Devoxx FR ( https://www.youtube.com/watch?v=CbfQiMhfaJk ). Although it's in French you can see how it works. — Antoine CHAMBILLE, May 04 '18 at 09:50

score 1 · Answer 2 · answered Sep 25 '16 at 00:45

Apache Kylin is the right tool given you are looking for multi-dimensional analysis. It provides pre-calculation of joins and aggregations, so SQL/MDX queries can come back in no more than a few seconds.

To use Apache Kylin, you have two roles to play: Admin and Analyst. As an admin, you need to prepare a Hadoop cluster and deploy Kylin on it. That requires knowledge about Hadoop and Linux shells. The size and hardware of the cluster depends on your data volume.

Once installed, you as an analyst can build model, cube, and run SQL in Kylin. That requires knowledge about relational model, OLAP, and SQL. No programming is required. Kylin supports ODBC/JDBC interface, you can connect familiar BI tools to visualize data in Kylin.

Typically first time user will try Apache Kylin in a Hadoop sandbox. That shields away many Hadoop complexities and saves time.

score 1 · Answer 3 · answered Feb 01 '18 at 12:15

1

Bigdata means large amount of data.you can procced any type of data by the help of bigdata hadoop.But OLAP is usually worked on smaller data. OLAP access to Hadoop datasets utilizing Hive and HBase.

answered Feb 01 '18 at 12:15

Ruchi Chouhan

11
1

Looks more like a comment. – Sunil Feb 01 '18 at 12:40

First steps for OLAP within BigData world

3 Answers3