
We know that Snowflake is a compressed, columnar-storage database tuned to run queries with MPP and auto-scaling. We also know that, for creating data marts and data warehouses, Kimball and dimensional modelling (star schema) have been standard practice for decades now. That approach succeeded largely because of the massive row-store databases we used to build our DWs on.

So the question is: for creating data marts and a DW in Snowflake, do we have to follow Kimball? Does it add any value to performance? In fact, I have read that it adds overhead for an engine that is already tuned to work on columnar compressed data. Do we still need to use surrogate keys and force everything into facts, dimensions and a star schema, when we could simply join flat, denormalized tables and get similar or better performance?
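To make the comparison concrete, here is a minimal sketch of the two query shapes I am comparing; all of the table and column names (fact_sales, dim_date, sales_flat, and so on) are hypothetical:

    -- Star schema: a fact keyed by surrogate keys, joined to dimensions
    SELECT d.calendar_month,
           c.country_name,
           SUM(f.sales_amount) AS total_sales
    FROM   fact_sales   f
    JOIN   dim_date     d ON f.date_key     = d.date_key
    JOIN   dim_customer c ON f.customer_key = c.customer_key
    GROUP BY d.calendar_month, c.country_name;

    -- Flat/denormalized alternative: the same attributes carried on one wide table
    SELECT calendar_month,
           country_name,
           SUM(sales_amount) AS total_sales
    FROM   sales_flat
    GROUP BY calendar_month, country_name;

The question is whether the joins in the first shape buy anything on a columnar engine, or whether the second shape is just as good or better.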

What do powerful databases like Snowflake recommend as a modelling best practice? Is Kimball a must-have, or is it redundant because it defeats the purpose of columnar storage?

As far as I can tell, SAP HANA, Redshift, BigQuery and even Azure SQL Data Warehouse don't recommend this; I could not find a single line anywhere recommending Kimball or a star schema. A few do mention that "it also works for star schemas", which does not mean that a star schema has to be used.

  • The dimensional model is not just for performance. It is also a model that business users can easily understand (vs a complicated OLTP model or a big flat model where everything is joined together). Surrogate keys support SCD2, which is another way of simplifying things for business users. A star schema is not an overhead from a read perspective; it's an overhead from a write perspective, but that goes for any data loaded into a columnar compressed table – Nick.Mc Feb 08 '21 at 23:55
  • Just to add to what @Nick.McDermaid has said, the DM approach also includes important concepts like conformed dimensions. Plus, the majority of BI tools expect the data to be in the form of facts/dimensions, so (assuming you are using such a BI tool) you'd have to model your flat file as a virtual set of facts and dimensions anyway – NickW Feb 09 '21 at 09:53
  • @Nick.McDermaid 100% agree with you, but SCD2 can be handled without a surrogate key by using a compound key of ID + DateTo (9999-12-31); see the sketch after these comments. Do you agree? Is that the ONLY benefit of surrogate keys? – Amit Mahapatra Feb 10 '21 at 09:56
  • @Nick.McDermaid Agree with you on reading, but reading facts and dims might be confusing for end users because of the keys. For sales, for example, instead of giving end users a sales fact and a few dims and asking them to join them to get the data, you could give them exactly what they need from the flat tables in a view, or give them views of the transformed tables (customer, country, etc.) and let them join those in Tableau or Power BI to get the details by ID; the effort is the same. If anything, joining those views will be easier (the keys are readable rather than numeric) than joining a fact to dims on surrogate keys. Do you agree? – Amit Mahapatra Feb 10 '21 at 10:12
  • No end user should need to know how to join or write SQL. No end user should even need to know what a surrogate key is. When presenting a star schema to a user, surrogate keys should not even be visible; they are used automatically, internally, for joins by the reporting tool. Yes, you can give them a pre-joined version of exactly what they need, but then you end up with endless permutations of joins and aggregates. You're basically building a large report catalog (which also has its place). – Nick.Mc Feb 10 '21 at 10:19
  • @NickW That's a good point. As said above, you can create such dim views and enrich them from multiple tables to create conformed dims. But if a transaction table already has the dim keys, does it make sense to discard them and force the creation of a fact from it using surrogate keys, only to later join on those surrogate keys again to get back the dim keys and additional fields? That complicates things, compared with simply joining the transaction table to such dim views to get the additional fields. Which way would you go? – Amit Mahapatra Feb 10 '21 at 10:23
  • Hi - if you read a book on dimensional modelling (https://www.amazon.co.uk/Data-Warehouse-Toolkit-Definitive-Dimensional/dp/1118530802) it should answer all your questions. Whether you then think that DM is the appropriate methodology to follow is entirely up to you. However, I would strongly recommend that you follow an industry-standard methodology (whichever one suits your particular needs best) rather than making something up yourself or, even worse, not following any methodology at all – NickW Feb 10 '21 at 10:42
  • There are a number of reasons that you go to the trouble of creating surrogate keys over using source system keys. Look up SCD2 for starters. – Nick.Mc Feb 10 '21 at 22:09
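As a minimal sketch of the SCD2 point discussed in the comments above (table and column names are hypothetical, Snowflake-style SQL): a surrogate-keyed Type 2 dimension gives the fact a single narrow key per row version, whereas the compound natural key (ID + DateTo = 9999-12-31) forces a range join from the fact.

    -- SCD2 dimension with a surrogate key: one row per version of a customer;
    -- the fact stores the surrogate key of the version that was current at
    -- load time, so query-time joins are simple equality joins.
    CREATE TABLE dim_customer (
        customer_key  NUMBER IDENTITY,  -- surrogate key
        customer_id   VARCHAR,          -- natural/business key from the source
        customer_name VARCHAR,
        country_name  VARCHAR,
        valid_from    DATE,
        valid_to      DATE              -- '9999-12-31' marks the current version
    );

    -- Compound-key alternative: no surrogate key, so the fact must carry the
    -- business key plus a transaction date, and the join becomes a range join.
    SELECT f.sales_amount, d.country_name
    FROM   fact_sales   f
    JOIN   dim_customer d
      ON   f.customer_id = d.customer_id
     AND   f.sale_date  >= d.valid_from
     AND   f.sale_date  <  d.valid_to;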

1 Answer


One thing to keep in mind: Snowflake is a hybrid row/columnar store. That's an important distinction. It means that Snowflake takes advantage of the significant compression gains associated with columnar storage while still maintaining a row-oriented grouping of the data (micro-partitions).

Why does this matter?

With the micro-partition approach, Snowflake can eliminate large numbers of rows (whole micro-partitions) using query predicates, and then read only the columns it needs within the micro-partitions that met the query's criteria. So you really get the best of both worlds.
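For example (a hypothetical sales_flat table, assuming the data arrives roughly in date order or is clustered by sale_date), a selective predicate lets Snowflake prune whole micro-partitions using their per-column min/max metadata, and only the referenced columns are scanned inside the partitions that survive:

    -- The WHERE clause prunes micro-partitions whose sale_date range cannot
    -- match; within the surviving micro-partitions only sale_date,
    -- country_name and sales_amount are read.
    SELECT country_name,
           SUM(sales_amount) AS total_sales
    FROM   sales_flat
    WHERE  sale_date BETWEEN '2021-01-01' AND '2021-01-31'
    GROUP BY country_name;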

In my opinion, Snowflake can support just about any data model (or partial/hybrid implementation).

Also - "redundant" values in a columnar store tend to lead to very, VERY good compression.
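If you want to check this for your own data, one rough way (hypothetical table and schema names; BYTES in INFORMATION_SCHEMA.TABLES reports Snowflake's compressed storage size) is to compare the denormalized table against its star-schema equivalent:

    -- Compare compressed on-disk size of the flat table vs the star schema
    SELECT table_name, row_count, bytes
    FROM   information_schema.tables
    WHERE  table_schema = 'ANALYTICS'   -- hypothetical schema
      AND  table_name IN ('SALES_FLAT', 'FACT_SALES', 'DIM_CUSTOMER', 'DIM_DATE');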

Rob Silva