13

Given the following star schema tables.

  • fact, two dimensions, two measures.

#   geog_abb  time_date amount     value
#1:       AL 2013-03-26  55.57 9113.3898
#2:       CO 2011-06-28  19.25 9846.6468
#3:       MI 2012-05-15  94.87 4762.5398
#4:       SC 2013-01-22  29.84  649.7681
#5:       ND 2014-12-03  37.05 6419.0224
  • geography dimension, single hierarchy, 3 levels in hierarchy.

#   geog_abb  geog_name geog_division_name geog_region_name
#1:       AK     Alaska            Pacific             West
#2:       AL    Alabama East South Central            South
#3:       AR   Arkansas West South Central            South
#4:       AZ    Arizona           Mountain             West
#5:       CA California            Pacific             West
  • time dimension, two hierarchies, 4 levels in each.

#    time_date time_weekday time_week time_month time_month_name time_quarter time_quarter_name time_year
#1: 2010-01-01       Friday         1          1         January            1                Q1      2010
#2: 2010-01-02     Saturday         1          1         January            1                Q1      2010
#3: 2010-01-03       Sunday         1          1         January            1                Q1      2010
#4: 2010-01-04       Monday         1          1         January            1                Q1      2010
#5: 2010-01-05      Tuesday         1          1         January            1                Q1      2010

Examples is stripped of surrogate keys to improve readability. In results there are levels in hierarchy without other attributes, just don't bother that, they are still levels in hierarchy.


In star schema expressed as:

         GEOGRAPHY (all fields)
        /
       /
   FACT
       \
        \
         TIME (all fields)

In snowflake schema expressed as:

                    geog_region_name
                   /
                  geog_division_name
                 /
                geog_abb (+ geog_name)
               /
              /
          FACT
              \
               \
                time_date
                   |
hierarchies:       |
        weekly    / \    monthly
                 /   \ 
                /     \
   time_weekday         time_month (+ time_month_name)
            |             |
            |             |
        time_week     time_quarter (+ time_quarter_name)
            |             |
            |             |
        time_year      time_year

How would you call the following schema

Does it have any specific name? Starflake? :)

        |>-- geog_region_name
        |
        |>-- geog_division_name
        |
        |>-- geog_abb (+ geog_name)
        |
        |
        geography base
       /
      /
  FACT
      \
       \
        time base
        |
        |
        |>-- time_date
        |
        |>-- time_weekday
        |
        |>-- time_week
        |
        |>-- time_month (+ time_month_name)
        |
        |>-- time_quarter (+ time_quarter_name)
        |
        |>-- time_year

It basically has a dimension base table storing identities of every level of every hierarchy within a dimension. No need for recursive walk through snowflake's levels, potentially less joins. Data still well normalized, only keys are denormalized into base table. All levels from all hierarchies tied to lowest grain key of a dimension in dimension base.
Additionally having a dimension base table allows to handle time variant attributes/temporal queries just in that table, at the granularity of a hierarchy level.

Here is the tabular representation.

Still on natural keys!

  • fact

#    geog_abb  time_date amount     value
# 1:       AK 2010-01-01 154.43 12395.472
# 2:       AK 2010-01-02  88.89  6257.639
# 3:       AK 2010-01-03  81.74  7193.075
# 4:       AK 2010-01-04 165.87 11150.619
# 5:       AK 2010-01-05   8.75  6953.055
  • time dimension base

#     time_date time_year time_quarter time_month time_week time_weekday
# 1: 2010-01-01      2010            1          1         1       Friday
# 2: 2010-01-02      2010            1          1         1     Saturday
# 3: 2010-01-03      2010            1          1         1       Sunday
# 4: 2010-01-04      2010            1          1         1       Monday
# 5: 2010-01-05      2010            1          1         1      Tuesday
  • time dimension normalization to hierarchy levels

#    time_year
# 1:      2010
# 2:      2011
# 3:      2012
# 4:      2013
# 5:      2014

#    time_quarter time_quarter_name
# 1:            1                Q1
# 2:            2                Q2
# 3:            3                Q3
# 4:            4                Q4

#    time_month time_month_name
# 1:          1         January
# 2:          2        February
# 3:          3           March
# 4:          4           April
# 5:          5             May

#    time_week
# 1:         1
# 2:         2
# 3:         3
# 4:         4
# 5:         5

#    time_weekday
# 1:       Friday
# 2:       Monday
# 3:     Saturday
# 4:       Sunday
# 5:     Thursday

#     time_date time_week time_weekday time_year
# 1: 2010-01-01         1       Friday      2010
# 2: 2010-01-02         1     Saturday      2010
# 3: 2010-01-03         1       Sunday      2010
# 4: 2010-01-04         1       Monday      2010
# 5: 2010-01-05         1      Tuesday      2010
  • geography dimension base

#    geog_abb geog_region_name geog_division_name
# 1:       AK             West            Pacific
# 2:       AL            South East South Central
# 3:       AR            South West South Central
# 4:       AZ             West           Mountain
# 5:       CA             West            Pacific
  • geography dimension normalization to hierarchy levels

#    geog_region_name
# 1:    North Central
# 2:        Northeast
# 3:            South
# 4:             West

#    geog_division_name
# 1: East North Central
# 2: East South Central
# 3:    Middle Atlantic
# 4:           Mountain
# 5:        New England

#    geog_abb  geog_name geog_division_name geog_region_name
# 1:       AK     Alaska            Pacific             West
# 2:       AL    Alabama East South Central            South
# 3:       AR   Arkansas West South Central            South
# 4:       AZ    Arizona           Mountain             West
# 5:       CA California            Pacific             West

Dimension base could store also primary key's attributes, this would de-duplicate dimension's lowest level but will be less consistent (time_date levels from both hierarchies would fit into time dimension base tables).


What drawbacks such schema would have? I don't much bother about speed of joins and aggregates, and a query tool adaptivity.
Does it have any name? It is being use? If not why?

philipxy
  • 14,867
  • 6
  • 39
  • 83
jangorecki
  • 16,384
  • 4
  • 79
  • 160
  • > "What drawbacks such schema would have? I don't much bother about speed of joins and aggregates, and a query tool adaptivity." <-- what do you care about then? And why don't you just use the star schema? – Mehdi Feb 22 '16 at 10:42
  • @mef memory. To isolate components of the model for more precise processing. All hierarchy levels of a dimension forms a factless fact a *dimension base*. All reference data are normalized, which means faster re-use on the reports. – jangorecki Feb 22 '16 at 12:01
  • this model will work only if you don't have aggregated tables, otherwise you will need a new dimension relationship table for each level of aggregation, quite annoying and no real value added – mucio Feb 24 '16 at 11:41
  • @mucio this is also true for regular star schema, isn't it? as you don't store higher level keys in a fact. – jangorecki Feb 24 '16 at 11:58
  • 1
    In a star schema you need a shrunken dimension (that can be a view with `distinct`) to join with an aggregated table, in your case you will need an additional join to have this shrunken dimension (How do I get from division to Region? distinct from Base table join Region table, in Star Schema you just will do distinct regions from geo). Anyway it seems to me that you store twice the same information (in base and lowest level of hierarchy); start schema store more data, but less joins, snowflake less data, more joins. – mucio Feb 24 '16 at 22:14
  • @mucio so it is generally a compromise between star schema and snowflake schema... data.table can use index on join since [1dae480](https://github.com/Rdatatable/data.table/commit/1dae480f244127d3de7e419271815e23992daf11) so joins can be now real-time. Even without that feature, the extra join you mentioned would takes places within dimensions, not touching facts, so won't impact speed much. – jangorecki May 01 '16 at 18:57

2 Answers2

4

You are building a snowflake schema with shortcuts.

It's used and BI tools can easily use the shortcuts.

You can also have shortcuts from a parent level of a dimension to a fact table at child level for that dimension. It works, you can skip a join, but you need to store an additional column in the fact table.

The only concern is about data integrity, if a parent-child relationship changes you need to update not only the child table, but also all other tables where this relationship is stored.

It's not a big deal if you generate every time your dimension table from your normalize data, but you need to be careful, even more if you store a parent ID in the fact table.

mucio
  • 7,014
  • 1
  • 21
  • 33
4

What you are doing is not a snowflake schema ...it is similar to "Data Vault" and our own variation "Link-Model". It essentially creates link tables just containing keys which sit between Fact tables and Dim tables (and other Dim tables). Although, we describe them as entity tables and measure tables.

The advantages are

  • You can parallel load dimension and fact tables, then populate the link tables
  • Complicated practices like "as at reporting" with "Adjustments" as found in Insurance can be handled quite readily
  • It is more intuitive to split slowly and quickly changing dimension Dimension tables that are just linked by the link tables. This is a time saving.
  • Adding new dimensions to fact tables is fairly simple and quick, after all it is just adding an extra integer column to a table containing just integers.
  • Factless facts are far more intuitive than in a conventional schema. You can create relationships between dimensions, without any fact record.

The downsides are

  • A slightly more complicated schema structure, so we generally create a Kimball models on top of the "Link Model", as business users tend to understand it well.
  • To add a new dimension to a fact table or to extend a dimension table can be easily done, but the schema can become cluttered over time.
Marcus D
  • 1,074
  • 1
  • 12
  • 27