4

I am new to dimensional modeling and have read a lot of material (star-schema, dimension/fact tables, SCD, Ralph Kimball's - The Data Warehouse Toolkit book, etc). So I have a good conceptual understanding of dimensional modeling constructs but finding it hard to apply to a usecase due to lack of experience and need some guidance.

Consider Twitter for example, I want to design a dimensional model to calculate -

  1. DAU (Daily active users) = number of users who logged in and accessed twitter via website or mobile app on a given day
  2. MAU (Monthly active users) = number of users who logged in and accessed twitter via website or mobile app in last 30 days including measurement date
  3. User engagement = total(clicks + favorites + replies + retweets) on a tweet

These metrics over a period (like month) is the summation of these metrics on each day in that period.

I want to write SQLs to calculate these metrics for every quarter by region (eg: US and rest of world) and calculate year-over-year growth (or decline) in these metrics.
Eg: enter image description here

Here are some details that I thought about -

Factless (transaction) fact table for user login activity with grain of 1 row per user per login : user_login_fact_schema (user_dim_key, date_dim_key, user_location_dim_key, access_method_dim_key)

Factless (transaction) fact table for user activity with grain of 1 row per user per activity : user_activity_fact_schema (user_dim_key, date_dim_key, user_location_dim_key, access_method_dim_key, post_key, activity_type_key)

Does this sounds correct? How should my model look like? What other dimensions/facts can I add here?

Wonder if I should collapse these 2 tables into 1 and have activity_type for logins as 'login', but there can be a huge number of logins without any activity so this will skew the data. Am I missing anything else?

Community
  • 1
  • 1
Saurabh Agrawal
  • 1,355
  • 3
  • 17
  • 33
  • 1
    I am wondering if you could you have designed your fact table to store counts of users instead of user id by the lowest grain (daily for you) and then aggregate as needed. Can you share what your final design looked like (paste bin)? I would be interested to see. – Appy May 20 '18 at 00:19
  • Saurabh, What was your final start-schema for this one? did you also include MAU and DAU/MAU ratio by any chance? – dim_user Jun 22 '19 at 04:01
  • Also, when you roll up the DAU by quarter, what would this be? the average or the last value of your quarter? – dim_user Jun 23 '19 at 06:17

2 Answers2

4

Your model seems correct, it answers the questions on the graph you posted.

It could make sense to aggregate those two fact tables into one fact table joined with a "UserAction" dimension, mostly because a login can be interpreted as just another user action.

However, having separate fact tables focused on one metric (or process) may be preferable because it enables you to introduce measures/metrics into the tables, i.e. when your fact tables stop being factless. It also spares you a join with another dimension (UserAction) but that is becoming a bit less relevant these days, where storage and DB processing power are just getting cheaper.

jmng
  • 2,479
  • 1
  • 25
  • 38
2

You should keep the data on different tables to make sure you dont mix different grains.

user_login_fact_schema can be a materalized view based on user_activity_fact_schema filtering for activity type=login and including some logic to exclude duplicates (i.e. one login per user per day, if you are talking about daily active users)

Victor HDC
  • 525
  • 1
  • 6
  • 12
  • 1
    @user:1098559 thanks for your reply. I don't think this will break the grain. What I meant is have 'login' as a new activity_type in user_activity_fact_schema table that way they can stay together in 1 table without violating the grain. – Saurabh Agrawal Mar 20 '18 at 23:09
  • 1
    If you have several logins per day I would consider that a mixed grain of "login events" + "distinct daily logins". It can work but you would probably end up using costly count distincts or window functions to calculate #active users. – Victor HDC Mar 21 '18 at 16:28