2

I am building a star schema for an online business. One of the key processes is email newsletter signup.

But the analysis depends on two processes and I can't figure out how to model it the best way.

Here's how the process works:

  • Person visits website
  • Person fills out web form and is recorded as a contact in our CRM
  • Person receives a link asking him to confirm if this is really his email
  • Person clicks the link and is considered confirmed
  • Person can now receive emails from us

The signup and confirmation process take place at different times. Most people click the confirmation link on the same day, but we send two follow up email over a few days after the signup so some people may confirm their email only after a few days.

On top of that a person could signup several times on the website. Most of our signups are people who exchange their email address in return for some sort of resource like an eBook.

As long as the person's email is not marked confirmed in our system, we ask the person to confirm on each signup.

Since we have multiple offers it's not uncommon for a person to request eBook A, eBook B and eBook C and only confirm after several signups.

In the fact table each signup for emails that are unconfirmed yet is marked as ConfirmationRequested -> True.

If the person clicks a confirmation link of ANY of the confirmation request emails he should be considered confirmed for each of those signups.

How I want to analyse the data

  • See how many signups we had
  • See how many signups were re-signups and how many were new contacts in the CRM (new email address)
  • See how many new contacts have confirmed their email address (and become full subscribers)
  • See how many re-signups were asked to confirm their email and how many have done so
  • Analyse how long it takes for people to confirm their email address
  • Analyse the confirmation rate
  • Filter contacts by their confirmation status and analyse what people who have or have not confirmed have in common

I don't really care about confirmations in isolation from signups.

And for my purposes I would like to have a ConfirmationStatus dimension that is...

  • "Confirmed" if the person confirms within 7 days of sign up
  • "Pending" if the person hasn't confirmed, but 7 days haven't passed since signup yet
  • "Not Confirmed" if the person hasn't confirmed within 7 days (even if the person does confirm at some later point)

On top of that I usually look at this report on Mondays to analyse the previous week and compare it to other weeks. (I already have a working version of this report in a flat table, but I am trying to learn how to build proper star schemas.)

This has the additional challenge that contacts that signed up on Sunday for example only had less than a day to confirm and would drag down the confirmation rate and the latest week would look bad if compared to previous week where all contacts had the full 7 days to confirm.

So I calculate a "Confirmed within signup week" confirmation count and rate for all weeks to allow apples to apples comparisons.

How to model this...

I have considered the following options...

Option #1: Separate fact tables Since these are separate processes that happen at separate times I have learned that I should create separate fact tables and then drill across common dimensions.

I could calculate signups that requested confirmations from the signup table and then calculate confirmations within a week of the signup through the contact and date dimensions.

But that wouldn't allow me to filter the signups by confirmation status.

That's why I am considering...

Option 2: A fact table that combines both signups and confirmations

I am thinking of something like this:

| Dim Signup Info       |      |   | Dim Contact |      |   | Fact Signups         |    |
|-----------------------|------|---|-------------|------|---|----------------------|----|
| SignupInfoKey         | SK   |   | ContactKey  | SK   |   | SignupDateKey        | FK |
| SignupType            | SCD1 |   | Name        | SCD1 |   | ConfirmationDate     | FK |
| ConfirmationRequested | SCD1 |   | Email       | SCD1 |   | SignupInfoKey        | FK |
| ConfirmationSucceeded | SCD1 |   | ...         |      |   | ContactKey           | FK |
| ConfirmationStatus    | SCD1 |   |             |      |   | SignupId             | DD |
|                       |      |   |             |      |   | SignupDateTime       | DD |
|                       |      |   |             |      |   | ConfirmationDateTime | DD |
|                       |      |   |             |      |   | Signups              | M  |
|                       |      |   |             |      |   | NewContacts          | M  |
|                       |      |   |             |      |   | ConfirmationMin      | M  |
|                       |      |   |             |      |   | ConfirmationDays     | M  |

I need the ConfirmationDate in the fact to calculate the "Confirmed Within Week" measures at report time (I am using powerbi and it's easy there). I could of course also create a dimension "ConfirmedWIthinWeek" and then filter based on that, but it won't be as flexible... What if I decide later on to look at the data on a daily or monthly basis for example?

Another concern is that it will require to reprocess and update the fact tables on each incremental load for the past 7 days.

I know that's ok for dimensions, but is that ok for fact tables too?

So my questions are

  • Is option #2 a good solution or is there a better way to do this?
  • Is it ok to update fact tables or is that discouraged?

Overall my question is: What am I missing?

This seems like a very common thing. One obvious example would be an order star that has fact table columns for AmountOrdered, AmountPaid, AmountRefunded and dimensions like "Order Status", "Paid Status" and "Refunded Status".

But none of my searches have resulted in answers to this common problem. Surely there must be a term for the problem and a pattern name for the solution where I can learn more about it?

user3505886
  • 710
  • 7
  • 7
  • Some general notes: 1. Avoid multiple fact tables if possible; 2. Processes with multiple steps and dates can often be recorded modelled in one record with 'days between' metrics in them. This indeed requires updating the fact table (that's fine) 3. I don't see any analysis required on the actual items requested. Are these mostly irrelevant? That's where I see the complexity - if you need to track detail down to which things were requested and what state they are in, that could get complicated – Nick.Mc Dec 27 '19 at 09:14
  • With regards to your Order example, what stops you from using that model? It seems to me you just need to add `ConfirmedDate`, `DaysFromSignupToConfirmation`. Then I think you have most of what you need. – Nick.Mc Dec 27 '19 at 09:20
  • With PowerBI you can get some pretty powerful functionality, like this parameter table pattern, where you can define a variable and have it trigger a complex algorithm (rather than boring filters) https://www.daxpatterns.com/parameter-table/ – Nick.Mc Dec 27 '19 at 09:21
  • Thank you so much for your answer. So looks like option #2 is a correct way to model it. After reading “the star schema - the complete reference” it seemed like you should avoid combining processes in one fact table. – user3505886 Dec 28 '19 at 12:02
  • And regarding the order example... I wasn’t sure if that was the correct way to do it. All the examples of order fact tables didn’t have payments either. I would love to be able to browse through the models other people have created with explanations why it was created that way. But haven’t found anything. – user3505886 Dec 28 '19 at 12:03
  • I'm a strong beleiver in star schemas but they aren't perfect and you'll have difficulty finding answer to the edge cases / imperfections. In the Order example (where orders might have zero or many payments), I've seen examples where 1. The order header appears as another record with payments in one fact table, or 2. It's split across payments, or 3. It's done in two facts with an enormous (and pointless) order header dimension to bind the two. I like the first option myself - fact tables are large and unwieldy and I avoid multiple facts if I can. – Nick.Mc Dec 28 '19 at 12:09
  • Data modelling is not a science. I just take the Kimball stuff as thinking points and gotchas but they aren't hard rules. In my opinion you need to keep your eye on the ball - a simple model that's easy to use for an end user. – Nick.Mc Dec 28 '19 at 12:11

0 Answers0