I am building a star schema for an online business. One of the key processes is email newsletter signup.
But the analysis depends on two processes and I can't figure out how to model it the best way.
Here's how the process works:
- Person visits website
- Person fills out web form and is recorded as a contact in our CRM
- Person receives a link asking him to confirm if this is really his email
- Person clicks the link and is considered confirmed
- Person can now receive emails from us
The signup and confirmation process take place at different times. Most people click the confirmation link on the same day, but we send two follow up email over a few days after the signup so some people may confirm their email only after a few days.
On top of that a person could signup several times on the website. Most of our signups are people who exchange their email address in return for some sort of resource like an eBook.
As long as the person's email is not marked confirmed in our system, we ask the person to confirm on each signup.
Since we have multiple offers it's not uncommon for a person to request eBook A, eBook B and eBook C and only confirm after several signups.
In the fact table each signup for emails that are unconfirmed yet is marked as ConfirmationRequested -> True.
If the person clicks a confirmation link of ANY of the confirmation request emails he should be considered confirmed for each of those signups.
How I want to analyse the data
- See how many signups we had
- See how many signups were re-signups and how many were new contacts in the CRM (new email address)
- See how many new contacts have confirmed their email address (and become full subscribers)
- See how many re-signups were asked to confirm their email and how many have done so
- Analyse how long it takes for people to confirm their email address
- Analyse the confirmation rate
- Filter contacts by their confirmation status and analyse what people who have or have not confirmed have in common
I don't really care about confirmations in isolation from signups.
And for my purposes I would like to have a ConfirmationStatus dimension that is...
- "Confirmed" if the person confirms within 7 days of sign up
- "Pending" if the person hasn't confirmed, but 7 days haven't passed since signup yet
- "Not Confirmed" if the person hasn't confirmed within 7 days (even if the person does confirm at some later point)
On top of that I usually look at this report on Mondays to analyse the previous week and compare it to other weeks. (I already have a working version of this report in a flat table, but I am trying to learn how to build proper star schemas.)
This has the additional challenge that contacts that signed up on Sunday for example only had less than a day to confirm and would drag down the confirmation rate and the latest week would look bad if compared to previous week where all contacts had the full 7 days to confirm.
So I calculate a "Confirmed within signup week" confirmation count and rate for all weeks to allow apples to apples comparisons.
How to model this...
I have considered the following options...
Option #1: Separate fact tables Since these are separate processes that happen at separate times I have learned that I should create separate fact tables and then drill across common dimensions.
I could calculate signups that requested confirmations from the signup table and then calculate confirmations within a week of the signup through the contact and date dimensions.
But that wouldn't allow me to filter the signups by confirmation status.
That's why I am considering...
Option 2: A fact table that combines both signups and confirmations
I am thinking of something like this:
| Dim Signup Info | | | Dim Contact | | | Fact Signups | |
|-----------------------|------|---|-------------|------|---|----------------------|----|
| SignupInfoKey | SK | | ContactKey | SK | | SignupDateKey | FK |
| SignupType | SCD1 | | Name | SCD1 | | ConfirmationDate | FK |
| ConfirmationRequested | SCD1 | | Email | SCD1 | | SignupInfoKey | FK |
| ConfirmationSucceeded | SCD1 | | ... | | | ContactKey | FK |
| ConfirmationStatus | SCD1 | | | | | SignupId | DD |
| | | | | | | SignupDateTime | DD |
| | | | | | | ConfirmationDateTime | DD |
| | | | | | | Signups | M |
| | | | | | | NewContacts | M |
| | | | | | | ConfirmationMin | M |
| | | | | | | ConfirmationDays | M |
I need the ConfirmationDate in the fact to calculate the "Confirmed Within Week" measures at report time (I am using powerbi and it's easy there). I could of course also create a dimension "ConfirmedWIthinWeek" and then filter based on that, but it won't be as flexible... What if I decide later on to look at the data on a daily or monthly basis for example?
Another concern is that it will require to reprocess and update the fact tables on each incremental load for the past 7 days.
I know that's ok for dimensions, but is that ok for fact tables too?
So my questions are
- Is option #2 a good solution or is there a better way to do this?
- Is it ok to update fact tables or is that discouraged?
Overall my question is: What am I missing?
This seems like a very common thing. One obvious example would be an order star that has fact table columns for AmountOrdered, AmountPaid, AmountRefunded and dimensions like "Order Status", "Paid Status" and "Refunded Status".
But none of my searches have resulted in answers to this common problem. Surely there must be a term for the problem and a pattern name for the solution where I can learn more about it?