6

I'm trying to model my data warehouse using a star schema but I have a problem to avoid joins between fact tables.
To give a trivial idea of my problem, I want to collect all the events who occur on my operating system. So, I can create a fact table event with some dimensions like datetime or user. The problem is I want to collect different kinds of event: hardware event and software event.
The problem is those events have not the same dimensions. By instance, for a hardware event, I can have physical_component or related_driver dimensions and, for a software event, software_name or online_application dimensions (that is just some examples, the idea to keep in mind is the fact event can be specialized into some specific events with specific dimensions).
In a relational model, I would have 3 tables organized like that: enter image description here
The problem is : how to handle joins between fact tables in a star schema?


I imagined 4 ideas but I'm not sure one of them are adapted to the situation.
The first one is to keep the model used in a relational database and add the dimension tables like that:
enter image description here
In this case, the problem is we still have join between fact tables and need to use JOIN SQL statement in all of our queries.
The second one is to create only 2 fact tables who will duplicate the shared dimensions (datetime and user) and to create a materialized view event who summarized all the events:
enter image description here
The problem here is: what happen if I want to make a query on the materialized view? According to what I read in the Oracle documentation, we don't have to make query directly on materialized view but we have to let the query rewrite process make its work.
The third one is to create only one fact table who will contain all the information possible for an event (hardware or software):
enter image description here
This time, the problem is my fact table will contain a lot of NULL value.
And the last one is to create 3 fact tables (without materialized view this time) like this:
enter image description here
This time, the problem is all events are present in the fact table event and in its own table. Because we will store a huge quantity of data, I'm not sure this duplication is a good idea.
So what is the best solution? Or does it exist a fifth solution?
Pierre
  • 1,942
  • 3
  • 23
  • 43
  • How similar are the software and hardware events in their data structure (content-wise, not column names)? If similar you could just use a single dimension like "Source" with some shared columns and some non-shared that are set to 'n/a' for the type they don't apply to. – Cyrus Mar 06 '17 at 16:22
  • @Cyrus Software and hardware events are totally different in their structure (number of column) and in their content (no link between the column of each table). All the shared columns are in the event table. – Pierre Mar 06 '17 at 16:26

4 Answers4

4

From your description and your subsequent comments to other answers, I'd say that option 2 or option 4 are the right way to model things from a dimensional modelling perspective. Each fact should be a measure of a business process, and the dimensionality of software and hardware events seems to be sufficiently different that they warrant being stored separately.

Then, there's a case for also storing the separate events table as a view, materialised view, or plain-ol' table storing the things that are common.

Once you've decided that's the right way to model things 'logically', you then need to balance performance, maintainability, usability and storage. For dimensional modelling, usability and performance of queries take top priority (otherwise you may as well not use a dimensional model at all), and the extra work in ETL, and extra space needed, are prices worth paying.

A non-materialised view would save you the space at the price of performance, but it could be that you could give it a sufficiently awesome index or two that would mitigate that. A materialised view will give you performance at the price of storage.

I'd be tempted to create the two fact tables with indexes and a non-materialised view, and see what performance of that is like before taking further performance enhancing steps. 10 million fact rows isn't so bad, it might still perform.

A materialized view can be queried directly. But if you want to, you can use the query rewrite capabilities of Oracle so that the Materialized view is instead used as a performance-enhancer, like an index, when you're querying the original tables. See here for details: http://www.sqlsnippets.com/en/topic-12918.html Whether you choose to use it in query rewrite mode or just as a view in its own right depends on whether you want the users to know about this extra table, or for it to just sit in the background as a helpful friend.

Rich
  • 2,207
  • 1
  • 23
  • 27
  • According to your last sentence, you prefer to choose the option 2. But what if a non-materialized view give me bad performance (high performance is the key of my problem)? I guess I should create a materialized view. But because it's not possible to query a materialized view directly, how can I be sure my queries will be rewritten to use this materialized view? Or, if performance is bad, I should use the fourth model? – Pierre Mar 07 '17 at 09:42
  • Yes, if a non materialized view gives you bad performance, I'd agree, a materialized view could be created. A materialized view *can* be queried directly. However, if you *choose* to, you can set up the materialized view so that the information in the materialized view can be used to satisfy queries to the original tables, with no changes to the queries. This way, your queries would stay the same and the materialized view would be used automatically to help make the query faster. For more of a description see this: http://www.sqlsnippets.com/en/topic-12918.html. I will edit answer. – Rich Mar 07 '17 at 12:22
0

There doesn't seem to be a reason in your scenario to combine or link the two types of events. Having said that, you may have some reason you did not describe (for example, collecting logs from many systems and wanting to see them together easily).

So my advice is to make a single fact table with both hardware and software dimension keys. One of them is always going to be 0 or -1 (= default 'n/a' record).

This allows you to aggregate them together without UNION statements or other complicated logic and can even support events that are linked to both hardware and software if they appear in the future.

Cyrus
  • 2,135
  • 2
  • 11
  • 14
  • That's the idea I present as my third choice. But is it a problem to have a lot of NULL value? You purpose to replace them by 0 or -1 but, according to this answer : http://stackoverflow.com/a/8310891/7334162 , it doesn't seem to be a good idea for future calculation, right? – Pierre Mar 07 '17 at 07:54
  • @Pierre Measures should be NULL whenever they are not present in the source data, but dimension keys should never be, there a default record makes much more sense. – Cyrus Mar 07 '17 at 08:00
  • Don't you think it's going to take much more time when I will make a request on just one kind of event (hardware, for instance) with this model rather than have one fact table per kind of event? Or do you think it's better to create materialized view to distinguish hardware or software events? – Pierre Mar 07 '17 at 08:04
  • Databases are optimized to deal with this kind of query, so it shouldn't matter unless you have millions/billions of rows. All these joins do need proper primary key and index definitions though. – Cyrus Mar 07 '17 at 08:08
  • I plan to store more than 10 millions of record. I'm looking for the most optimized model to reduce at its maximum each request I will make. – Pierre Mar 07 '17 at 08:14
0

You would never/rarely join fact tables together. You may join aggregated facts which share (conformed) dimensions (i.e. Number of software events per hour compared with number of hardware events per hour).

To me, you always have to consider the kinds of questions that are going to be asked when looking at dimensional modeling.

Cade Roux
  • 88,164
  • 40
  • 182
  • 265
  • In my case, I will query the database to get information about all the events (hardware and software), that's why I need a global event fact table ; but I will also query the database to get information about specific event (hardware or software), that's why I prefer to have separated fact tables. After that, I want to create materialized view to get aggregated information. That's why I still don't which model is better in my case. – Pierre Mar 07 '17 at 07:52
  • I would model with two fact tables. Remember to also keep in mind that the dimensions are put into tables more pragmatically according to data behavior, so don't forget the concepts of junk dimensions and degenerate dmensions. – Cade Roux Mar 07 '17 at 08:24
  • What has a degenerate dimension got to do with this? – Rich Mar 07 '17 at 08:41
  • In deciding to have two separate fact tables there are likely going to be many more distinct differences in the degenerate dimensions when it comes to the whole model that isn't shown here. To me the crux of the modeling question is hidden behind what it means to "get information about all the events" or "need a global event table". Of course any model will have all the data. The question is how to arrange the data so that the generally expected use cases perform. When aggregating, how are they used? When drilling down, how are they used? – Cade Roux Mar 07 '17 at 08:51
  • To elaborate on the degenerate dimensions, a source for software events might have a lot of stack traces and text data dumping states of variables etc. For hardware, likely to have more text of hardware state and configuration data. I am trying to understand where there is a case for combining them in the fact table because they will be grouped together in the same analysis in anything other than a conformed dimension like user or date or business unit/department location. – Cade Roux Mar 07 '17 at 08:57
0

Events should be a single fact. If you split them in two, you'll have a difficult time doing aggregations across both.

If necessary, you can have separate hardware and software attribute dimensions, but you should have a generic event dimension, even if it is just a junk dimension with a few simple attributes, e.g. type (hardware/software), criticality (high, low), etc.

On a side note, I've generally seen the diagrams with the arrows coming from the fact going to the dimensions. The fact table keys look at the dimensions rather then the other way around.

Wes H
  • 4,186
  • 2
  • 13
  • 24
  • I'm not sure create a junk dimension will be enough because there are a lot of specific dimensions for hardware only and a lot of other specific dimensions for software only (and not just discrete variables). About diagrams, you're right, my mistake. – Pierre Mar 07 '17 at 14:49