3

I am developing a multi-step data pipeline that should optimize the following process:

1) Extract data from a NoSQL database (MongoDB).

2) Transform and load the data into a relational (PostgreSQL) database.

3) Build a data warehouse using the Postgres database

I have manually coded a script to handle steps 1) and 2), which is an intermediate ETL pipeline. Now my goal is to build the data warehouse using the Postgres database, but I came across with a few doubts regarding the DW design. Below is the dimensional model for the relational database:

enter image description here

There are 2 main tables, Occurrence and Canonical, from which inherit a set of others (drawn in red and blue, respectively). Note that there are 2 child data types, ObserverNodeOccurrence and CanonicalObserverNode, that have an extra many-to-many relationship with another table.

I made some research regarding how inheritance should be implemented in a data warehouse and figured the best practice would be to merge together the family data types (super and child tables) into a single table. Doing this would imply adding extra attributes and a lot of null values. My new dimensional model would look like the following:

enter image description here

Question 1: Do you think this is the best approach to address this problem? If not, what would be?

Question 2: Any software recommendations for on-premise data warehouses? (on-premise is a must since it contains sensitive data)

GRoutar
  • 1,311
  • 1
  • 15
  • 38
  • Possible duplicate of [How do you effectively model inheritance in a database?](https://stackoverflow.com/questions/190296/how-do-you-effectively-model-inheritance-in-a-database) – philipxy Jul 01 '19 at 20:31
  • @philipxy this is a "best design approach for a data warehouse based on a relational database that implements inheritance". Slightly different. – GRoutar Jul 02 '19 at 08:41

2 Answers2

2

Usually having fewer tables to join and denormalizing data will improve query performance for data warehouse queries, so they are often considered a good thing.

This would suggest your second table design. NULL values don't occupy any space in a PostgreSQL table, so you need not worry about that.

Laurenz Albe
  • 209,280
  • 17
  • 206
  • 263
1

As described here there are three options to implement inheritance in a relational database.

IMO the only practicable way to be used in data warehouse is the Table-Per-Hierarchy option, which merges all entities in one table.

The reason is not only the performance gain by saving the joins. In data warehouse often the historical view of the data is important. Think, how would you model a change in a subtype in some entity?

An important thing is to define a discriminator column which uniquely defines the source entity.

Marmite Bomber
  • 19,886
  • 4
  • 26
  • 53