3

We are planning to migrate data from Teradata to Google Cloud (BigQuery). In Teradata we have key concepts like primary and foreign keys, and with the help of these keys we can define the relationship between dimension and fact tables.

Say, for example, I have 3 dimension tables and one fact table, as shown below.

D1 D2 D3

F1

With the help of keys or indexes in Teradata, we are able to fetch the data from the fact table.
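For illustration, a typical lookup today looks roughly like the join below, with the fact table joined to each dimension on its key columns (table and column names are just placeholders, not our real schema):

    -- Placeholder Teradata-style query: fact rows are fetched through the
    -- primary/foreign key columns that link F1 to each dimension.
    SELECT d1.customer_name, d2.product_name, d3.store_region, f.amount
    FROM   f1 AS f
    JOIN   d1 ON f.d1_key = d1.d1_key
    JOIN   d2 ON f.d2_key = d2.d2_key
    JOIN   d3 ON f.d3_key = d3.d3_key;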

When it comes to BigQuery, we do not have any concept like keys or indexes, so how are we going to define the relationship between the dimension and fact tables?

Note: If there is no primary key or index concept, how are we going to eliminate duplicates?

sethu

2 Answers

8

Primary keys, facts, and dimensions are concepts that previous generation data warehouses needed to rely on for performance.

For example, in Teradata a primary index is needed to distribute the data between nodes. That distribution is key to enabling fast-performing queries later - but in the BigQuery world this kind of pre-planned distribution among nodes is not needed.
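As a rough sketch of what that looks like in Teradata DDL (table and column names here are made up), the primary index clause is what decides how rows are hashed across the nodes - and there is simply no equivalent declaration to carry over to BigQuery:

    -- Hypothetical Teradata DDL: the PRIMARY INDEX controls how rows are
    -- distributed across AMPs/nodes. BigQuery has no such clause.
    CREATE TABLE f1 (
      sale_id  INTEGER,
      d1_key   INTEGER,
      amount   DECIMAL(12,2)
    ) PRIMARY INDEX (d1_key);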

Same thing with facts, dimensions, star schemas, OLAP cubes... the main reason for their existence is to enable faster queries later - by restricting what can be queried and among which dimensions. You won't need to worry about this with BigQuery.

Instead of normalizing your data, in BigQuery it makes sense to keep one flat table with all the dimensions incorporated. Arbitrary JOINs will be fast too - but flat tables and nested data are especially easy to handle here.
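As a minimal sketch of that approach (dataset, table, and column names are assumed, not from the question), you could denormalize once at load time and then query only the flat table:

    -- Hypothetical sketch: build a flat, denormalized table in BigQuery by
    -- joining the fact table to its dimensions once; downstream queries then
    -- hit flat_sales only.
    CREATE OR REPLACE TABLE mydataset.flat_sales AS
    SELECT
      f.sale_id,
      f.sale_date,
      f.amount,
      d1.customer_name,
      d2.product_name,
      d3.store_region
    FROM mydataset.f1 AS f
    JOIN mydataset.d1 AS d1 ON f.d1_key = d1.d1_key
    JOIN mydataset.d2 AS d2 ON f.d2_key = d2.d2_key
    JOIN mydataset.d3 AS d3 ON f.d3_key = d3.d3_key;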

Now that you are not restricted by the needs of older technology, removing duplicates becomes a different type of operation.
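For example, without a primary key constraint you can still de-duplicate when querying or rewriting the table; one common pattern (column names assumed) is ROW_NUMBER() over the business key:

    -- Hypothetical sketch: keep one row per business key, preferring the
    -- most recently loaded record.
    CREATE OR REPLACE TABLE mydataset.flat_sales AS
    SELECT * EXCEPT(rn)
    FROM (
      SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY sale_id ORDER BY load_time DESC) AS rn
      FROM mydataset.flat_sales
    )
    WHERE rn = 1;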

Felipe Hoffa
    Great answer! @sethu, if Felipe's answer helped you, please consider marking it as accepted and also voting it up :)! – Willian Fuks Jul 28 '17 at 00:29
1

Think of how you deal with that flat table when it comes to SCD1 and SCD3.

For both of these you need to run updates on the target flat table, or generate a new flat table from scratch, versus just updating the dimension table.
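To make the contrast concrete, here is a rough sketch (table and column names assumed): an SCD Type 1 change stays a small MERGE against the dimension table, while the flat table has to rewrite every row that still carries the old attribute value:

    -- Hypothetical sketch: SCD Type 1 update applied to a small dimension table.
    MERGE mydataset.d1 AS target
    USING mydataset.d1_staging AS source
    ON target.d1_key = source.d1_key
    WHEN MATCHED THEN
      UPDATE SET customer_name = source.customer_name
    WHEN NOT MATCHED THEN
      INSERT (d1_key, customer_name) VALUES (source.d1_key, source.customer_name);

    -- The equivalent change in the flat table touches many more rows.
    UPDATE mydataset.flat_sales AS f
    SET customer_name = s.customer_name
    FROM mydataset.d1_staging AS s
    WHERE f.d1_key = s.d1_key;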

The current generation of DWHs doesn't keep consistency and historical data the way the old DWHs did. The only reason the current generation can still get away with flat tables is that they either assume immutable data or don't follow consistency rules. This will be over in a couple of years, once data outgrows the infrastructure and these ways of modeling again, and we will be back to dimensional models. It is already like that if you work with PBs of data: you don't reload that data over and over, it costs too much, so you split it into incremental batches and split the immutable from the non-immutable, a.k.a. facts and dimensions.

I would do it the usual way if you have more than a couple of TB of data, or if you need to load the DWH hourly. If you are below one TB of data and only need daily updates in the DWH, probably the easiest approach is to reload everything every time.
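As a rough sketch of the two options (all names assumed): below ~1 TB a daily full rebuild can be a single statement, while at larger volumes you would append only the newest increment into a partitioned table instead of reloading everything:

    -- Hypothetical sketch: small DWH, daily full reload of the flat table.
    CREATE OR REPLACE TABLE mydataset.flat_sales AS
    SELECT f.sale_id, f.sale_date, f.amount, d1.customer_name
    FROM mydataset.f1 AS f
    JOIN mydataset.d1 AS d1 ON f.d1_key = d1.d1_key;

    -- Larger volumes: append only the newest day into a date-partitioned table.
    INSERT INTO mydataset.flat_sales_partitioned
    SELECT f.sale_id, f.sale_date, f.amount, d1.customer_name
    FROM mydataset.f1 AS f
    JOIN mydataset.d1 AS d1 ON f.d1_key = d1.d1_key
    WHERE f.sale_date = CURRENT_DATE();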