I am loading data on a daily basis into a Data Vault model on a Snowflake data warehouse.

I have split the ingest script (a JavaScript stored procedure) into three main parts for logging purposes:

  1. Getting data into a temporary table
  2. A metadata part, where I add data into several hubs and links that hold metadata for each row (location, data entry name…)
  3. Loading the main data, holding indicators and their values, into specific hubs and satellites.

Here is the average loading time of each part for a data file with around 2,000 rows and ~300k indicator values:

  1. 3 to 5 seconds for adding data from the stage into the temporary table
  2. 19 to 25 seconds for adding metadata into 9 hubs, satellites and links
  3. 3 to 5 seconds for adding 2,000 rows into a hub and then 300k values into a satellite and a link.

For part 2, since I am using MERGE statements, it takes the same time whether or not there is anything to insert.
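
For reference, each of those part 2 MERGE statements looks roughly like this (a simplified sketch; `hub_location`, `tmp_stage_data` and the column names are illustrative, not my real schema):

```sql
-- Simplified sketch of one part 2 hub MERGE (illustrative table/column names)
MERGE INTO hub_location AS h
USING (
    SELECT DISTINCT location_hash_key, location_code, load_date, record_source
    FROM tmp_stage_data
) AS src
    ON h.location_hash_key = src.location_hash_key
WHEN NOT MATCHED THEN
    INSERT (location_hash_key, location_code, load_date, record_source)
    VALUES (src.location_hash_key, src.location_code, src.load_date, src.record_source);
```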

Many things come to mind here: loading thousands of records takes only a few seconds, while merging into a few hubs (tables), inserting only when the value is not already there, takes far longer.

Can I replace all the MERGE statements on the tables related to part 2 with a single conditional

insert all 

into table1 … where…

into table2 … where …

Can an INSERT with conditions similar to the WHEN NOT MATCHED clause of a MERGE statement reduce the ingest time, taking into consideration that the WHERE clause for each table will contain a SELECT subquery to ensure existing data is not added again?
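
What I have in mind is roughly the following (a rough sketch with made-up table and column names; I have pushed the existence checks into the source SELECT as anti-joins rather than subqueries in the WHEN conditions):

```sql
-- Rough sketch of a conditional multi-table insert (illustrative names only);
-- the "does it already exist?" checks are done as LEFT JOINs in the source query.
INSERT ALL
    WHEN is_new_location THEN
        INTO hub_location (location_hash_key, location_code, load_date, record_source)
        VALUES (location_hash_key, location_code, load_date, record_source)
    WHEN is_new_entry THEN
        INTO hub_data_entry (entry_hash_key, entry_name, load_date, record_source)
        VALUES (entry_hash_key, entry_name, load_date, record_source)
SELECT
    t.location_hash_key,
    t.location_code,
    t.entry_hash_key,
    t.entry_name,
    t.load_date,
    t.record_source,
    (hl.location_hash_key IS NULL) AS is_new_location,
    (he.entry_hash_key IS NULL)    AS is_new_entry
FROM tmp_stage_data t
LEFT JOIN hub_location   hl ON hl.location_hash_key = t.location_hash_key
LEFT JOIN hub_data_entry he ON he.entry_hash_key    = t.entry_hash_key;
-- Note: staged rows would still need de-duplication per business key before this.
```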

I was reading this article on optimizing loads into a Data Vault model, along with its related scripts on GitHub, but I am still concerned about whether the ingest time can be reduced in an efficient way.


1 Answer


Admirable as Galavan's article is, it comes with some fatal flaws around loading to the same hub in the case of same-as links or hierarchical links: you will load duplicates. I would discourage you from using multi-table inserts (MTI) to load hubs, links and satellites. For analysis and testing on this, please visit: https://patrickcuba.medium.com/data-vault-test-automation-52c0316e8e3a
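
To make the duplicate problem concrete, here is a contrived sketch (hypothetical employee/manager names, not your model): both branches of the multi-table insert target the same hub, and the anti-joins only see the hub as it was before the statement ran, so a new key that appears on both sides of the relationship lands in the hub twice.

```sql
-- Contrived sketch (hypothetical names): both branches target hub_employee.
-- The LEFT JOINs see hub_employee as of the start of the statement only,
-- so a brand-new key appearing as both employee and manager is inserted twice.
INSERT ALL
    WHEN is_new_employee THEN
        INTO hub_employee (employee_hk, employee_id) VALUES (employee_hk, employee_id)
    WHEN is_new_manager THEN
        INTO hub_employee (employee_hk, employee_id) VALUES (manager_hk, manager_id)
SELECT
    s.employee_hk, s.employee_id,
    s.manager_hk,  s.manager_id,
    (e.employee_hk IS NULL) AS is_new_employee,
    (m.employee_hk IS NULL) AS is_new_manager
FROM staged_org s
LEFT JOIN hub_employee e ON e.employee_hk = s.employee_hk
LEFT JOIN hub_employee m ON m.employee_hk = s.manager_hk;
```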

That's not to say MTIs don't have a place in Data Vault; they do! In the case of loading logarithmic PIT structures, absolutely! An in-depth article on this is published here: https://patrickcuba.medium.com/data-vault-pit-flow-manifold-e2b68df26628

Now, the merge-vs-insert conversation in particular should not be part of a Data Vault 2.0 vocabulary at all, because DV 2.0 is INSERT-ONLY. I did another piece focusing on hashing, but it includes a segment on what happens at the micro-partition level in Snowflake that should discourage you from using MERGE INTO; visit here: https://patrickcuba.medium.com/data-vault-2-0-on-snowflake-5b25bb50ed9e
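
For comparison, an insert-only hub load is just an INSERT … SELECT with an anti-join against the target, one statement per hub; a minimal sketch with illustrative names:

```sql
-- Minimal insert-only hub load sketch (illustrative names); no MERGE, no updates.
INSERT INTO hub_location (location_hash_key, location_code, load_date, record_source)
SELECT DISTINCT
    s.location_hash_key,
    s.location_code,
    s.load_date,
    s.record_source
FROM tmp_stage_data s
WHERE NOT EXISTS (
    SELECT 1
    FROM hub_location h
    WHERE h.location_hash_key = s.location_hash_key
);
```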

Seeing as you are building out your own DV automation tool, these two blogs are worth a read too:

  • That's so interesting. As you said, DV 2.0 is more about inserts and PITs. However, there is a lot of repetitive data that does not relate to time but rather to the metadata of the data, so adding everything or turning all objects into PITs would ruin the idea behind what we are building. I will go through each part and link in your answer and get back to you with more questions. Anyway, I have a more generic question: why does it take the same time to load 1 row as 1,000 rows? Is that a good thing, or is there a flaw in my script when dealing with less data? – alim1990 Nov 13 '21 at 07:27
  • Just to make my point clear: the tables I need to run an MTI into only have `when not matched then` conditions, so there is no extensive querying while adding. – alim1990 Nov 15 '21 at 07:03