I am loading data on a daily basis into a Data Vault model on a Snowflake data warehouse.
I have split the ingest script (a JavaScript stored procedure) into three main parts for logging purposes:
- Getting the data from the stage into a temporary table
- A metadata part, where I add data into several hubs and links that hold the metadata of each row (location, data entry name, …)
- Loading the main data, i.e. indicators and their values, into specific hubs and satellites.
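For context, part 1 boils down to creating the temporary table and copying the staged file into it, roughly along these lines (the stage, table and column names here are simplified placeholders, not my real objects):

    -- part 1: land the staged daily file in a temporary working table
    CREATE OR REPLACE TEMPORARY TABLE TMP_LOAD (
        location_code    VARCHAR,
        data_entry_name  VARCHAR,
        indicator_code   VARCHAR,
        indicator_value  VARCHAR
    );

    COPY INTO TMP_LOAD
    FROM @MY_STAGE/daily_file.csv
    FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);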
Here is the average loading time of each part for a data file of around 2,000 rows with ~300k indicator values:
- 3 to 5 seconds to get the data from the stage into the temporary table
- 19 to 25 seconds to add the metadata into 9 hubs, satellites and links
- 3 to 5 seconds to add 2,000 rows into a hub and then 300k values into a satellite and a link.
For part 2, since I am using MERGE statements, it takes the same time whether or not there is anything to insert.
This is what puzzles me: loading hundreds of thousands of records takes a few seconds, while merging into a few hubs (tables), inserting only the values that are not already there, takes far longer.
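For reference, each of the part-2 statements is a MERGE of roughly this shape (HUB_LOCATION, the hash key derivation and the column names are simplified placeholders for my real ones):

    MERGE INTO HUB_LOCATION h
    USING (
        -- one candidate row per business key in the current file
        SELECT DISTINCT
               MD5(location_code) AS location_hk,
               location_code
        FROM TMP_LOAD
    ) s
        ON h.LOCATION_HK = s.location_hk
    WHEN NOT MATCHED THEN
        INSERT (LOCATION_HK, LOCATION_CODE, LOAD_DTS, RECORD_SOURCE)
        VALUES (s.location_hk, s.location_code, CURRENT_TIMESTAMP(), 'daily_file');

The same pattern is repeated for each of the 9 metadata hubs, satellites and links, which is where the 19 to 25 seconds go.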
Can I replace all the MERGE statements for the tables in part 2 with a single conditional multi-table insert, something like:

    insert all
        into table1 … where …
        into table2 … where …

Could an insert with conditions similar to the WHEN NOT MATCHED branch of the MERGE reduce the ingest time, taking into consideration that the condition for each table would contain a subquery to ensure existing data is not added again?
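Here is a sketch of what I have in mind, using the same simplified placeholder names. As far as I know, the WHEN conditions of a multi-table insert can only reference columns of the source query, so the "does this key already exist" check is computed once as a flag in the source SELECT (via left joins to the hubs) instead of a subquery per table:

    INSERT ALL
        -- insert a location hub row only when the key is not already in HUB_LOCATION
        WHEN loc_exists = 0 THEN
            INTO HUB_LOCATION (LOCATION_HK, LOCATION_CODE, LOAD_DTS, RECORD_SOURCE)
            VALUES (location_hk, location_code, load_dts, record_source)
        -- the same pattern repeated for the other metadata hubs and links
        WHEN de_exists = 0 THEN
            INTO HUB_DATA_ENTRY (DATA_ENTRY_HK, DATA_ENTRY_NAME, LOAD_DTS, RECORD_SOURCE)
            VALUES (data_entry_hk, data_entry_name, load_dts, record_source)
    SELECT DISTINCT
           MD5(t.location_code)                AS location_hk,
           t.location_code,
           MD5(t.data_entry_name)              AS data_entry_hk,
           t.data_entry_name,
           CURRENT_TIMESTAMP()                 AS load_dts,
           'daily_file'                        AS record_source,
           IFF(hl.LOCATION_HK   IS NULL, 0, 1) AS loc_exists,
           IFF(hd.DATA_ENTRY_HK IS NULL, 0, 1) AS de_exists
    FROM TMP_LOAD t
    LEFT JOIN HUB_LOCATION   hl ON hl.LOCATION_HK   = MD5(t.location_code)
    LEFT JOIN HUB_DATA_ENTRY hd ON hd.DATA_ENTRY_HK = MD5(t.data_entry_name);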
I was reading this article on optimizing loads into a Data Vault model, along with its related scripts on GitHub, but I am still not sure it will reduce the ingest time in an efficient way.