I need to change our data loads from full load into incremental and with this change we are going to rebuild whole ETL process.
Here’s how data source infrastructure looks like now:
We are having two production servers (for two different products), let’s say P1 and P2. On P1 our data sources are two databases: DB1 and DB2 located on the linked server S1. On P2 there is only one database DB3 on linked server S1. In the future there will be added another database DB4 for P2 and product P3 added. There are SQL views on server S1 presenting all of the data.
And our ETL:
Two different SSIS projects for P1 and P2, which actually differs only in connection strings. DB1 and DB2 merged by Union ALL SSIS component directly in Data Flow tasks. Currently SSIS packages are executing SQL queries stored inside tasks, change in ETL for P1 is results in redoing same change in ETL for P2. Data is loaded twice a day on P1 and every 5 minutes on P2, with data load everything is truncated and loaded into staging tables in both data warehouses.
Goal:
Our goal is to create one universal ETL process with parametrization that allows us to use DB3 when SQL is executed on server P2 and DB1+DB2 when it’s executed on server P1, with possibility to extend this to P3+DB4 and DB3. We’d also like to move SQL code from packages into Stored Procedures, so maintaining would be easier from developer perspective.
We also need to make ETL happen more often on P1 but at the same time we are not allowed to query whole dataset multiple times in short period on linked server this will be a problem on P2 once the dataset grows larger with time.
Things we want to avoid: Dynamic SQL.
What are the best practices to create incremental data loads and such parametrization in SSIS? We are in constant contact with developer responsible for server S1 and if we need any kind of view, he will be able to deliver it.