
I need to change our data loads from full loads to incremental loads, and with this change we are going to rebuild the whole ETL process.

Here’s what the data source infrastructure looks like now:

We have two production servers (for two different products), let’s say P1 and P2. On P1 our data sources are two databases, DB1 and DB2, located on the linked server S1. On P2 there is only one database, DB3, on linked server S1. In the future another database, DB4, will be added for P2, and a product P3 will be added. There are SQL views on server S1 presenting all of the data.

And our ETL:

Two different SSIS projects for P1 and P2, which actually differ only in connection strings. DB1 and DB2 are merged by a Union All SSIS component directly in the Data Flow tasks. Currently the SSIS packages execute SQL queries stored inside the tasks, so any change in the ETL for P1 results in redoing the same change in the ETL for P2. Data is loaded twice a day on P1 and every 5 minutes on P2; with each data load, everything is truncated and loaded into staging tables in both data warehouses.

Goal:

Our goal is to create one universal ETL process with parametrization that allows us to use DB3 when the SQL is executed on server P2 and DB1+DB2 when it’s executed on server P1, with the possibility to extend this to P3 with DB4 and DB3. We’d also like to move the SQL code from the packages into stored procedures, so that maintenance would be easier from a developer’s perspective.
We also need to make the ETL run more often on P1, but at the same time we are not allowed to query the whole dataset multiple times in a short period over the linked server; this will be a problem on P2 once the dataset grows larger over time.

Things we want to avoid: Dynamic SQL.

What are the best practices for creating incremental data loads and this kind of parametrization in SSIS? We are in constant contact with the developer responsible for server S1, and if we need any kind of view, he will be able to deliver it.

Dodzik
  • Use staging tables: load all the data, then use SQL to do the incremental work. If you have a table with many rows, use a parametrized SELECT in your SSIS so you don't have to load the full table every time. In this last case you will have to know how to retrieve the new, latest rows, and that depends on the data stored (created or modified date, identities, etc.). Once the operation ends, register the exact datetime of the execution so you can start from that point onwards on the next run (I usually subtract a few hours to be sure); a sketch of this pattern follows these comments. This approach needs all your tables to have keys! – EzLo Jun 05 '18 at 09:00
  • You have a package PK1 with a Data Flow task, and currently it has two OLE DB Sources, DB1 and DB2. It unions the data sets together and off it goes to... S7, DB7. You also have a package PK2, which looks like PK1 less the unioning of data sets, yeah? – billinkc Jun 05 '18 at 17:17
  • Assuming the above is correct, the desire is to have a package that pulls data from N data sources (since the data model is _consistent_ across these database instances) and shoves it into our destination – billinkc Jun 05 '18 at 17:18
  • Would it be acceptable to rework Package1 into two data flows: one pulls the data from DB1 and then comes back in and pulls the data from DB2? The net effect is that the same(ish) data is brought to the target, but it greatly simplifies the problem from an SSIS perspective – billinkc Jun 05 '18 at 17:20
  • @billinkc you are right, but is it possible to achieve such a thing in SSIS? I mean the part "One pulls the data from DB1 and then comes back in and pulls the data from DB2". It should depend on the server it's executed on. – Dodzik Jun 06 '18 at 12:59
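
A minimal sketch of the watermark pattern described in the first comment, assuming a hypothetical dbo.EtlWatermark bookkeeping table and a dbo.Orders source with a ModifiedUtc column (all names are illustrative):

    -- Hypothetical objects: dbo.EtlWatermark (TableName, LastRunUtc) and
    -- dbo.Orders with a ModifiedUtc column.
    DECLARE @runStart datetime2 = SYSUTCDATETIME();  -- capture before extracting

    DECLARE @since datetime2;
    SELECT @since = DATEADD(HOUR, -3, LastRunUtc)    -- subtract a few hours, to be safe
    FROM dbo.EtlWatermark
    WHERE TableName = N'Orders';

    -- Parameterized source query: only rows created/changed since the last run.
    SELECT OrderId, CustomerId, Amount, ModifiedUtc
    FROM dbo.Orders
    WHERE ModifiedUtc >= @since;

    -- Once the load succeeds, persist this run's start time for the next run.
    UPDATE dbo.EtlWatermark
    SET LastRunUtc = @runStart
    WHERE TableName = N'Orders';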

1 Answer


The general pattern I would take would be something like this.

My control flow will identify the databases on the server associated with our project (Connection Manager = Source).

[Screenshot: the control flow]

Here I show a query against sys.databases because you may be able to apply criteria like AND D.name IN ('DB1', 'DB2', 'DB3');

For the P1 project, that query would return 2 values; for P2, only 1.
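
A rough sketch of that discovery query (the alias D matches the criteria above; the ONLINE filter and the exact IN list are assumptions to adjust per environment):

    SELECT D.name AS DatabaseName
    FROM sys.databases AS D
    WHERE D.state_desc = 'ONLINE'               -- assumption: skip offline databases
      AND D.name IN ('DB1', 'DB2', 'DB3');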

We'd use that list of databases as the source of a ForEach Loop enumerator to "shred" the results. For each value identified by the original query (DB1, DB2), we're going to update the InitialCatalog property of our Source connection manager. In the reference answers below I set the ConnectionString property, but here you will only want to modify the InitialCatalog. So on each pass through the loop, the database pointed to will change.
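
One way to wire that up (an assumption on my part, since the exact mechanism isn't shown here) is to map each row from the enumerator into a string variable, say User::CurrentDatabase, and put a property expression on the Source connection manager:

    InitialCatalog  =  @[User::CurrentDatabase]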

The Data Flow inside the ForEach enumerator is then simplified to only deal with the current database and not have to worry about whether this server has 3 source databases or 1.

[Screenshot: the simplified Data Flow]

Caveats

The source query and data types must be compatible across all the associated databases. The structure of a Data Flow is set at design-time and cannot be changed during run-time.

If the entities are consistent across the databases and it's just the columns being named differently, create a view in each database to ensure the entity's name and shape are consistent; then you can avoid dynamic SQL.
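
For example, a normalizing view created in each source database might look like the following (all object and column names are hypothetical; the left-hand names stand in for whatever each database calls them locally):

    CREATE VIEW dbo.vw_SourceOrders
    AS
    SELECT
        order_id    AS OrderId,      -- local name on the left,
        cust_no     AS CustomerId,   -- agreed consistent name on the right
        modified_dt AS ModifiedUtc
    FROM dbo.Orders;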

You will need to provide the initial value of the Source connection string when the package begins. This can be accomplished via the /SET option on invocation.
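
For example, a package-deployment dtexec invocation could look like this (the package name, the connection manager name Source, and the connection string are placeholders):

    dtexec /F "LoadProduct.dtsx" /SET \Package.Connections[Source].Properties[ConnectionString];"Data Source=S1;Initial Catalog=DB1;Provider=SQLNCLI11;Integrated Security=SSPI;"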

Reference answers

Some relevant SSIS answers that explore these concepts

billinkc