Updates in Redshift are slow because an update is really a sequence of operations performed in a single transaction:
- Select rows to be updated into a temp table
- Delete those rows
- Update those rows in the temp table according to the update condition
- Append the updated rows to the original table
And all that has to be coordinated across nodes.
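In SQL terms, a single-row update behaves roughly like the sketch below. This is a conceptual equivalent, not Redshift's literal internal plan, and the `orders` table with its `order_id` and `status` columns is made up for illustration:

```sql
BEGIN;

-- 1. Select the affected rows into a temp table
CREATE TEMP TABLE orders_to_update AS
SELECT * FROM orders WHERE order_id = 42;

-- 2. Delete those rows from the original table
--    (they are only marked deleted until a VACUUM reclaims the space)
DELETE FROM orders WHERE order_id = 42;

-- 3. Apply the change in the temp table
UPDATE orders_to_update SET status = 'shipped';

-- 4. Append the updated rows back to the original table
INSERT INTO orders SELECT * FROM orders_to_update;

END;
```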
Updating a single row may take as long as updating 1000 rows. What's worse, because updates take so long and require write locks, they block other queries for a long time, significantly affecting overall system performance.
There are 3 ways to make it faster (all from experience):
Avoid updates.
If you have a condition that lets you distinguish new rows from old ones, just append the new rows to the table and add that condition to your queries. You may be surprised to see Redshift performing faster: even though each query becomes a bit more complex, there are no updates overloading the system, so those queries may well run faster (make sure the dist keys are right).
For example, a condition on the maximum timestamp per business key runs surprisingly fast (especially if your business key is your dist key, so it all runs in parallel).
This is the preferred solution.
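A sketch of this append-only pattern, assuming a hypothetical `customer_events` table with `customer_id` as the dist key and an `updated_at` timestamp:

```sql
-- New versions of a record are simply appended; no UPDATE is ever issued.
INSERT INTO customer_events (customer_id, status, updated_at)
VALUES (42, 'shipped', GETDATE());

-- Queries then pick the latest row per business key instead of relying
-- on the table holding exactly one row per key.
SELECT *
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY customer_id
                              ORDER BY updated_at DESC) AS rn
    FROM customer_events
) latest
WHERE rn = 1;
```

With `customer_id` as the dist key, each node already holds all rows for a given key, so the "latest row" filter runs in parallel without redistributing data.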
Perform updates in batches.
If your update applies to a range of rows, update them all at once with a single WHERE condition. Batches of about 1000 rows work well, though your mileage may vary.
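For example, instead of issuing many single-row updates, one statement can cover the whole batch (table and column names are hypothetical):

```sql
-- One statement touching a whole range of rows costs roughly the same
-- as one single-row update, so do it once per batch.
UPDATE orders
SET status = 'archived'
WHERE order_date BETWEEN '2016-01-01' AND '2016-01-31'
  AND status = 'closed';
```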
Create a staging table into which you store the "new" rows, and then apply them with a single join-based update once that table holds at least 1000 rows.
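A sketch of that staging-table pattern (again, the table and column names are assumptions):

```sql
-- Staging table with the same structure as the target
CREATE TABLE orders_staging (LIKE orders);

-- Incoming changes are inserted into orders_staging instead of being
-- applied to orders one by one.

-- Once the staging table has accumulated enough rows, apply them in a
-- single join-based update and clear the staging table.
UPDATE orders
SET status = s.status
FROM orders_staging s
WHERE orders.order_id = s.order_id;

TRUNCATE orders_staging;
```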