I want to track username changes over time.

I have the following users table in Redshift:

id     username     valid_from     valid_to     current    
--------------------------------------------------------
1      joe1         2015-01-01     2015-01-15   No
1      joe2         2015-01-15     NULL         Yes
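
For reference, here's a rough DDL for this table (the column types are just my guesses):

    CREATE TABLE users (
        id         integer     NOT NULL,
        username   varchar(64) NOT NULL,
        valid_from timestamp   NOT NULL,
        valid_to   timestamp,             -- NULL while the row is current
        "current"  varchar(3)             -- quoted in case CURRENT is treated as a reserved word
    );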

My source data is from RDS Postgres. I'm weighing several options for how to handle this:

1) Create a users_history table and start tracking changes inside the RDS Postgres db. This requires making changes to my app, and the table can potentially get huge. (A trigger-based sketch of this is shown after the list.)

2) Have an ETL process query the users source table every 5 minutes or so looking for new changes (ordered by updated_at) and dump them to DynamoDB.

3) Have an ETL process dump the data to S3, then COPY it into a temporary table inside Redshift and run the update queries there.
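
For option 1, here's a minimal sketch of what trigger-based tracking could look like in Postgres (the history table, function, and trigger names are my own placeholders):

    CREATE TABLE users_history (
        id         integer   NOT NULL,
        username   text      NOT NULL,
        valid_from timestamp NOT NULL,
        valid_to   timestamp,            -- NULL for the current row
        PRIMARY KEY (id, valid_from)
    );

    -- Close out the old row and open a new one whenever username changes.
    -- (An AFTER INSERT trigger would be needed to record the initial
    -- username when a user is created; omitted here.)
    CREATE OR REPLACE FUNCTION track_username_change() RETURNS trigger AS $$
    BEGIN
        IF NEW.username IS DISTINCT FROM OLD.username THEN
            UPDATE users_history
               SET valid_to = now()
             WHERE id = OLD.id AND valid_to IS NULL;

            INSERT INTO users_history (id, username, valid_from)
            VALUES (NEW.id, NEW.username, now());
        END IF;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER users_username_history
        AFTER UPDATE ON users
        FOR EACH ROW EXECUTE PROCEDURE track_username_change();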

Can you advise on what is scalable and easily maintainable in the long run? Keep in mind that these tables can be massive, and I'll be tracking SCDs (slowly changing dimensions) for many tables.

Thanks.

Update 1: I chatted with AWS support and they showed me this, which seems like a good solution: http://docs.aws.amazon.com/redshift/latest/dg/merge-specify-a-column-list.html
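
Here's a rough sketch of how that staging-table merge would apply to my users table as an SCD type 2 update (the staging schema, S3 path, and IAM role are placeholders, and it assumes at most one change per user per load):

    -- Stage the changed rows extracted from RDS
    CREATE TEMP TABLE users_staging (
        id         integer,
        username   varchar(64),
        updated_at timestamp
    );

    COPY users_staging
    FROM 's3://my-bucket/users/changes/'
    CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/RedshiftCopyRole'
    DELIMITER '|';

    BEGIN TRANSACTION;

    -- Close out the current row for users whose username changed
    UPDATE users
       SET valid_to = s.updated_at, "current" = 'No'
      FROM users_staging s
     WHERE users.id = s.id
       AND users."current" = 'Yes'
       AND users.username <> s.username;

    -- Insert a new current row; after the UPDATE above, both changed
    -- and brand-new users have no 'Yes' row left, so an anti-join works
    INSERT INTO users (id, username, valid_from, valid_to, "current")
    SELECT s.id, s.username, s.updated_at, NULL, 'Yes'
      FROM users_staging s
      LEFT JOIN users u
        ON u.id = s.id AND u."current" = 'Yes'
     WHERE u.id IS NULL;

    END TRANSACTION;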

Kien Pham

1 Answer

In terms of SQL/ETL implementation, Redshift supports anything that an RDS database will support, so your decision should be based on the constraints and expectations you have for the database.

Redshift is a read-optimized system, so updates every few minutes will likely slow it down for query purposes. (Micro-ETLs are not recommended on Redshift.)

On the other hand, if you are likely to have huge tables, Redshift will perform better than most row-store databases (like MySQL, Postgres, etc.). This performance gap will widen as your data grows, since Redshift is designed for bigger scales than traditional systems.

Paladin
  • Hi @Paladin, yeah, I get that about Redshift, do things in batches. But what I don't know is how to handle those micro-updates (updating data already in RS). Do people UNLOAD, do those updates outside of RS, and COPY the results back? The merge solution posted in my update seems slow. – Kien Pham Dec 18 '15 at 07:10
  • Some people have implemented a hybrid solution to get around micro-loads. They do ETL in RDS (MySQL/Postgres, etc.) and, on a daily basis, load everything into Redshift tables. All the reports/analytics run on Redshift, where it is generally acceptable to have day-old data. Redshift is not an OLTP system, so expecting real-time updates from it is not recommended. – Paladin Dec 19 '15 at 17:13
  • In my case, how do I do ETL outside of Redshift? For example: in Redshift I have the users table, which has recorded usernames for the last 6 months. Today, 10 users decided to change their usernames, and I want to keep track of this change. Are you saying I should track all these changes outside of Redshift, wipe the users table in Redshift, and then overwrite it with the new data? – Kien Pham Dec 20 '15 at 07:29
  • Not necessarily overwrite the full table; you can also do a merge. You can have the same table in a MySQL RDS and do the updates/ETL throughout the day there. At the end of the day, merge it with the Redshift table, which brings the two tables in sync. Repeat the same the next day. – Paladin Dec 20 '15 at 08:04
  • How do we go about this EOD merge? Can an ETL tool like Pentaho be useful? – Sarang Manjrekar Nov 25 '16 at 06:13