I need suggestions/help on how to automate archiving of Redshift tables (which are too large to handle) to an AWS S3 bucket. In my case we want to keep the last 90 days of data in the Redshift table and unload everything older to the S3 bucket. We are looking to run the job once a month, and retention will be based on one of the date columns in the Redshift table. Any help would be great. I'm looking for a Python script.
2 Answers
So I believe the AWS way of handling the problem is to use Data Pipeline. These are jobs that can be scheduled, so they are very cost effective. I believe this is similar to scheduling-data-extraction-from-aws-redshift

- Thanks, David, for sharing the Data Pipeline concept; as I am new to this technology I will do some POC to see whether it works. I was thinking of writing a Python script which will unload a specific table and copy data that is over 90 days old. I am also stuck on how to make a directory structure in S3 like year -> month, so that if I run my job once a day it will go into that year -> month folder, making it easy to handle and recall the data later on. – deepak Singh Jan 18 '19 at 20:46
- The Python script on a scheduled Lambda can also work. When you make the call to S3 to save your item you specify a key, i.e. /2019/1/19/filename; any paths that do not exist in the bucket will be created as a folder structure (see the sketch after these comments). Good luck – David Webster Jan 19 '19 at 06:37
- Could you please share a sample script which performs this? It would be of great help to me in understanding the concept. – deepak Singh Jan 19 '19 at 20:46
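In case a concrete illustration of that key layout helps, here is a minimal boto3 sketch; the bucket and file names are made-up placeholders, and it assumes the data has already been extracted to a local file:

```python
import datetime

import boto3

# Hypothetical bucket and file names -- replace with your own.
BUCKET = "my-archive-bucket"
LOCAL_FILE = "unloaded_rows.csv"

s3 = boto3.client("s3")

# Build a year/month/day prefix from today's date. S3 has no real folders,
# so writing an object under this key is enough for the console to show
# the prefixes as a folder hierarchy.
today = datetime.date.today()
key = f"{today.year}/{today.month:02d}/{today.day:02d}/{LOCAL_FILE}"

s3.upload_file(LOCAL_FILE, BUCKET, key)
```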
You should use the UNLOAD command in Amazon Redshift to save data to Amazon S3.
From Python, you can use a library such as psycopg2 to connect to Redshift, because it behaves like a PostgreSQL database.
Take a look at How to Load Data into Amazon Redshift via Python Boto3? for an example of running Redshift commands from Python. (That answer shows a COPY command, but it's the same method of connecting.)
See also: Access your data in Amazon Redshift and PostgreSQL with Python and R - Blendo
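As a rough illustration of that connection step, here is a minimal psycopg2 sketch; the endpoint, database name, and credentials below are placeholders:

```python
import psycopg2

# Placeholder endpoint and credentials -- replace with your cluster's details.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="mydb",
    user="admin",
    password="secret",
)

with conn.cursor() as cur:
    # Any SQL that Redshift accepts can be run through this cursor,
    # including UNLOAD, COPY, DELETE, and ordinary SELECTs.
    cur.execute("SELECT current_date;")
    print(cur.fetchone())

conn.close()
```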
You will need to figure out the commands to run, such as:
- Selecting relevant data to unload
- Deleting the data once it has been unloaded
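Putting those pieces together, one possible shape for the monthly job, as a sketch rather than a finished script; the table name, date column, S3 prefix, IAM role, and connection details are all assumptions to adapt:

```python
import datetime

import psycopg2

# Hypothetical names -- adjust the table, date column, bucket, role, and connection.
TABLE = "events"
DATE_COLUMN = "event_date"
IAM_ROLE = "arn:aws:iam::123456789012:role/RedshiftUnloadRole"

cutoff = datetime.date.today() - datetime.timedelta(days=90)
# Year/month prefix so each run lands in its own "folder" in the bucket.
s3_path = f"s3://my-archive-bucket/{cutoff.year}/{cutoff.month:02d}/{TABLE}_"

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="mydb",
    user="admin",
    password="secret",
)

with conn.cursor() as cur:
    # 1. Unload every row older than the 90-day cutoff to S3.
    #    Single quotes inside the quoted SELECT are escaped by doubling them.
    cur.execute(
        f"""
        UNLOAD ('SELECT * FROM {TABLE} WHERE {DATE_COLUMN} < ''{cutoff}''')
        TO '{s3_path}'
        IAM_ROLE '{IAM_ROLE}'
        GZIP ALLOWOVERWRITE
        """
    )
    # 2. Only after the unload succeeds, delete the same rows.
    cur.execute(f"DELETE FROM {TABLE} WHERE {DATE_COLUMN} < %s", (cutoff,))

conn.commit()
conn.close()
```

Whether you also run VACUUM afterwards to reclaim the space from the deleted rows, and whether you verify the unloaded files in S3 before committing the DELETE, is worth deciding up front.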
A common practice is to place data in monthly tables, e.g. a table for January, another table for February, etc. Then, use CREATE VIEW to create a combined view of those tables with UNION commands. The view can then be used like a normal table. Later, when you wish to archive some data, unload the oldest table to S3, delete it, and update the view. That way, you can archive a whole month without having to delete individual rows.
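A sketch of that rotation, again with made-up table, view, bucket, and role names, and using UNION ALL rather than UNION so that rows are not deduplicated across months:

```python
import psycopg2

# Hypothetical monthly tables, combined view, and archive location.
MONTHLY_TABLES = ["events_2019_01", "events_2019_02", "events_2019_03"]
OLDEST = MONTHLY_TABLES[0]
VIEW = "events"
S3_PATH = "s3://my-archive-bucket/2019/01/events_"
IAM_ROLE = "arn:aws:iam::123456789012:role/RedshiftUnloadRole"

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="mydb",
    user="admin",
    password="secret",
)

with conn.cursor() as cur:
    # 1. Archive the oldest monthly table to S3.
    cur.execute(
        f"UNLOAD ('SELECT * FROM {OLDEST}') TO '{S3_PATH}' IAM_ROLE '{IAM_ROLE}' GZIP"
    )
    # 2. Rebuild the view without the oldest table first, so the table
    #    can be dropped without a dependency error, then drop it.
    union_sql = " UNION ALL ".join(f"SELECT * FROM {t}" for t in MONTHLY_TABLES[1:])
    cur.execute(f"CREATE OR REPLACE VIEW {VIEW} AS {union_sql}")
    cur.execute(f"DROP TABLE {OLDEST}")

conn.commit()
conn.close()
```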
