1

I'm working on a Spring project that needs to export Redshift table data into a single local CSV file. The current approach is to:

  1. Execute a Redshift UNLOAD via JDBC to write the data across multiple files in S3
  2. Download those files from S3 to the local machine
  3. Join them together into one single CSV file (steps 2 and 3 are sketched in code further below)
The UNLOAD statement (built with Spring SpEL placeholders) looks like this:

```sql
UNLOAD (
  'SELECT DISTINCT #{#TYPE_ID}
  FROM target_audience
  WHERE #{#TYPE_ID} is not null
  AND #{#TYPE_ID} != \'\'
  GROUP BY #{#TYPE_ID}'
)
TO '#{#s3basepath}#{#s3jobpath}target_audience#{#unique}_'
credentials 'aws_access_key_id=#{#accesskey};aws_secret_access_key=#{#secretkey}'
DELIMITER AS ',' ESCAPE GZIP;
```
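
Steps 2 and 3 are essentially the following (a simplified sketch; the bucket name, key prefix, and output path are placeholders), using the AWS SDK for Java to pull down each UNLOAD part, decompress it, and append it to one local CSV:

```java
// Simplified sketch of steps 2 and 3; bucket, prefix, and output path are placeholders.
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ListObjectsV2Result;
import com.amazonaws.services.s3.model.S3ObjectSummary;

import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;

public class UnloadMerger {

    public static void merge(String bucket, String prefix, String localCsv) throws Exception {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        try (OutputStream out = Files.newOutputStream(Paths.get(localCsv))) {
            // A single listObjectsV2 call returns at most 1000 keys; enough for this sketch.
            ListObjectsV2Result parts = s3.listObjectsV2(bucket, prefix);
            for (S3ObjectSummary part : parts.getObjectSummaries()) {
                // Each UNLOAD part is gzip-compressed (GZIP option), so decompress while appending.
                try (InputStream in = new GZIPInputStream(
                        s3.getObject(bucket, part.getKey()).getObjectContent())) {
                    byte[] buf = new byte[8192];
                    int n;
                    while ((n = in.read(buf)) != -1) {
                        out.write(buf, 0, n);
                    }
                }
            }
        }
    }
}
```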

The above approach works fine, but I think the overall performance could be improved by, for example, skipping S3 and getting the data directly from Redshift to the local machine.

After searching through online resources, I found that you can export data from Redshift directly through psql, or run SELECT queries and move the result data myself. But neither option can match the performance of Redshift UNLOAD with parallel writing.
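
For reference, the SELECT-and-move-it-myself alternative would look roughly like the sketch below (connection details, query, and output path are placeholders; setFetchSize is what keeps the driver from buffering the whole result set in memory):

```java
// Minimal sketch of the plain SELECT-over-JDBC alternative; URL, query, and path are placeholders.
import java.io.BufferedWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SelectToCsv {

    public static void export(String jdbcUrl, String user, String password,
                              String query, String localCsv) throws Exception {
        try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
             BufferedWriter writer = Files.newBufferedWriter(Paths.get(localCsv))) {

            conn.setAutoCommit(false);             // cursor-based fetching needs autocommit off
            try (Statement stmt = conn.createStatement()) {
                stmt.setFetchSize(10_000);         // stream rows in batches instead of loading them all

                try (ResultSet rs = stmt.executeQuery(query)) {
                    int cols = rs.getMetaData().getColumnCount();
                    while (rs.next()) {
                        StringBuilder line = new StringBuilder();
                        for (int i = 1; i <= cols; i++) {
                            if (i > 1) line.append(',');
                            line.append(rs.getString(i));   // no CSV quoting/escaping in this sketch
                        }
                        writer.write(line.toString());
                        writer.newLine();
                    }
                }
            }
        }
    }
}
```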

So is there any way I can mimic UNLOAD's parallel writing to achieve the same performance without having to go through S3?

Viet Hoang
  • There are only two ways to get data out of Redshift, execute a SQL query or unload to S3. If you don't want to use S3 then your only option is to run a query and write the result to a file in your code. The speed of that approach will depend on the query being run, the resources of the local machine etc. – Nathan Griffiths May 25 '18 at 04:20
  • My service is running on an `m4.large` EC2 instance. The query selects ~60 million rows. I've tried the plain SQL query and the machine ran out of memory, so I don't think it's the right option. – Viet Hoang May 25 '18 at 07:58
  • OK, so it sounds like you've answered your own question, although I'm surprised that an m4.large instance can't handle a 60 million row output. Perhaps you can try batching the query to return smaller subsets of data? If there's a way of querying different chunks of the result set then maybe you can initiate multiple simultaneous queries and append the results together on the client side. Also check the way the data in the target_audience table is distributed in Redshift to make sure it's optimal for your query and that the data distribution isn't skewed to a subset of nodes. – Nathan Griffiths May 27 '18 at 04:25

2 Answers

0

You can avoid the need to join files together by using UNLOAD with the PARALLEL OFF parameter. It will output only one file.

This will, however, create multiple files if the file size exceeds 6.2 GB.

See: UNLOAD - Amazon Redshift

It is doubtful that you would get better performance by running psql, but if performance is important for you then you can certainly test the various methods.
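
Since the UNLOAD in the question is already issued over JDBC, switching off parallel output is just one extra clause on the same statement. A rough sketch (JDBC URL, S3 path, and credentials are placeholders, and the query is simplified):

```java
// Sketch: the same UNLOAD issued over JDBC, with PARALLEL OFF so Redshift writes a single file
// (up to the 6.2 GB limit). JDBC URL, S3 path, and credentials are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SingleFileUnload {

    public static void run(String jdbcUrl, String user, String password) throws Exception {
        String unload =
            "UNLOAD ('SELECT DISTINCT type_id FROM target_audience WHERE type_id IS NOT NULL') " +
            "TO 's3://my-bucket/my-job/target_audience_' " +
            "CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...' " +
            "DELIMITER AS ',' ESCAPE GZIP " +
            "PARALLEL OFF;";

        try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
             Statement stmt = conn.createStatement()) {
            stmt.execute(unload);
        }
    }
}
```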

John Rotenstein
  • 1
    Note that UNLOAD normally creates one file per slice in the cluster and these files are created in parallel. If you set PARALLEL OFF then each slice has to write out its data to a single file one at a time, which is likely to be *much* slower than leaving parallel on. And as John mentioned, if the file size exceeds 6.2 GB then you will end up with multiple files anyway. – Nathan Griffiths May 27 '18 at 04:18
0

We do exactly the same thing as you're trying to do here. In our performance comparison it turned out to be almost the same, or in some cases even better, for our use case. It is also easier to program and debug, since there is practically only one step.

```bash
# replace user/password, host, region, and dbname appropriately in the command below
psql "postgresql://user:password@xxx1.xxxx.us-region-1.redshift.amazonaws.com:5439/dbname?sslmode=require" -c "select C1,C2 from sch1.tab1" > ABC.csv
```

This enables us to avoid three steps:

  1. Unload using JDBC
  2. Download the exported data from S3
  3. Decompress the gzip files (we used compression to save network I/O)

It also saves some cost (S3 storage, though that is negligible). By the way, from psql 9.0 onwards, sslcompression is on by default.
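
If you want to drive this from the Spring service rather than a shell script, the same command can be launched with ProcessBuilder. A rough sketch (connection URI, query, and output file are placeholders; psql must be installed on the host):

```java
// Rough sketch: launching the psql export from Java. The connection URI, query, and
// output file are placeholders, and psql must be available on the host's PATH.
import java.io.File;

public class PsqlExport {

    public static void export(String connectionUri, String query, String outputCsv) throws Exception {
        ProcessBuilder pb = new ProcessBuilder("psql", connectionUri, "-c", query);
        pb.redirectOutput(new File(outputCsv));              // same effect as "> ABC.csv"
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);   // surface psql errors in our logs

        Process process = pb.start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IllegalStateException("psql exited with code " + exitCode);
        }
    }
}
```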

Red Boy
  • Thank you for the answer! Can you share some of the details, like the system you used and the size of your dataset? – Viet Hoang May 25 '18 at 08:04
  • @VietHoang, I use Redshift (DW) and MySQL RDS (OLTP). My fact table has 500 million records, with 7 dimension tables. Every export `select` query varies from 5M to 7M records, with a data size of 30 MB ~ 50 MB. I import the Redshift exported data into MySQL RDS. I hope this answers your question. The overall process takes 70~90 seconds. – Red Boy May 25 '18 at 08:43