0

I'm going to run a query that returns a huge table (about 700 MB) from Redshift and save it to CSV using SQLAlchemy and Python 2.7 on my local machine (Mac Pro). I've never done this with such a huge query before, and obviously there could be memory and other issues.

My question is: what should I take into account, and how should I use SQLAlchemy to make the process work?

Thanks, Alex

user912830823

2 Answers

3

Ideally, you'd use the UNLOAD command, which writes the result to S3 storage; you can then download the files to your local machine. Here is an example:

UNLOAD ('select * from my_table')
  TO 's3://bucket_name/path/to/my_filename_prefix'
  WITH CREDENTIALS
    'aws_access_key_id=<my_access_key>;
    aws_secret_access_key=<my_secret_key>'
  MANIFEST
  GZIP
  ALLOWOVERWRITE
  ESCAPE
  NULL AS '\\N'

Reference: http://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD_command_examples.html
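If you want to drive this from Python rather than a SQL client, here is a minimal sketch (assuming the sqlalchemy-redshift dialect and boto3 are installed; the connection URL, bucket, and prefix are placeholders) that runs the UNLOAD through SQLAlchemy and then downloads the resulting files:

import os

import boto3
import sqlalchemy as sa

# placeholder connection URL; requires the sqlalchemy-redshift dialect
engine = sa.create_engine('redshift+psycopg2://user:password@host:5439/dbname')

unload_sql = sa.text(r"""
    UNLOAD ('select * from my_table')
    TO 's3://bucket_name/path/to/my_filename_prefix'
    WITH CREDENTIALS
      'aws_access_key_id=<my_access_key>;aws_secret_access_key=<my_secret_key>'
    MANIFEST
    GZIP
    ALLOWOVERWRITE
    ESCAPE
    NULL AS '\\N'
""")

# engine.begin() runs the statement and commits when the block exits
with engine.begin() as conn:
    conn.execute(unload_sql)

# download the gzipped part files (and the manifest) that UNLOAD produced
s3 = boto3.client('s3')
bucket = 'bucket_name'
prefix = 'path/to/my_filename_prefix'
for obj in s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get('Contents', []):
    key = obj['Key']
    s3.download_file(bucket, key, os.path.basename(key))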

If a local query is the only or preferred option, here are two things to limit memory usage:

  1. Minimize operations

  2. Loop through the query results one row at a time instead of fetching everything at once (see the streaming sketch just below this list)
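One caveat with option 2: with the psycopg2 driver (the usual choice for Redshift), a plain fetchone() loop can still buffer the whole result set on the client side. Here is a hedged sketch of asking SQLAlchemy for a server-side cursor via the stream_results execution option, with a placeholder connection URL and query:

import sqlalchemy as sa

engine = sa.create_engine('redshift+psycopg2://user:password@host:5439/dbname')

with engine.connect() as conn:
    # stream_results makes psycopg2 use a server-side cursor,
    # so rows arrive in batches instead of all at once
    conn = conn.execution_options(stream_results=True)
    result = conn.execute(sa.text('select * from my_table'))
    for row in result:
        pass  # process/write each row here without holding them all in memory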

Here is an example snippet for the case where you need to edit each row before writing it:

import csv

result = conn.execute(s)
row = result.fetchone()

with open(outfile, 'ab') as f:
    writer = csv.writer(f)
    while row is not None:
        # apply your edits to the row here, then write it out
        writer.writerow(row)
        row = result.fetchone()

Otherwise just write all rows:

import csv

with open(outfile, 'wb') as f:
    writer = csv.writer(f)
    # dump column titles (optional)
    writer.writerow(result.keys())
    # dump rows (fetchall() loads the whole result into memory at once)
    writer.writerows(result.fetchall())

snippet credit: https://stackoverflow.com/a/2952829/7853322

brddawg
0

If you don't run much else on that machine, memory should not be an issue. Give it a try and monitor memory use during the execution. Also watch the load average (e.g. with uptime or top) to see how much pressure is on the system.
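For example, a rough way to check this from inside the script with the standard library (note that ru_maxrss is reported in bytes on macOS but kilobytes on Linux):

import os
import resource

# peak resident memory of this process so far
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
# 1-, 5- and 15-minute system load averages
load1, load5, load15 = os.getloadavg()
print 'peak rss: %s, load: %.2f %.2f %.2f' % (peak, load1, load5, load15)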