If you're open to doing it in Python, here's an example that should work:
import boto
import gzip
import psycopg2
import tempfile
# database connection setup
connection = psycopg2.connect('postgresql://scott:tiger@localhost/mydatabase')
connection.autocommit = True
cursor = connection.cursor()
# aws connection setup
s3_connection = boto.connect_s3('<aws access key>', '<aws secret key>')
bucket = s3_connection.get_bucket('<bucket>')
with tempfile.NamedTemporaryFile() as t:
    # stream the query results straight into a gzip-compressed temp file
    with gzip.GzipFile(t.name, mode='wb') as g:
        cursor.copy_expert("COPY ({0}) TO STDOUT WITH CSV HEADER".format('<select_query>'), g)
    # upload the compressed file to S3
    key = boto.s3.key.Key(bucket, '<s3_key>')
    key.set_contents_from_filename(g.name)
This process makes use of Python's tempfile module, which lets you create a file that is used and then removed within the process. The context manager (with tempfile.NamedTemporaryFile() as t:) handles the cleanup for you, so you don't have to delete the file manually. Depending on how you set up the tempfile, you can make the file accessible to other system users or keep it from ever being visible to them. Basically, you're streaming the output of the SELECT statement to STDOUT and then writing STDOUT to a tempfile. You're still beholden to your database for the SELECT itself in terms of memory management, speed and access.
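For instance, here's a minimal sketch of those two extremes; the directory, suffix, and dummy payload below are just illustrative choices, not part of the setup above:

import gzip
import tempfile

# delete=False keeps the file on disk after the block so another process or
# user can pick it up later (you then have to remove it yourself); dir and
# suffix are arbitrary choices here.
with tempfile.NamedTemporaryFile(delete=False, dir='/var/tmp', suffix='.csv.gz') as t:
    path_for_later = t.name

# At the other extreme, tempfile.TemporaryFile has no name on disk at all
# (on POSIX the directory entry is removed immediately), so it's never visible
# to other users; gzip can still write into it via fileobj=.
with tempfile.TemporaryFile() as t:
    with gzip.GzipFile(fileobj=t, mode='wb') as g:
        g.write(b'example data')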
The benefit is that you don't need to keep the entire file in memory while trying to transfer it to S3; the drawbacks are that you need enough disk space to temporarily store the file, and that it's obviously slower because you're writing to disk as opposed to doing the entire thing in memory.
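For comparison, here's a rough sketch of the all-in-memory variant, reusing the same cursor, bucket, and placeholder query as above, so nothing is written to disk:

import gzip
import io

# build the gzipped CSV entirely in memory
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as g:
    cursor.copy_expert("COPY ({0}) TO STDOUT WITH CSV HEADER".format('<select_query>'), g)

# rewind the buffer and upload straight from it
buf.seek(0)
key = boto.s3.key.Key(bucket, '<s3_key>')
key.set_contents_from_file(buf)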
The other thing to note is that I kept the step where Python compresses the file with gzip before uploading. I did that to save space on upload; it's especially useful if you're uploading a table with a lot of repetitive data.
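If you want to tune that trade-off, gzip.GzipFile also takes a compresslevel argument (1 is fastest, 9 is the default and produces the smallest output); for example, the inner with line from the script above could become:

# same t and cursor as in the script above; only compresslevel is new
with gzip.GzipFile(t.name, mode='wb', compresslevel=6) as g:
    cursor.copy_expert("COPY ({0}) TO STDOUT WITH CSV HEADER".format('<select_query>'), g)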
As an aside: you should not use this as-is in an environment where you're open to SQL injection; there are better ways of generating the COPY command if that's part of your use case.
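For example, if the table or column names come from runtime input, psycopg2's sql module can compose the COPY statement safely. Here's a minimal sketch; table_name and columns are hypothetical inputs, and connection, cursor, and g are the same objects as in the script above:

from psycopg2 import sql

table_name = 'my_table'          # hypothetical input
columns = ['id', 'created_at']   # hypothetical input

# identifiers are quoted by sql.Identifier, so arbitrary names can't break out
# of the statement the way plain string formatting would allow
query = sql.SQL("COPY (SELECT {fields} FROM {table}) TO STDOUT WITH CSV HEADER").format(
    fields=sql.SQL(', ').join(sql.Identifier(c) for c in columns),
    table=sql.Identifier(table_name),
)
cursor.copy_expert(query.as_string(connection), g)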