
This answer suggests using AWS Data Pipeline but I'm wondering if there's a clever way to do it with my own machine and Python.

I've been using the psycopg2, boto3 and pandas libraries. Tables have 5 to 50 columns and a few million rows. My current method doesn't work that well with large data.

Tsingis

3 Answers


Guess I can show one of my own versions here as well, which is based on copy_expert in psycopg2:

import io
import psycopg2
import boto3

resource = boto3.resource('s3')
# Connection parameters (db, user, pw, host) are defined elsewhere.
conn = psycopg2.connect(dbname=db, user=user, password=pw, host=host)
cur = conn.cursor()

def copyFun(bucket, select_query, filename):
    # COPY ... TO STDOUT streams the result as CSV into the file-like
    # object passed to copy_expert (here an in-memory StringIO buffer).
    query = f"""COPY {select_query} TO STDOUT
            WITH (FORMAT csv, DELIMITER ',', QUOTE '"', HEADER TRUE)"""
    file = io.StringIO()
    cur.copy_expert(query, file)
    # Upload the buffered CSV to S3 as <filename>.csv
    resource.Object(bucket, f'{filename}.csv').put(Body=file.getvalue())
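
A call might look something like this (the bucket, schema and table names are placeholders); note that COPY accepts either a table name or a parenthesized query:

# Placeholder bucket and query; COPY takes a table name or a query in parentheses.
copyFun('my-bucket', '(select * from my_schema.example)', 'example')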
Tsingis
  • Thanks, this worked like a charm. Is there any way to gzip and store the file in S3 in case the file is large? – Tejaswa Aug 27 '19 at 11:24
    @Tejaswa Not exactly the answer but check this out for ideas https://stackoverflow.com/questions/41388555/psycopg2-copy-expert-how-to-copy-in-a-gzipped-csv-file – Tsingis Aug 27 '19 at 16:04
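
Following up on that comment thread, here is a minimal sketch of the gzip idea. The function name copy_gzip_to_s3 is hypothetical, and it reuses the cur and resource objects from the answer above:

import gzip
import io

def copy_gzip_to_s3(bucket, select_query, filename):
    # Same COPY ... TO STDOUT as in the answer above; the CSV text is then
    # encoded and gzip-compressed in memory before being written to S3.
    query = f"""COPY {select_query} TO STDOUT
            WITH (FORMAT csv, DELIMITER ',', QUOTE '"', HEADER TRUE)"""
    buffer = io.StringIO()
    cur.copy_expert(query, buffer)
    compressed = gzip.compress(buffer.getvalue().encode('utf-8'))
    resource.Object(bucket, f'{filename}.csv.gz').put(Body=compressed)

This still buffers the whole result (plus its compressed copy) in memory, so for very large tables a streamed multipart upload would be the next step.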

We do the following in our case. Performance-wise it's pretty fast, and it's a scheduled method rather than continuous streaming. I'm not 100% sure it's the wisest method, but it's definitely good speed-wise for scheduled data exports in CSV format that we eventually use for loading into a data warehouse; a Python sketch of the same two-step flow follows the steps below.

  • Using a shell script, we run a psql command to copy data to a local file on the EC2 app instance:

    psql [your connection options go here] -F, -A -c 'select * from my_schema.example' >example.csv

  • Then, using a shell script, we run an s3cmd command to put example.csv into the designated S3 bucket location:

    s3cmd put example.csv s3://your-bucket/path/to/file/
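
If you'd rather drive the same two steps from Python instead of a shell script, a rough sketch might look like this. The function name is hypothetical, and psql is assumed to pick up its connection options from the usual PG* environment variables:

import subprocess

def export_and_upload(query, local_path, s3_uri):
    # Step 1: dump the query result to a local CSV with psql
    # (connection options come from PGHOST/PGUSER/PGPASSWORD etc.).
    with open(local_path, 'w') as f:
        subprocess.run(['psql', '-F,', '-A', '-c', query],
                       stdout=f, check=True)
    # Step 2: push the file to S3 with s3cmd.
    subprocess.run(['s3cmd', 'put', local_path, s3_uri], check=True)

export_and_upload('select * from my_schema.example',
                  'example.csv',
                  's3://your-bucket/path/to/file/')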

Red Boy

This is an old question, but it comes up when searching for "aws_s3.export_query_to_s3", even though there is no mention of it here, so I thought I'd throw another answer out there.

This can be done natively with a Postgres extension if you're using AWS Aurora PostgreSQL 11.6 or above, via aws_s3.query_export_to_s3.

Exporting data from an Aurora PostgreSQL DB cluster to Amazon S3 https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/postgresql-s3-export.html

See here for the function reference: https://docs.amazonaws.cn/en_us/AmazonRDS/latest/AuroraUserGuide/postgresql-s3-export.html#postgresql-s3-export-functions

This has been available in Aurora PostgreSQL since 3.1.0, which was released on February 11, 2020 (I don't know why this URL says 2018): https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraPostgreSQL.Updates.20180305.html#AuroraPostgreSQL.Updates.20180305.310

I would not recommend using 3.1.0/11.6, however: there is a bug that causes data corruption issues after 10 MB of data is exported to S3: https://forums.aws.amazon.com/thread.jspa?messageID=962494

I just tested with 3.3.1, from September 17, 2020, and the issue isn't present, so anyone who wants a way to dump data from Postgres to S3 and is on AWS should give this a try!

Here's an example query to create JSONL for you.

JSONL is JSON, with a single JSON object per line: https://jsonlines.org/

So you can dump a whole table to a JSONL file, for example. You could also use json_agg in Postgres and dump a single JSON file with the objects in an array; it's up to you, really. Just change the query and the file extension, and leave it in text format.

select * from aws_s3.query_export_to_s3(
    'select row_to_json(data) from (<YOUR QUERY HERE>) data',
    aws_commons.create_s3_uri(
        'example-bucket/some/path',
        'whatever.jsonl',
        'us-east-1'),
    options :='format text');

For CSV, something like this should do the trick:

select * from aws_s3.query_export_to_s3(
    '<YOUR QUERY HERE>',
    aws_commons.create_s3_uri(
        'example-bucket/some/path',
        'whatever.csv',
        'us-east-1'),
    options :='format csv');
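
Since the question was about driving this from Python: the export runs entirely inside Aurora, so from psycopg2 you only need to execute the function call. A minimal sketch (connection details, bucket and key names are placeholders) might look like this:

import psycopg2

# Placeholder connection details for the Aurora cluster.
conn = psycopg2.connect(dbname='mydb', user='user', password='pw', host='aurora-host')

export_sql = """
    select * from aws_s3.query_export_to_s3(
        'select * from my_schema.example',
        aws_commons.create_s3_uri('example-bucket/some/path', 'example.csv', 'us-east-1'),
        options := 'format csv');
"""

with conn, conn.cursor() as cur:
    cur.execute(export_sql)
    # The function returns (rows_uploaded, files_uploaded, bytes_uploaded).
    print(cur.fetchone())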
Gordo